Note--Monitoring

Logs, metrics, and traces are often referred to as the "Three Pillars of Observability."

A robust monitoring system should clearly define what to monitor and in what units (metrics). It should also set threshold values for all metrics and be capable of alerting the appropriate stakeholders when values exceed acceptable ranges. Monitoring systems can assist support teams by collecting measurements, displaying data, and issuing warnings when something appears abnormal.

Logs

Logs are time-stamped records of discrete events/messages generated by various applications, services, and components in your system. They provide a qualitative view of the backend's behavior and can typically be split into three types:

Application Logs: Messages logged by your server code, databases, and other components.

System Logs: Generated by system-level components such as the operating system and hardware devices. They record events like disk errors.

Network Logs: Generated by routers, load balancers, firewalls, and other network components. They provide information about network activity such as packet drops, connection status, traffic flow, and more.

Logs are usually in plaintext, but they can also be in a structured format like JSON. They are often kept in a searchable store such as Elasticsearch.

However, the issue with relying solely on logs is that they can be extremely noisy, and it can be challenging to infer any higher-level picture of the state of your system from them.

Severity levels (from least to most severe):

DEBUG

INFO

WARNING

ERROR

FATAL/CRITICAL
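A minimal sketch of how severity levels are typically used to cut noise: messages below a configured threshold are dropped. The `Severity` enum and `LeveledLogger` class are hypothetical names for illustration, not taken from any particular logging library.

```java
import java.util.ArrayList;
import java.util.List;

// Declared from least to most severe, so enum ordering doubles as severity ordering.
enum Severity { DEBUG, INFO, WARNING, ERROR, FATAL }

// Hypothetical in-memory logger that keeps only messages at or above a threshold.
final class LeveledLogger {
    private final Severity threshold;
    private final List<String> sink = new ArrayList<>();

    LeveledLogger(Severity threshold) {
        this.threshold = threshold;
    }

    // Records the message only if its level is at or above the threshold.
    void log(Severity level, String message) {
        if (level.compareTo(threshold) >= 0) {
            sink.add(level + " " + message);
        }
    }

    // Returns a copy of everything that passed the filter.
    List<String> lines() {
        return new ArrayList<>(sink);
    }
}
```

With a threshold of WARNING, a DEBUG message is discarded while an ERROR message is kept.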

Architecture

Requirements:

Writing Logs: The services of the distributed system must be able to write into the logging system.

Searchable Logs: It should be easy to search the logs and to follow an application's flow from end to end.

Storing Logs: The logs should be stored in distributed storage for easy access.

Centralized Logging Visualizer: The system should provide a unified view of globally separated services.

Low Latency: Logging is an I/O-intensive operation that is often much slower than CPU operations. To ensure that logging does not impede application performance, it should run on a path separate from request handling (for example, write asynchronously and buffer log data in RAM before flushing).

Scalability: Our logging system should be able to handle increasing amounts of logs and a growing number of concurrent users.

Availability: The logging system should be highly available to ensure that data is logged reliably (for example, by adding redundant log accumulators).
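The low-latency requirement can be sketched as a bounded in-memory buffer drained by a single background thread. This is only an illustration under assumptions: `AsyncLogger`, its capacity parameter, and the `sink` callback (standing in for the slow write to disk or to a log accumulator) are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch: callers enqueue lines in RAM; one background thread does the slow I/O.
final class AsyncLogger implements AutoCloseable {
    // Unique sentinel (compared by identity) that tells the writer thread to stop.
    private static final String STOP = new String("STOP");

    private final BlockingQueue<String> buffer;
    private final Thread writer;

    AsyncLogger(int capacity, Consumer<String> sink) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        this.writer = new Thread(() -> drain(sink), "log-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    // Called on the request path: never blocks; returns false if the buffer was full
    // and the line was dropped.
    boolean log(String line) {
        return buffer.offer(line);
    }

    // Runs on the writer thread: forwards buffered lines to the sink until STOP arrives.
    private void drain(Consumer<String> sink) {
        try {
            for (String line = buffer.take(); line != STOP; line = buffer.take()) {
                sink.accept(line);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Flushes whatever is still buffered, then stops the writer thread.
    @Override
    public void close() {
        try {
            buffer.put(STOP);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Dropping lines when the buffer is full is one possible trade-off; it favors application latency over completeness of the logs, and a real system might block briefly or count the drops instead.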

Avoid:

Avoid logging personally identifiable information (PII), such as names, addresses, emails, and so on.

Avoid logging sensitive information such as credit card numbers and passwords.

Avoid logging excessive information, as it takes up unnecessary space and affects performance.

Avoid an insecure logging mechanism: logs reveal the application's flow, so they are an attractive target for attackers and must be protected.
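One way to enforce the first two points is to scrub messages before they are written. The sketch below is a heuristic only: `LogRedactor` and its two regular expressions are assumptions made for illustration, and they will miss some PII and may mask unrelated long digit runs.

```java
import java.util.regex.Pattern;

// Hypothetical helper that masks e-mail addresses and card-like numbers in a log message.
final class LogRedactor {
    // Rough e-mail shape: local part, '@', domain, dot, top-level domain.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // 13 to 19 digits, optionally separated by single spaces or hyphens.
    private static final Pattern CARD =
            Pattern.compile("\\b\\d(?:[ -]?\\d){12,18}\\b");

    // Returns the message with matches replaced by fixed placeholders.
    static String redact(String message) {
        String withoutEmails = EMAIL.matcher(message).replaceAll("[EMAIL]");
        return CARD.matcher(withoutEmails).replaceAll("[CARD]");
    }
}
```

For example, `LogRedactor.redact("user bob@example.com paid with 4111 1111 1111 1111")` yields `user [EMAIL] paid with [CARD]`.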

Components:

Log Accumulator: An agent that collects logs from each node and stores them in a centralized location. This allows us to quickly access logs from any node without having to visit each node individually.

Storage: The accumulated logs must be stored somewhere. We will use blob storage to save our logs.

Log Indexer: As the number of log files increases, it becomes more difficult to search through them. The log indexer will use distributed search to make searching more efficient.

Visualizer: The visualizer provides a unified view of all the logs.

Filterer: It identifies the application and stores the logs in the blob storage reserved for that application, as we don't want to mix logs of two different applications.

Error aggregator: It is critical to identify an error quickly. We use a service that picks up the error messages from the pub-sub system and informs the respective client, saving us the trouble of searching the logs.

Alert aggregator: Alerts need to reach people early. This service identifies alert conditions and notifies the appropriate stakeholders if a fatal error is encountered, or sends a message to a monitoring tool.

Metrics

It is important to define what to measure and which units to use in order to gain insight into the state of a system at any given time. Common metrics include requests per second, latency, cache hits, and RAM usage. These metrics provide a quantitative view of the backend, including factors like response times, error rates, throughput, and resource utilization. They are typically stored in a time series database and can be used for statistical modeling and prediction. Calculating averages, percentiles, correlations, and other statistics can be especially useful for understanding system behavior over time.
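To make the point about averages and percentiles concrete, here is a small sketch over latency samples. `LatencyStats` is a hypothetical helper, and the nearest-rank method is just one of several common percentile definitions.

```java
import java.util.Arrays;

// Hypothetical helper computing simple statistics over latency samples in milliseconds.
final class LatencyStats {
    // Arithmetic mean of the samples.
    static double mean(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double sum = 0;
        for (long sample : samples) {
            sum += sample;
        }
        return sum / samples.length;
    }

    // Nearest-rank percentile: the smallest sample such that at least p percent
    // of all samples are less than or equal to it. Requires 0 < p <= 100.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }
}
```

For the samples 10, 20, 30, 40, and 1000 ms, the mean is 220 ms while the 50th percentile is 30 ms, which shows how a single slow request skews the average but not the median.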

However, both metrics and logs have their limitations. They are scoped to individual components or systems, meaning that if you need to understand what happened across the entire system during the lifetime of a request that traversed multiple services or components, you will need to do some additional work to join together logs and metrics from multiple sources.

Monitor Server-side Errors

Server-side errors are often easier to identify and address as the service has more visibility into the server environment. Monitoring systems can help identify trends by tracking CPU usage, memory usage, page load times, and throughput. Alerts can be set up to notify the appropriate stakeholders when values exceed the defined thresholds. Additionally, monitoring systems can be used to identify and debug errors, as they can provide detailed information about the state of the system at the time of the error.
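Threshold-based alerting can be sketched as a comparison between current readings and configured limits. `ThresholdAlerts`, the metric names, and the message format are hypothetical; a real system would also handle units, lower bounds, and alert routing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical evaluator: reports every metric whose current value exceeds its threshold.
final class ThresholdAlerts {
    static List<String> evaluate(Map<String, Double> thresholds, Map<String, Double> current) {
        List<String> alerts = new ArrayList<>();
        // TreeMap gives a stable, alphabetical order for the resulting alerts.
        for (Map.Entry<String, Double> limit : new TreeMap<>(thresholds).entrySet()) {
            Double value = current.get(limit.getKey());
            // Metrics without a current reading are skipped in this sketch.
            if (value != null && value > limit.getValue()) {
                alerts.add(limit.getKey() + "=" + value + " exceeds " + limit.getValue());
            }
        }
        return alerts;
    }
}
```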

Conventional approaches to handle failures in IT infrastructure

Approach	Reactive	Proactive
Definition	Taking corrective action after failure occurs	Taking preventive action before failure occurs
Result	Causes downtime	Prevents downtime and associated losses
Action	Handle failures after they occur	Predict and prevent failures before they occur
Reliability	Offers lower reliability	Offers better reliability
Goal	React quickly to minimize downtime	Find impending problems early on and design systems to make service faults invisible to end users

Collecting Method

Pull-based collection: the monitoring system actively requests data from the monitored systems at regular intervals.

Push-based collection: the monitored systems push their data to the monitoring system, so monitoring can be near real-time. However, because each microservice sends its own metrics, this can result in a high volume of traffic on the infrastructure.
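The difference between the two methods is mostly about who initiates the transfer. In this sketch, `MetricStore` is a hypothetical stand-in for the monitoring system: `push` is called by a monitored service, while `scrape` is called by the monitoring system itself on its own schedule.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical monitoring-side store that supports both collection styles.
final class MetricStore {
    private final Map<String, Double> latest = new HashMap<>();

    // Push path: a monitored service calls this whenever it has a new value.
    void push(String name, double value) {
        latest.put(name, value);
    }

    // Pull path: the monitoring system reads every registered target in one pass.
    // Each supplier stands in for a request to a service's metrics endpoint.
    void scrape(Map<String, Supplier<Double>> targets) {
        targets.forEach((name, read) -> latest.put(name, read.get()));
    }

    // Most recent value for the metric, or null if none has been collected.
    Double get(String name) {
        return latest.get(name);
    }
}
```

With pull, the monitoring system controls the request rate; with push, the total traffic grows with the number of services and how often each one reports.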

Monitor Client-side Errors

Errors caused by the client's system are difficult to address, as the service has limited visibility into the client's environment. We may attempt to identify a decrease in load compared to the average, but this can be difficult. False positives and false negatives may occur due to factors such as variable load or if only a small portion of the client base is affected.

The client side can try various collectors in different failure domains until one works. For last-mile errors, there is not much that can be done as a service provider; such events can be accumulated at the client side and reported once connectivity returns. The service can influence the remaining component failures.

Design with two components:

Agent: This is a probe embedded in the client application that sends service reports about any failures.

Collector: This is a report collector independent of the primary service. It is made independent to avoid situations where client agents want to report an error to the failed service. We summarize error reports from collectors and look for spikes in the errors graph to identify client-side issues.
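The agent side of this design might look like the sketch below: reports are queued locally, each collector is tried in turn, and anything that cannot be delivered stays queued for the next attempt. `ErrorReportAgent` is a hypothetical name, and each collector is modeled as a predicate that returns true when it accepts a report.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical client-side agent that reports failures to independent collectors.
final class ErrorReportAgent {
    // Collectors in different failure domains; each returns true if it accepted the report.
    private final List<Predicate<String>> collectors;
    // Reports that have not been delivered yet (for example, while offline).
    private final Deque<String> pending = new ArrayDeque<>();

    ErrorReportAgent(List<Predicate<String>> collectors) {
        this.collectors = new ArrayList<>(collectors);
    }

    // Queues the report, then tries to deliver everything that is queued.
    void report(String failure) {
        pending.addLast(failure);
        flush();
    }

    // Delivers queued reports in order; returns how many are still pending.
    int flush() {
        while (!pending.isEmpty()) {
            if (!sendToAnyCollector(pending.peekFirst())) {
                break; // no collector reachable: keep the report for the next attempt
            }
            pending.removeFirst();
        }
        return pending.size();
    }

    // Tries each collector until one accepts the report.
    private boolean sendToAnyCollector(String report) {
        for (Predicate<String> collector : collectors) {
            try {
                if (collector.test(report)) {
                    return true;
                }
            } catch (RuntimeException e) {
                // Treat an exception as "this collector is unavailable" and try the next one.
            }
        }
        return false;
    }
}
```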

Protect user privacy

For a browser-based client, we can avoid collecting the following information:

Traceroute hops along the path to the service, as users can be sensitive about their geographic location and hop data is akin to location information.

Details of DNS, which can leak information about the location.

Round-trip-time (RTT) and packet loss information.

Metrics in front end

Time to First Byte (TTFB)

TTFB measures the responsiveness of your web server. It is the time elapsed between the user making an HTTP request for a resource and receiving the first byte of the response. Most websites should strive for a TTFB of 0.8 seconds or less; values above 1.8 seconds are considered poor.

Your Time To First Byte (TTFB) is composed of:

DNS lookup

Establishing a TCP connection (or QUIC if it's HTTP/3)

Any redirects

Your backend processing the request

Sending the first packet from the server to the client

Mitigation methods:

Utilizing caching (CDN)

Edge computing (for example, serverless functions such as Cloudflare Workers)

First Contentful Paint (FCP)

First Contentful Paint (FCP) measures how long it takes for content to start appearing on a website. Specifically, it measures the time from when the page begins loading to when a part of the page's Document Object Model (DOM) – such as a heading, button, navbar, or paragraph – is rendered on the screen. A good FCP score is under 1.8 seconds, while a poor FCP score is over 3 seconds.

In the example Google search above, the first contentful paint happens in the second image frame, where the search bar is loaded.

Having a fast FCP means that users can quickly see that content is starting to load, rather than just staring at a blank screen.

Mitigation methods:

Removing any unnecessary render-blocking resources

Downloaded JS should be marked with async or defer attributes

The unimportant part of the CSS should be deferred.

Largest Contentful Paint (LCP)

LCP (Largest Contentful Paint) measures how quickly the main content of a website loads. The main content is defined as the largest image or text block visible within the viewport. A good LCP score is under 2.5 seconds, while a poor LCP score is over 4 seconds.

Similar to FCP, improving LCP mainly comes down to removing any render-blocking resources that delay the largest image or text block from being rendered, and loading those resources afterwards. If your LCP element is an image, the image's URL should always be discoverable from the HTML source, not inserted later by JavaScript. The LCP image should not be lazy loaded, and you can use the fetchpriority attribute to ensure that it is downloaded early.

First Input Delay (FID)

FID (First Input Delay) measures the time from when a user first interacts with your web page (e.g. clicking a button, link, etc.) to when the browser can begin processing event handlers in response to that interaction. It measures how interactive and responsive your website is. A good FID score is 100 milliseconds or less, while a poor score is more than 300 milliseconds.

Mitigation methods:

Minimize the number of long tasks that block the browser's main thread.

Break long-running tasks into smaller ones, and eliminate any unnecessary JavaScript with code splitting.

Time to Interactive (TTI)

Time to Interactive (TTI) measures how long it takes for a web page to become fully interactive. TTI is measured by starting from the First Contentful Paint (FCP), when content first starts to appear on the user's screen, and then waiting until the criteria for fully interactive are met. A good TTI score is under 3.8 seconds, while a poor score is over 7.3 seconds.

Fully interactive is defined as:

The browser displaying all of the main content of the page

No running Long Tasks

Event handlers registered for visible page components

Mitigation methods:

Find any long tasks and split them, so you are only running essential code at the start.

Total Blocking Time (TBT)

Total Blocking Time (TBT) measures how long your site is unresponsive to user input when loading. It's measured by looking at how long the browser's main thread is blocked from the First Contentful Paint (FCP) until the page becomes fully interactive (Time to Interactive, or TTI). A good TBT score is under 200 milliseconds on mobile and under 100 milliseconds on desktop. A bad TBT is over 1.2 seconds on mobile and over 800 milliseconds on desktop.
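As a worked example of the calculation: web.dev treats any main-thread task longer than 50 milliseconds as a long task and counts only the portion beyond 50 milliseconds as blocking time. The sketch below assumes tasks are given as start and duration pairs; `BlockingTime` is a hypothetical helper, and it simplifies by counting a task only if it starts inside the FCP-to-TTI window.

```java
// Hypothetical helper that sums the blocking portion of long tasks between FCP and TTI.
final class BlockingTime {
    // Tasks longer than this are "long tasks"; only the excess counts as blocking.
    static final long LONG_TASK_MS = 50;

    // Each task is {startMs, durationMs}, measured from navigation start.
    static long totalBlockingTimeMs(long[][] tasks, long fcpMs, long ttiMs) {
        long total = 0;
        for (long[] task : tasks) {
            long start = task[0];
            long duration = task[1];
            // Simplification: a task counts only if it starts within [FCP, TTI).
            boolean inWindow = start >= fcpMs && start < ttiMs;
            if (inWindow && duration > LONG_TASK_MS) {
                total += duration - LONG_TASK_MS;
            }
        }
        return total;
    }
}
```

With tasks of 250, 90, and 35 milliseconds inside the window, the blocking portions are 200, 40, and 0 milliseconds, giving a TBT of 240 milliseconds.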

Mitigation methods:

Identifying the cause of any long tasks that are blocking the main thread. Split those tasks up so that there is a gap where user input can be handled.

Cumulative Layout Shift (CLS)

CLS measures how much the content moves around on the page after being rendered. It is meant to track unexpected layout shifts that occur after the page has loaded, such as a large paywall popping up four seconds after the content loads.

Definition: the score of a single layout shift is the impact fraction multiplied by the distance fraction.

The impact fraction measures how much of the viewport's space is taken up by the shifting element

The distance fraction measures how far the shifted element has moved as a ratio of the viewport

Note: CLS only measures unexpected layout shifts; if the layout shift is user-initiated (e.g. triggered by clicking a link or button), it will not negatively impact the CLS score.
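A short sketch of the per-shift arithmetic, assuming the web.dev definition in which a single layout shift is scored as impact fraction times distance fraction, and the distance fraction is the distance moved divided by the viewport's larger dimension. `LayoutShift` is a hypothetical helper; how individual shifts are combined into a page-level CLS value is not covered here.

```java
// Hypothetical helper for scoring one layout shift.
final class LayoutShift {
    // Distance moved relative to the viewport's larger dimension (width or height).
    static double distanceFraction(double movedPx, double viewportWidthPx, double viewportHeightPx) {
        return movedPx / Math.max(viewportWidthPx, viewportHeightPx);
    }

    // Score of a single shift: fraction of the viewport affected times how far it moved.
    static double score(double impactFraction, double distanceFraction) {
        return impactFraction * distanceFraction;
    }
}
```

For example, if the shifting content affects 75% of the viewport and moves 200 px in a 600 by 800 px viewport, the distance fraction is 0.25 and the shift scores 0.75 × 0.25 = 0.1875.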

Mitigation methods:

Include size attributes on image and video elements, so the browser can allocate the correct amount of space in the document while they are loading.

Find potential problems with the Layout Instability API.

Interaction to Next Paint (INP)

Interaction to Next Paint (INP) measures how responsive the page is throughout the entire page’s lifecycle, where responsiveness is defined as providing the user with visual feedback. A good INP is under 200 milliseconds, while an INP above 500 milliseconds is considered bad.

Definition:

The delay between a user interaction and the next visual update (paint) is measured over the entire page lifecycle, until the user closes the page or navigates away.

Interaction to Next Paint started as an experimental metric and replaced First Input Delay in Google’s Core Web Vitals in March 2024. For more details, check out the web.dev site by the Google Chrome team.

Tracing

Distributed traces allow you to track and understand how a single request flows through multiple components/services in your system.

To implement this, identify specific points in your backend where there is a fork in execution flow or a hop across network/process boundaries. This can be a call to another microservice, a database query, a cache lookup, or similar.

Assign each request that comes into your system a UUID (unique ID) so you can keep track of it. Add instrumentation to each of these specific points in your backend so you can track when the request enters/leaves (the OpenTelemetry project calls this Context Propagation).
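A minimal sketch of this idea, without any tracing library: the trace ID is reused if the caller sent one, minted otherwise, passed along on outgoing calls, and attached to a timing record at each instrumented point. `Tracer`, the `x-trace-id` header name, and the span text format are all assumptions for illustration; OpenTelemetry defines its own APIs and header formats.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical tracer that tags timing records with a per-request trace ID.
final class Tracer {
    // Assumed header name used to carry the trace ID between services.
    static final String TRACE_HEADER = "x-trace-id";

    // Stand-in for a tracing backend: one text record per instrumented operation.
    private final List<String> spans = new ArrayList<>();

    // Reuses the caller's trace ID if present, otherwise starts a new trace.
    String traceIdFor(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(TRACE_HEADER);
        return existing != null ? existing : UUID.randomUUID().toString();
    }

    // Headers to attach to any downstream call so the trace continues there.
    Map<String, String> outgoingHeaders(String traceId) {
        return Map.of(TRACE_HEADER, traceId);
    }

    // Runs the work and records how long it took under the given trace ID.
    <T> T traced(String traceId, String operation, Supplier<T> work) {
        long startNanos = System.nanoTime();
        try {
            return work.get();
        } finally {
            long micros = (System.nanoTime() - startNanos) / 1_000;
            spans.add(traceId + " " + operation + " " + micros + "us");
        }
    }

    // Copy of the records collected so far.
    List<String> spans() {
        return new ArrayList<>(spans);
    }
}
```

A handler would call `traceIdFor` once on entry, wrap each cache lookup or database query in `traced`, and send `outgoingHeaders` with every call to another service.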

You can analyze this data with an open source system like Jaeger or Zipkin.

If you would like to learn more about implementing traces, you can read about how DoorDash used OpenTelemetry here.

Java

public GetProductInfoResponse getProductInfo(GetProductInfoRequest request) {

  // Which product are we looking up?
  // Who called the API? What product category is this in?

  // Did we find the item in the local cache?
  ProductInfo info = localCache.get(request.getProductId());
  
  if (info == null) {
    // Was the item in the remote cache?
    // How long did it take to read from the remote cache?
    // How long did it take to deserialize the object from the cache?
    info = remoteCache.get(request.getProductId());
	
    // How full is the local cache?
    // Only populate the local cache on a remote hit; a miss leaves info null.
    if (info != null) {
      localCache.put(info);
    }
  }
  
  // finally check the database if we didn't have it in either cache
  if (info == null) {
    // How long did the database query take?
    // Did the query succeed? 
    // If it failed, is it because it timed out? Or was it an invalid query? Did we lose our database connection?
    // If it timed out, was our connection pool full? Did we fail to connect to the database? Or was it just slow to respond?
    info = db.query(request.getProductId());
	
    // How long did populating the caches take? 
    // Were they full and did they evict other items? 
    localCache.put(info);
    remoteCache.put(info);
  }
  
  // How big was this product info object?
  // Assumes GetProductInfoResponse can be built from a ProductInfo.
  return new GetProductInfoResponse(info);
}

Each service must record a trace ID and share it with other collaborating services. Collecting the instrumentation for a given trace ID can be done either after the fact or in near real-time using a service like AWS X-Ray.

While metrics might help us rule out causes or highlight an area of investigation, they don’t always provide a complete explanation. For example, we might know from metrics that errors are coming from a particular API operation, but the metrics might not reveal enough specifics about why that operation is failing. At this point, we look at the raw, detailed log data emitted by the service for that time window. The raw logs show the source of the problem—either the specific error or aspects of the request triggering an edge case.

At Amazon, measurements in the application aren’t aggregated and occasionally flushed to a metrics aggregation system. All of the timers and counters for every piece of work are written in a log file. From there, the logs are processed and aggregate metrics are computed after the fact by some other system. This way, we end up with everything from high-level aggregate operational metrics to detailed troubleshooting data at the request level, all with a single approach to instrumenting code. At Amazon we log first, and produce aggregate metrics later.
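The "log first, aggregate later" approach can be illustrated with a small offline aggregator. The log line format (`operation=... time_ms=...`) and the `LogAggregator` class are assumptions made for this sketch and are not Amazon's actual format or tooling.

```java
import java.util.HashMap;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical after-the-fact aggregator over per-request log lines.
final class LogAggregator {
    // Parses a line such as "operation=GetProductInfo time_ms=12" into key/value pairs.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new HashMap<>();
        for (String pair : line.trim().split("\\s+")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return fields;
    }

    // Computes count, min, max, and average latency per operation from raw log lines.
    // Lines missing either field are skipped; a non-numeric time_ms would throw.
    static Map<String, LongSummaryStatistics> latencyByOperation(List<String> lines) {
        Map<String, LongSummaryStatistics> stats = new TreeMap<>();
        for (String line : lines) {
            Map<String, String> fields = parse(line);
            String operation = fields.get("operation");
            String timeMs = fields.get("time_ms");
            if (operation == null || timeMs == null) {
                continue;
            }
            stats.computeIfAbsent(operation, key -> new LongSummaryStatistics())
                 .accept(Long.parseLong(timeMs));
        }
        return stats;
    }
}
```

Because the raw lines are kept, the same data can serve both the aggregate view and request-level troubleshooting.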