Steps on the Path
- You MUST export metrics of your application
- You MUST enable tracing in your application
- You MUST be thoughtful in writing information to logs
- You MUST build a basic dashboard to understand your metrics
"Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs." (Wikipedia)
One of the ideas behind building a DevOps team is making sure that any system you build and support emits enough logs, metrics, and traces that you can understand what that system is doing.
Most teams have heard about and (hopefully) have instituted some kind of monitoring. They want to know if something goes bump in the night and be alerted to it before their users are aware of a problem. This is where Observability practices come into play.
Think back to Physics class, where you learned about the concept of a vector. Vectors are measurements that have magnitude and direction. An observable can be thought of as similar to a vector: it has a value (magnitude) and a trend (direction). Perhaps the number of hits to your website is something you want to monitor. This Metric (one type of observable) has an instant value and a direction over time, and both are important in understanding the health of your system. Seventy-four requests per second might be a perfectly good scalar measurement in the context of your maximum throughput of 2,000 requests per second; you're well under your capacity limit. But if you compare it to yesterday's or last week's value for the same time period and see that you should normally have 420 requests per second, that Metric becomes a signal that something is wrong with your application.

Both the instantaneous value and its relationship over time (and possibly to other values, like latency) give engineers a way to observe the system and act accordingly. Maybe a database is starved, or your largest client is in maintenance mode. By building other Metrics into a dashboard, a DevOps team can understand whether there is a problem. Metrics are worth a separate post of their own.
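The idea that a Metric carries both a magnitude and a direction can be sketched in a few lines. This is a hypothetical, in-memory illustration (the class name, thresholds, and numbers are my own, borrowed from the example above); a real application would export the counter to a metrics backend such as Prometheus or StatsD rather than evaluating it locally:

```python
class RequestRateMetric:
    """Evaluates a request rate against both capacity (magnitude)
    and a historical baseline (direction/trend)."""

    def __init__(self, capacity_rps, baseline_rps):
        self.capacity_rps = capacity_rps    # e.g. 2000 req/s max throughput
        self.baseline_rps = baseline_rps    # e.g. 420 req/s at this hour last week

    def evaluate(self, current_rps):
        """Return a list of signals: the instant value alone is not enough."""
        signals = []
        if current_rps > 0.9 * self.capacity_rps:
            signals.append("near capacity")
        if current_rps < 0.5 * self.baseline_rps:
            signals.append("well below historical baseline")
        return signals

metric = RequestRateMetric(capacity_rps=2000, baseline_rps=420)
print(metric.evaluate(74))  # fine against capacity, alarming against the baseline
```

The same number, 74 requests per second, is healthy by one comparison and a warning sign by the other, which is exactly why a dashboard should show trends, not just gauges.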
Sometimes an application can misbehave, and a team does not want to show that behavior to its clients, for good reason. As a rule of thumb, engineers never want to leak exceptions back to UIs or API calls; they prefer to show a "pretty" error message or an error code that can prompt a user or system to try again or choose a different circuit (more on that in a different post). At the same time, the team needs to know what is happening inside the application. On top of Metrics that show errors per second or the HTTP status codes being returned to clients, teams can also emit Logs detailing the errors happening inside the system. These Logs can help diagnose something like a bad database query or a malformed request to a downstream service. Logging takes finesse; the last thing a team wants is to build out a bunch of logging only to have it become useless because the one error that matters is buried in a stream of other messages. Choose carefully which exceptions to log, and combine those exception messages with enough data to diagnose an issue in a human-readable format.
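A minimal sketch of that pattern with Python's standard `logging` module: log the exception once, with the identifiers a human needs to investigate, and hand the caller a "pretty" error code instead of the stack trace. The function name, error code, and simulated failure here are all illustrative, not from the original post:

```python
import logging

logger = logging.getLogger("orders")

def place_order(order_id, customer_id):
    try:
        # Stand-in for a call to a downstream service that can fail.
        raise TimeoutError("inventory service did not respond")
    except TimeoutError:
        # One log line, with enough context to diagnose the issue,
        # and the full traceback captured by logger.exception().
        logger.exception("order placement failed order_id=%s customer_id=%s",
                         order_id, customer_id)
        # The client sees a stable error code, never the exception itself.
        return {"status": "error", "code": "ORD-503",
                "message": "We could not place your order. Please try again."}

print(place_order("A123", "C456")["code"])  # prints ORD-503
```

Keeping the log message structured (stable key=value pairs) makes it searchable later, which is what keeps it from drowning in the stream of other messages.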
Third, the concept of Tracing has been gaining steam in the community. The CNCF recently announced the merger of OpenTracing and OpenCensus, two competing Tracing standards (the combined project is now known as OpenTelemetry). Tracing is the act of following a request from system to system to understand its whole lifecycle. It blurs the boundaries between systems so that engineers can better understand the latency that each backing service adds to a request. If granular enough, it also lets teams see whether an algorithm needs to be tuned. As an example, a point-of-sale system scans an item, looks up the price and tax, and displays them on the screen; when the next item is scanned, it does that again, until the user requests a total. When the user pays, their card is read, the data is encrypted and sent somewhere to be authorized, the payment is authorized and booked, and a receipt is printed for the customer. Those two scenarios (adding items to a cart and performing a checkout) are very user-focused, so it is important for teams to understand how quickly those interactions with the user are happening. Does a particular scan-gun model take longer to read the UPC? Does one type of credit card take longer to process than others? By capturing traces of interactions at this level of detail, those who are able to tweak the interactions can better understand where gains can be made.
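The checkout scenario above boils down to nested, timed spans with parent/child relationships. This is a deliberately hand-rolled sketch to show the shape of the idea; in practice a team would use the OpenTelemetry tracer API rather than rolling their own, and the span names ("checkout", "read_card", "authorize_payment") are illustrative:

```python
import time
from contextlib import contextmanager

SPANS = []  # completed spans, in the order they finish

@contextmanager
def span(name, parent=None):
    """Record how long a named step took, and which step it belongs to."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({"name": name, "parent": parent,
                      "duration_ms": (time.perf_counter() - start) * 1000})

with span("checkout"):
    with span("read_card", parent="checkout"):
        time.sleep(0.01)   # stand-in for the card reader hardware
    with span("authorize_payment", parent="checkout"):
        time.sleep(0.02)   # stand-in for the network call to the processor

for s in SPANS:
    print(f"{s['parent'] or '-':>10} -> {s['name']}: {s['duration_ms']:.1f} ms")
```

Because child spans carry a parent reference, the trace can answer exactly the questions posed above: which card type is slow to authorize, which scan gun is slow to read, and where the checkout time actually goes.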
The three concepts I've covered here (Metrics, Logs, and Traces) offer enough observability for any team to actively manage and report on the performance of a system, and possibly the happiness of its users. Happy users mean happy managers and keep you out of a War Room situation.