Monitoring Design for Critical Applications

Critical applications that must recover from a wide range of failures in under 5 minutes require specialized monitoring. Many teams don’t realize that their monitoring stack can fail due to the same underlying issue that impacts their application, or that their metrics cannot detect and alert on partial failures. This article explains the trade-offs and design recommendations for monitoring critical applications.

Designing a monitoring solution for critical applications can be a little tricky, especially if your failure recovery automation depends on monitoring to detect and respond to failures. Critical applications are those that need to hit 99.999% or better uptimes, are sensitive to error rates of 5% or lower, need to recover from a wide range of failures in less than 5 minutes, or must be resilient against unusual major failures like a total unrecoverable loss of a datacenter.

The guidance in this article will help you design monitoring for critical systems that you can use to detect and recover from failures in less than 5 minutes. The recommendations focus on issues that matter for critical applications and for metrics specifically used to trigger automated failure mitigation, like deployment rollback, removal of impaired application instances, datacenter failover or client isolation. The article will also cover considerations for monitoring systems themselves.

This article builds on recommendations on monitoring and observability you can find in the Amazon Builders’ Library. Monitoring production services at Amazon describes best practices for maintaining metrics that detect poor client experience in your application. The presentation describes methods for dealing with metric noise and a number of other problems that can lead to false positives or false negatives in your alarms. Instrumenting distributed systems for operational visibility provides specific recommendations on instrumenting your application and includes practices that prevent monitoring from causing failures in your application while ensuring you have sufficient data to troubleshoot failures when they occur.

Instrumentation

Automated mitigation starts with metrics that detect customer-impacting failures or other system failures. The metrics below are a good baseline for most applications.

Error Rates

Application error rates are a useful way to detect a wide range of failures. For example, error rates can increase as a result of a problematic deployment or a host that has run out of memory. If you can get them, client-side error rate metrics are effective at detecting client-to-application reachability failures caused by network failures or misconfigurations between the client and the application. Per the guidance in Instrumenting distributed systems for operational visibility, emit a 0-value metric for every successful request and a 1-value metric for every failing request; you can then use an average aggregation over a time interval of your choosing to get an error rate. Separate error codes you know are due to service faults (5xx errors) from things like invalid requests (4xx errors), which usually mean clients have made a mistake in their requests. You’ll want both, but you need them separated so you can set different alarm thresholds.
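As a concrete illustration, here’s a minimal sketch of that 0/1-per-request pattern using boto3 and CloudWatch; the namespace and metric names are placeholders, and a real service would batch data points (or use the CloudWatch embedded metric format) rather than call the API once per request.

```python
# Minimal sketch: emit 0/1 per-request metrics so an Average aggregation yields an error rate.
# "MyService" and the metric names are illustrative placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_request_metrics(status_code: int) -> None:
    """Publish one data point per request for service faults (5xx) and client errors (4xx)."""
    is_fault = 1 if 500 <= status_code <= 599 else 0         # service-side failures
    is_client_error = 1 if 400 <= status_code <= 499 else 0  # invalid requests
    cloudwatch.put_metric_data(
        Namespace="MyService",  # placeholder namespace
        MetricData=[
            {"MetricName": "Fault", "Value": is_fault, "Unit": "Count"},
            {"MetricName": "ClientError", "Value": is_client_error, "Unit": "Count"},
        ],
    )
```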

Application error rates are great for detecting failures, but they only work as long as the application is running. The application must log the error and then push the metric to a monitoring platform. If the application is crashing, the host dies, or the logs cannot be processed, the metric won’t be available to alarm on. Client-side metrics or health checks can act as a redundant system for measuring application error rates, because they don’t depend on application functionality to measure application health.

Alarm thresholds for service-side errors like 5xx errors should be set in the 1% to 5% range, or lower. Some applications can operate at close to zero 5xx errors at all times; in that case, you can set an alarm threshold on absolute error counts to detect very low-grade failures.
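For example, a fixed-threshold alarm on the average of a 0/1 fault metric might look like the sketch below; the names, the SNS topic ARN, and the evaluation settings are illustrative.

```python
# Sketch: alarm when the average of the 0/1 Fault metric (the 5xx error rate)
# exceeds 5% for two consecutive 1-minute periods. All names are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="MyService-FaultRate-High",
    Namespace="MyService",
    MetricName="Fault",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=0.05,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",  # missing data can itself indicate a broken metric pipeline
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-oncall-topic"],  # placeholder ARN
)
```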

Client-side error alarm thresholds should be set based on nominal application baselines or by using some sort of auto-baselining anomaly detection. Some applications see very high nominal client error rates of 70% if just a few clients get overexcited and send invalid requests. It’s OK to have a high client error rate, but it is still a good idea to set an alarm threshold, even if it is 80% or higher. If you ever make a change to your validation logic and suddenly 100% of requests are considered invalid, you’ll need this alarm to detect the failure.

Elevated latency

Step-function latency increases in operations can indicate a failure. Network infrastructure failures can lead to packet loss, which in turn can affect request latencies. You’ll want something more than an average aggregation: percentile aggregations at p50, p90 and p95 provide good metrics for detecting infrastructure failures or performance regressions caused by problematic configuration.

Application latency metrics can’t measure the network path between your client and your application, which may also be contributing to latency. Client-side metrics can clue you in to performance problems anywhere between your service and the client.

Applications with consistent latency can use fixed alarm thresholds; otherwise, anomaly detection might be needed to detect failures while preventing false positives.
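If your latency is stable enough for fixed thresholds, an alarm on a percentile statistic is a reasonable starting point. The sketch below uses boto3 with placeholder names and an illustrative threshold; CloudWatch anomaly detection alarms are an alternative when the baseline shifts over time.

```python
# Sketch: alarm when p90 latency exceeds 500 ms (names and threshold are illustrative).
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="MyService-Latency-p90-High",
    Namespace="MyService",
    MetricName="Latency",
    ExtendedStatistic="p90",   # percentile statistic instead of Average
    Period=60,
    EvaluationPeriods=2,
    Threshold=500.0,           # assumes the metric is emitted in milliseconds
    Unit="Milliseconds",
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-oncall-topic"],  # placeholder ARN
)
```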

Healthchecks

I’m using the term “Healthchecks” to include smoke tests, canaries, deep health checks, and shallow health checks. Any synthetic request generator running external to the application host qualifies.

If you are using a load balancer or Kubernetes, you are already using health checks. Load balancers and app-instance management platforms like Kubernetes perform regular health checks against each of your application instances. Most load balancers allow you to specify the endpoint to test, and you can customize that endpoint to perform additional internal checks before returning a healthy response. You can also instrument your own health checks using something like Route 53 health checks or CloudWatch Synthetics.
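For example, a custom health endpoint might run a few cheap internal checks before reporting healthy, as in the minimal sketch below; the check functions are placeholders for whatever your application actually depends on.

```python
# Sketch of a custom health endpoint handler that runs internal checks before
# reporting healthy. The individual checks are placeholders for your own logic.
import shutil

def disk_has_space(path: str = "/", min_free_fraction: float = 0.05) -> bool:
    """Report unhealthy before a full disk starts breaking logging and metrics."""
    usage = shutil.disk_usage(path)
    return (usage.free / usage.total) >= min_free_fraction

def dependency_reachable() -> bool:
    # Placeholder: probe a critical dependency (database, cache, downstream service).
    return True

def health_handler():
    """Return the (status_code, body) your load balancer's health check endpoint serves."""
    if disk_has_space() and dependency_reachable():
        return 200, "ok"
    return 503, "unhealthy"
```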

Health checks operate outside of application hosts, so even if the application crashes or the application host is unavailable or unreachable, the health check can gather and publish metric data. That makes health checks a great option for redundant monitoring.

Health checks minimize false positives by going into a failed state only after a configurable number of consecutive requests have failed. This prevents one-off failures from taking capacity offline.

You can calculate the time to detect a failure from the health check configuration: consecutive failures × polling interval = time to detection. Route 53 health checks detect failure in 30 seconds by default, with an option for fast health checks that detect failure in 10 seconds. Application Load Balancer detection defaults to 60 seconds (2 consecutive failures with 30-second check intervals), and you can configure it down to 5-second intervals for 10-second detection.
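Because this arithmetic drives your recovery time budget, it can be worth encoding it directly; the tiny helper below just makes the trade-off between polling interval and consecutive-failure threshold explicit.

```python
def time_to_detect(polling_interval_s: int, consecutive_failures: int) -> int:
    """Worst-case seconds from onset of failure until the health check reports unhealthy."""
    return polling_interval_s * consecutive_failures

# Application Load Balancer defaults: 30 s interval x 2 consecutive failures = 60 s
print(time_to_detect(30, 2))  # 60
# Tuned to the minimum interval: 5 s interval x 2 consecutive failures = 10 s
print(time_to_detect(5, 2))   # 10
```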

Health checks are prone to false negatives for intermittent failures and elevated error rates. Most health checks can only reliably detect very high error rates (>80%) because the polling interval and number of checks are too sparse. You can detect low-grade error rates by increasing the sample rate of some types of health checks, like CloudWatch Synthetics, to 100 or more checks per poll interval and counting n-of-m failures with custom code. For example, if you see more than 5 total failures out of 100 checks in an interval, you can call that a failure and detect fault rates as low as 5%. This might seem wastefully redundant, and it may be if every check goes to the same host over the same network connection. Creating a new connection for each of the 100 checks means you are likely to traverse a different set of network devices and connect to a different host in your fleet. However, checking the same host over and over is exactly what you want to do if you need to pinpoint a specific application instance generating low-grade error rates.
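Here’s a rough sketch of the n-of-m idea: fire about 100 probes per interval, open a fresh connection for each so the probes spread across network paths and hosts, and call the interval failed above a 5% fault rate. The endpoint and thresholds are placeholders, and a real checker would issue the probes concurrently so the whole batch fits comfortably within the interval.

```python
import requests  # third-party HTTP client, assumed available

ENDPOINT = "https://myservice.example.com/health"  # placeholder health endpoint
SAMPLES = 100
MAX_FAILURES = 5  # 5 of 100 failed probes => detects ~5% fault rates

def run_interval_check() -> bool:
    """Return True if the interval is healthy (fewer than MAX_FAILURES failed probes)."""
    failures = 0
    for _ in range(SAMPLES):
        try:
            # A new Session per probe forces a new connection, so probes are more
            # likely to traverse different network paths and land on different hosts.
            with requests.Session() as session:
                response = session.get(ENDPOINT, timeout=1)
                if response.status_code >= 500:
                    failures += 1
        except requests.RequestException:
            failures += 1
    return failures < MAX_FAILURES
```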

In my experience, a single host generating low-grade errors is the less common scenario. More common is a single host, a network path, a network device, or a group of these things generating high error rates for some application functionality, while all of the other healthy devices or healthy application functions dilute the small number of unhealthy instances, so the measured error rate stays low. Because the low-error single host is less common, you don’t need to go all-in on 100 samples per host per 1-minute period, but I do recommend measuring all of your application instances in aggregate at 100 samples per 1-minute period.

Request rates

Request rates measure the number of calls your system is receiving or generating. Step-function drop-offs can clue you in to system failures, including reachability issues between client and application or between application and dependencies. For example, if some kind of large-scale Internet event prevents client networks from connecting to your service, you’ll see a big drop in requests but no error rate increase. Step-function increases in request volume can clue you in to client-originated failures: for example, a flash crowd event where many clients send more requests, a single client sending a very large volume of requests, or a DDoS attack from malicious clients.

Request rates are more prone to noise, false positives, and false negatives than the other metrics. Some form of anomaly detection can make alarms on this type of metric actionable.

Work Backwards from Mitigation to Set Metric Dimensions

What happens after your monitoring systems detect an issue? If you’re targeting ~5 minute recovery times, then your answer should be “triggering automated mitigations”. If you are triggering automated mitigations, then your metric must be aligned to whatever automated mitigation you are using.

Let’s start with a commonly used form of automated failure mitigation: host removal by load balancers to mitigate host failures. Load balancers health check each individual application instance, track its health, and remove instances as they fail. There’s a one-to-one relationship between the alarm and the alarm action. You should follow this one-to-one pattern when creating metrics for automated mitigations.

If you want to detect impaired AWS Availability Zones (AZs), and remove all application instances in an impaired AZ from service to mitigate the failure, then you need per-AZ metrics and alarms to detect the failure. Those alarms must be linked to automation that removes the AZ from service.
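For example, if your application emits its fault metric with an Availability Zone dimension, you can create one alarm per AZ and wire each alarm to the automation that shifts traffic away from that AZ. The sketch below assumes a 0/1 fault metric like the one described earlier; the AZ names and the action ARN are placeholders.

```python
# Sketch: one alarm per Availability Zone, each tied to automation that removes that AZ.
import boto3

cloudwatch = boto3.client("cloudwatch")

for az in ["us-east-1a", "us-east-1b", "us-east-1c"]:  # placeholder AZ names
    cloudwatch.put_metric_alarm(
        AlarmName=f"MyService-FaultRate-High-{az}",
        Namespace="MyService",
        MetricName="Fault",
        Dimensions=[{"Name": "AvailabilityZone", "Value": az}],  # app must emit this dimension
        Statistic="Average",
        Period=60,
        EvaluationPeriods=2,
        Threshold=0.05,
        ComparisonOperator="GreaterThanThreshold",
        # Placeholder action: in practice this would invoke the automation that shifts
        # traffic away from the impaired AZ (for example, via SNS and a Lambda function).
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:az-evacuation-topic"],
    )
```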

If you want automated rollbacks, then you need metrics that identify the particular change that needs to be rolled back. The kinds of metrics you need depend heavily on the parameters the rollback mitigation needs. Is there one rollback button per application deployment pipeline? If so, you only need an alarm per application, with automation that triggers the rollback. Does your rollback need a version number to roll back to? If so, you’ll need an alarm that identifies the problematic version.

If you want to automatically mitigate problematic individual clients, you’ll need metrics that identify the client. If your automation blocks or reroutes the client, you might need the client IP address, which means you’ll need some kind of per-client IP alarm.

Working backwards from mitigation is shorthand for designing your metrics and aggregation points around the mitigation. You’ll need to emit metrics that align with the scope of your mitigation and supply whatever parameters it needs to work.

You must also consider timing goals and threshold goals for your mitigation. Let’s start with a look at timing considerations in the next section.

Set Aggregation Periods and Alarm Thresholds

Set Aggregation Periods Based on Recovery Time Objectives

Make sure your metric aggregation periods align with your overall recovery time objective. For example, if you need a 5-minute recovery time, you will need something other than 5-minute metric aggregation. Depending on how long your mitigation action takes, you might need something more responsive than 1-minute aggregation. Your monitoring platform may be able to generate an alert without waiting for the aggregation period to fully close, but it’s better to be conservative and assume all of that time comes out of your overall recovery time budget. A 1-minute aggregation leaves you 4 minutes for everything else that needs to happen to meet a 5-minute recovery time objective.

Budget for Metric and Alarm Processing Time

Don’t forget to budget for metric and alarm processing time. The logging agent running on each host typically pushes batches of data on some interval to a storage area or a queue, where it is picked up by another polling process. Metric data is then sorted, aggregated, and transformed. Another process reads the data and compares it against configured alarm thresholds. Then a component triggers the alarm, which takes still more time. You should test the time it takes from onset of failure to triggered action. For CloudWatch, this processing normally adds tens of seconds, not minutes, of additional time. However, bear in mind that this time may increase if CloudWatch experiences impairments. You can mitigate failures of your monitoring stack by using a redundant monitoring stack, discussed further down in this article.

Low-Grade Error Detection Requires More Samples

If you need to detect intermittent failures or error rates as low as 5%, pay close attention to your metric sample rates. I recommend about 100 data points per aggregation period to minimize false negatives. If your mitigation requires a granular metric, like a metric per host or per client, you might not have the natural request volume to generate that many data points for each client and host. You can increase your aggregation period from 1 minute to 5 minutes to increase your samples by 5x, but this also increases your overall recovery time by 5x. For some metrics, you can make up for low volumes by synthesizing requests with something like CloudWatch Synthetics, Route 53 health checks, or your own custom polling agent.

Prepare for Failures that Impact Monitoring

Failures can impact your monitoring infrastructure. Monitoring systems are distributed systems too and are subject to the same failures as your application. Monitoring systems can fail at the same time as the applications they monitor, because they often share common infrastructure and common dependencies with those applications. Failures in monitoring systems can prevent application recovery by breaking automatic mitigations or by forcing your operations team to “fly blind,” troubleshooting an issue without the aid of metrics to help them assess impact and diagnose failures. You can design mitigations to ensure monitoring system failures don’t leave you without the ability to recover.

Identify failure risks in your monitoring system

The diagram below shows a typical metric and alarm processing pipeline. Each box represents a component that could itself be a large, complex distributed system. There are opportunities for failure all along the way. For example, a full disk on an application host can prevent new metrics from being written or published, and it can cause application errors at the same time.

Metric Processing Pipeline

Oh the irony when the monitoring system is the thing that fills the disk that causes the host impairment that prevents an alarm letting operators know there is a problem.

The question to ask yourself is: what happens when failures occur? What happens if the agents on the host can’t publish their metrics? What happens if your monitoring stack fails? Do your automated mitigations, like your load balancer or auto scaling systems, break? Will you know impact is happening if your metric dashboards are inaccessible?

This is where redundancy can help. Notice that the health check agent in the diagram can run on a separate host from the measured application and can publish metrics to an entirely different monitoring stack. That’s just one way to build redundant monitoring. There are a number of ways, listed below, that can reduce monitoring failure risks by limiting critical dependencies or by adding redundancy.

Design monitoring for partition tolerance

Be intentional about where you place your monitoring stack relative to the monitored application. There isn’t one right answer on whether you should collocate your monitoring stack with your application or place it on separate infrastructure. I prefer a bit of both for passive monitoring, but the trade-offs get much trickier when you’re using metrics to drive mitigation actions. Do you want your system to pull capacity offline if monitoring gets partitioned off and can’t reach your system, even though your clients can? This trade-off matters most during partitioning failures, when a network failure prevents hosts in one segment of the network from communicating with another segment.

Partitioning events can prevent a monitoring agent from measuring the health of an application instance, even if that application instance is perfectly fine. A partition event can also prevent metric publishing if the application or the external monitoring agent can’t reach the monitoring stack. Finally, mitigation automation or human operators may not be able to take corrective actions if a network partition prevents them from reaching the monitoring stack.

The diagram below describes each of these monitoring roles. For any particular failure and recovery action, consider which partition each role lives in. What kinds of failures would partition one role off from another? For example, if a single datacenter in your system were partitioned off, which roles would be cut off from the others? When this happens, will your metrics report truthful information? Can you relocate the roles inside or outside of that datacenter to achieve a more desirable result?

Roles in Monitoring Systems

If your monitoring system puts external monitoring agents in a different network than your actual clients, a failure can affect the health check but not the clients. If this occurs, the health checks will fail and you will receive a false positive alert. That, in turn, may kick off an automated mitigation that wasn’t needed. In this scenario, having redundant client-sourced or application-sourced metrics could clue in an operator that the application and clients are fine, even though health checks are failing.

This process of designing with partitioning in mind will ensure you are prepared for failures that can impact your monitoring. Once you identify a few key failure modes, you can choose from the mitigations below to prevent a total loss of monitoring and to reduce false positives and false negatives in alerting.

Use external health checks or canary tests for redundancy

When your health checks run on a host separate from the application they monitor, they can detect failures even when something like a filled hard drive prevents on-host metric agents from doing their job. Remember that health checks can still produce false positives and false negatives for two reasons. First, health checks probably don’t run in all of your clients’ networks, especially if your clients come in from all over the Internet. If your clients are other services, place your health checks in the same networks as your clients if you can. Second, your health checks aren’t your clients; they probably aren’t performing all of the actions your clients perform. You could test every function of your application, but that might result in health checks that take as long to execute as your full suite of regression tests. Ideal health checks can alert you to failures in <1 minute, so you will need to choose your health check validations carefully and ensure each iteration completes in <10ms to get 100 samples in a 1-minute period. Client-sourced metrics can help fill the gap here.

Use client-sourced metrics for redundancy

Client metrics are great if you can get them. They address some of the shortcomings of health checks because they run in all of your client networks and measure the service functions your clients actually use. However, they are complex to set up, and they only work if you have a good number of active clients using your service at all times. If clients aren’t exercising all of your functions all of the time, you might not detect a failure, so you’ll need to supplement with synthetic testing using some kind of external health check.

Use independent redundant metric pipelines

Redundant measurement is great, but if everything pushes all the metrics to the exact same monitoring stack, a failure of that monitoring stack will leave your team blind and your mitigating actions inert. If you’re using AWS, setting up CloudWatch in two different Regions ensures that even if CloudWatch experiences an issue in one Region, you can use the other Region to diagnose and mitigate application failures. The diagram below depicts this with the gray boxes that process health check agent data.

Redundant Monitoring Stack
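A simple way to get an independent second pipeline on AWS is to publish the same health data points to CloudWatch in two Regions, as in the sketch below; the Region pair and namespace are placeholders, and the publisher is assumed to be the external health check agent.

```python
# Sketch: publish the same 0/1 health data point to two Regional CloudWatch stacks,
# so a failure in one Region leaves the other usable. Region names are placeholders.
import boto3

REGIONS = ["us-east-1", "us-west-2"]  # placeholder Region pair
clients = {region: boto3.client("cloudwatch", region_name=region) for region in REGIONS}

def publish_health(healthy: int) -> None:
    """Publish the health data point to both Regional monitoring stacks."""
    for region, client in clients.items():
        try:
            client.put_metric_data(
                Namespace="MyService/HealthCheck",  # placeholder namespace
                MetricData=[{"MetricName": "Healthy", "Value": healthy, "Unit": "Count"}],
            )
        except Exception:
            # Failing to publish to one Region must not block publishing to the other.
            pass
```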

Route 53 health checks are built with lots of redundancy: the health checkers run in 8 Regions, and the DNS data plane runs in more than 100 edge locations worldwide. However, if you use the health check metrics and alarms for mitigation actions, those metrics are published through a single-Region CloudWatch monitoring stack located in us-east-1. A failure of the monitoring stack in us-east-1 could leave you without metrics until the stack recovers. If you’re using health checks directly within your DNS routing policies, you don’t need to worry; the default level of redundancy will continue to work even in the extremely unlikely event that two Regions experience simultaneous impact.

Collocate monitoring and actions to reduce dependencies

The pipeline above has lots of layers that can fail. You can remove dependencies by keeping the monitoring pipeline and mitigation automation simple and collocated. This is how hardware load balancers operate: each load balancer routing device has a process that continuously checks each back-end instance, along with its own local health check metric aggregators and local automated mitigation actions to take unhealthy back-end hosts out of service.
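The sketch below captures the shape of that collocated design: one local loop that checks each back end and edits an in-memory routing table directly, with no external metric pipeline in the critical path. The back-end addresses, threshold, and interval are placeholders.

```python
# Sketch of a collocated checker-plus-mitigator loop, in the style of a load balancer.
import time
import requests  # third-party HTTP client, assumed available

BACKENDS = ["10.0.1.10:8080", "10.0.2.11:8080"]  # placeholder back-end addresses
UNHEALTHY_THRESHOLD = 2
in_service = set(BACKENDS)
consecutive_failures = {backend: 0 for backend in BACKENDS}

while True:
    for backend in BACKENDS:
        try:
            response = requests.get(f"http://{backend}/health", timeout=1)
            ok = response.status_code == 200
        except requests.RequestException:
            ok = False
        if ok:
            consecutive_failures[backend] = 0
            in_service.add(backend)       # local, immediate mitigation: restore the backend
        else:
            consecutive_failures[backend] += 1
            if consecutive_failures[backend] >= UNHEALTHY_THRESHOLD:
                in_service.discard(backend)  # local, immediate mitigation: stop routing to it
    time.sleep(5)  # check interval; no external monitoring stack in the loop
```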

Prepare Flying Blind Dashboards and Procedures

Certain large-scale events will leave your team blind. If you have built metric redundancies, you can prepare dashboards and procedures that anticipate things like a monitoring stack failure. For example, if you run two redundant monitoring stacks in two different Regions as shown in the diagram above, you can create dashboards that anticipate that one or the other Region’s metrics will be missing. You can build procedures or add notes inline with your dashboards, clueing your operators in to the type of Regional failure that could cause the loss of a particular metric and redirecting them to metrics hosted in the other Region. You should also be prepared to log in to application hosts directly to check logs and assess application health, or to mitigate failures manually if your automated mitigations are left inert by a failure of the monitoring stack.

Design for Race Conditions in Monitoring

If you use redundant metrics, whether or not you run in multiple Regions, you need to prepare for race conditions, delayed metrics, and multiple alarms for single events. Automatic mitigations that trigger off of those alarms must be designed to detect and resolve these race conditions, or you may end up triggering too many mitigations, or the wrong kind. For example, a mitigation that removes unhealthy hosts could end up removing too many hosts, starving your clients of critical capacity.

Route 53 Application Recovery Controller routing controls and zonal shift are designed to handle these race conditions and limit how many of your application instances can be automatically taken out of service. You can also implement your own distributed systems coordinator using something like etcd, ZooKeeper, or your own custom datastore.
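One common guard, in the spirit of what routing controls and zonal shift provide, is to refuse any automated removal that would take more than a fixed fraction of capacity out of service. The sketch below is a hypothetical illustration of that check, not how those services are implemented.

```python
MAX_OUT_OF_SERVICE_FRACTION = 1 / 3  # placeholder safety limit

def safe_to_remove(total_instances: int, already_removed: int, to_remove: int) -> bool:
    """Refuse a mitigation that would starve clients of capacity.

    Multiple alarms racing on the same event may each request a removal; this
    check bounds the combined effect no matter how many alarms fire.
    """
    proposed = already_removed + to_remove
    return proposed <= total_instances * MAX_OUT_OF_SERVICE_FRACTION

# Example: with 12 instances and 3 already removed, removing 1 more is allowed,
# but removing 2 more would exceed the 1/3 limit and is refused.
assert safe_to_remove(12, 3, 1) is True
assert safe_to_remove(12, 3, 2) is False
```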

Test Your Metrics and Metric-Driven Mitigations

Metrics, alarms, and automated mitigations driven off of those alarms are complex. The only way to be confident they work is to see them work on a regular basis. Monitoring production services at Amazon describes processes Amazon uses to review metrics and alarms on a weekly basis to ensure they accurately represent the customer experience. Testing is another good practice. You can use fault injection techniques with something like AWS Fault Injection Simulator or a third-party tool like Gremlin.

You can go a long way by directly modifying your application code with a few lines that randomly return an error code. I’ve found this method to be the easiest way to test automated alarms and mitigations, because I want to respond to low-grade error rates at or below 5%. Most fault injection frameworks concentrate on host failures, which are also useful, but those frameworks aren’t as good at simulating intermittent or low-grade failures.
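Here’s a hedged sketch of that approach: a small decorator that turns a configurable fraction of requests into 500 responses, so you can dial the injected error rate to 5% or lower and confirm your alarms and mitigations fire. The decorator and the environment variable are illustrative, not part of any framework.

```python
# Sketch: inject a configurable rate of 500-style errors into a request handler.
import os
import random
from functools import wraps

def inject_faults(handler):
    """Return an error response for a configurable fraction of requests."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        # INJECTED_FAULT_RATE is a placeholder knob; 0.05 simulates a 5% error rate.
        fault_rate = float(os.environ.get("INJECTED_FAULT_RATE", "0.0"))
        if random.random() < fault_rate:
            return {"statusCode": 500, "body": "injected fault"}
        return handler(*args, **kwargs)
    return wrapper

@inject_faults
def handle_request(event):
    # Normal request handling would go here.
    return {"statusCode": 200, "body": "ok"}
```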

Once your application is instrumented with random failures, you can turn up the error rate incrementally to validate that your alarm thresholds work as expected. You can also measure the time to detect and the time to mitigate, because you’ll know exactly when you activated the failure and can compare that to the time it takes for the alarm to fire. If you are using automated mitigations, you can use your error rate metrics to calculate the overall time to mitigate a failure.

If you are using redundant monitoring stacks, be sure to disable one stack and test, then disable the other and test, to ensure both are configured properly to alarm and trigger any automated mitigations.

Conclusion

Critical applications need specialized monitoring. The techniques described in this article can get you started on designs that protect you against large-scale events that impact your monitoring stack. The metric designs described here detect and respond to low-grade and intermittent failures in support of automated mitigations for RTOs of 5 minutes or less. Check back for specific examples of monitoring solutions using open source tooling and AWS services.

Do you like what you see? Do you disagree? Would you like to share some specific examples of how you use the concepts discussed in this article? Please let me know using the Disqus comment box below.