Code & Config failures usually occur around a change, such as a code deployment or a manual configuration change. Many organizations rely on rollback procedures to mitigate failures when problematic changes occur. But if you are targeting 99.999% uptime or recovery times of less than 5 minutes, a rollback isn’t an ideal mitigation: rollbacks won’t reliably mitigate change-related failures on <5 minute timelines.
This article explains why rollbacks are too slow and unreliable to act as a primary mitigation mechanism for 99.999% SLAs, then describes alternative mitigations that can meet <5 minute recovery time targets.
Rollbacks Don’t Meet the Bar
Rollbacks are excellent for any business pushing live deployments and for those following good Continuous Integration/Continuous Delivery (CI/CD) practices. However, they aren’t fast enough or reliable enough to mitigate change-related failures for systems that need the highest levels of resilience.
First, rollbacks can be slow. Deployments that involve launching new compute instances, downloading containers, and configuring applications can take anywhere from 3 to 10+ minutes to complete. If your rollback uses these same steps, and we budget 2 minutes for metrics and alarming to detect the failure and trigger the rollback, then your rollback is already too slow to mitigate impact in less than 5 minutes.
Second, you must quickly identify the problematic change. Most organizations lack systems that automatically correlate a change to a problem. I’ve been on many live site calls with engineers asking, “What changed?” If you are asking this question, then automated rollbacks won’t work for you, and you won’t achieve 5 minute recovery times. It makes sense that the correlation problem is difficult: a deployment to one application can cause a problem in an entirely different application, making it very difficult to correlate a change to a failure in complex environments.
Third, not all changes run through an automated pipeline, and some changes may not be tracked at all. Manual changes, for example, often go unrecorded. This leaves a team asking people around the organization, “What changed?” That process easily leads to 1+ hour mitigation times, far too long for applications that need the highest levels of resilience.
Don’t use these shortcomings as justification for getting rid of your change management processes and rollback automation. Rollbacks are still important, and you should still track changes to your system. You will eventually need to roll back the problematic change, even if you use another method to mitigate impact faster.
If you have a solution for the change-to-failure correlation problem, please contact me or leave a comment below. I’d like to hear about your solution.
Isolate Code & Config Failures
You are probably already using an incremental deployment solution that deploys your code live to a subset of your redundant application instances. In the diagram below, our application has three instances, and only one has the problematic change. We can mitigate the failure by removing the single afflicted host from service with a quick update to our load balancer.
If you use error rates or application health for load balancer routing decisions, the impaired host removal process happens quickly and automatically. Many load balancers remove unhealthy instances in ~1 minute from onset. There are many trade-offs to consider when using deep health checks to detect bad changes rather than a more typical health check configuration that detects only hard-down infrastructure failures. You can go deeper with this AWS Builder Library article. Check back for a future article detailing how to implement a fully working solution. The rest of this article covers the pertinent concepts for a better-than-rollback mitigation system.
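As a rough illustration of the deep health check idea, the sketch below tracks a rolling window of request outcomes and reports unhealthy once the error rate crosses a threshold. The class name, window size, and threshold are all illustrative assumptions; a real implementation would expose this result through an HTTP endpoint that your load balancer polls.

```python
from collections import deque

class DeepHealthCheck:
    """Track recent request outcomes; report unhealthy above an error-rate threshold.

    Hypothetical sketch: names and thresholds are illustrative, not a real API.
    """

    def __init__(self, window=100, threshold=0.05):
        self.window = deque(maxlen=window)  # 1 = error, 0 = success
        self.threshold = threshold

    def record(self, success: bool):
        self.window.append(0 if success else 1)

    def is_healthy(self) -> bool:
        # Healthy until we have evidence of an elevated error rate.
        if not self.window:
            return True
        return (sum(self.window) / len(self.window)) <= self.threshold
```

Because the check reflects real request outcomes rather than just process liveness, a bad change that raises error rates will mark only the afflicted instance unhealthy, and the load balancer removes it automatically.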
You can improve failure isolation beyond simple incremental host deployments in a few ways. First, you can use “phased deployments”, where your application instances are sub-divided into logical partitions. If you are deploying within AWS, grouping your application instances and deployments along Availability Zone (AZ) boundaries is an effective approach. AWS provides multiple AZs per Region; each AZ is designed to operate with datacenters and networks independent of the other AZs in the same Region. Using AZs as logical application partitions, you can deploy changes incrementally to a single AZ “phase” at a time. When a bad change is deployed, it will only impact application instances in a single AZ.
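A phased deployment loop can be sketched in a few lines. This is a minimal illustration, not a real deployment system; `deploy_fn` and `health_fn` are hypothetical callbacks standing in for your pipeline’s deploy step and post-deploy health evaluation.

```python
def phased_deploy(phases, deploy_fn, health_fn):
    """Deploy to one AZ 'phase' at a time, halting at the first unhealthy phase.

    phases    -- ordered list of partition names (e.g. AZs)
    deploy_fn -- pushes the change to one phase (hypothetical callback)
    health_fn -- returns True if the phase is healthy after the deploy
    """
    completed = []
    for phase in phases:
        deploy_fn(phase)
        if not health_fn(phase):
            # Stop here: the bad change is confined to this single phase.
            return {"status": "halted", "failed_phase": phase, "completed": completed}
        completed.append(phase)
    return {"status": "success", "completed": completed}
```

The key property is that the loop never proceeds past an unhealthy phase, so a bad change reaches at most one AZ.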
Phased deployments are great for isolating changes that directly impact an application, but unisolated failures can still occur through dependencies. Imagine we are updating an API. We deploy to a single AZ, and all application instances pass their tests and health checks. However, an upstream service that depends on our API is suddenly impacted: all of its instances experience elevated error rates, because its requests are load balanced equally across our three AZs and one of those AZs is unhealthy. You can isolate these dependent application failures to a single AZ by siloing. With siloing, upstream applications depend only on application instances located within the same AZ. This comes with a trade-off: managing capacity gets a little more complicated, and you get fewer points of recovery in your overall system. The benefit of siloing is that a wider range of change-related failures and infrastructure failures are isolated to a single AZ instead of expanding into whole-system impact.
You can also isolate failures caused by time-bomb-type Code & Config changes by using AZ-phased fleets. Things like certificates, security credentials, and other artifacts with an expiry time can be deployed to AZ-scoped fleets with jittered expiry. For example, you can generate one SSL certificate per AZ, each with an expiry time staggered by 1 week. If you forget to deploy an updated cert, the expiry will only impact one of your AZs and will leave you enough time to regenerate all of the others before they expire.
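The staggering itself is simple arithmetic. The sketch below (function name is hypothetical) computes one expiry per AZ-scoped certificate, each offset by a fixed stagger:

```python
from datetime import datetime, timedelta

def staggered_expiries(base_expiry, num_azs, stagger_days=7):
    """Return one expiry timestamp per AZ-scoped certificate.

    Each AZ's cert expires stagger_days after the previous one, so a missed
    renewal takes out at most one AZ and leaves time to rotate the rest.
    """
    return [base_expiry + timedelta(days=i * stagger_days) for i in range(num_azs)]
```

With three AZs and a 1-week stagger, even a completely forgotten renewal gives you a week of warning (one degraded AZ) before the next cert expires.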
Making changes to one AZ at a time is a rule that is relatively easy to remember. Even when your engineers make manual changes, following the one AZ at a time rule can go a long way toward isolating failures to a single AZ.
Mitigate Isolated Code & Config Failures
Automatically Remove Impaired Partitions
With failures isolated to a single AZ, you can mitigate client impact by removing the problematic AZ from service. This approach dramatically simplifies the correlation problem: you don’t need to know which change caused the problem, only which AZ is experiencing elevated error rates.
An easy-to-implement automated solution can detect an impaired AZ and remove it from service in less than 5 minutes. CloudWatch alarms, Lambda functions, and Route 53 Application Recovery Controller can be linked together with minimal code. You can also build your own solution using DNS primitives to control which AZs are in and out of service.
If you use different cloud providers or different load balancers, the concept is the same. Isolate the scope of your changes so they impact at most one logical partition. Use automated mechanisms to detect and remove impaired partitions from service to recover from change-related failures in less than 5 minutes.
You can reuse this approach to protect your application from infrastructure failures when you align logical partitions with physical infrastructure partitions, like an Availability Zone or a datacenter.
Retry Client Requests Using a Different Partition
For even faster recovery, you can build deterministic retry. Typical retry logic in a client application retries a request when it encounters certain types of errors. This retry is randomly load balanced across all hosts, so there’s a chance it lands right back on the same impaired partition and encounters the same failure as the initial request. With deterministic retry, the second request selects a different partition. This second request will succeed when failures are isolated to a single AZ. Instead of a recovery time of minutes, each impacted request can recover on the first retry attempt. For many web applications, this could be as fast as a couple of seconds.
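The core of deterministic retry is excluding the partition that just failed. The sketch below assumes a hypothetical `send_fn` that dispatches a request to a named partition and raises on error:

```python
import random

def call_with_deterministic_retry(partitions, send_fn):
    """Try a random partition; on failure, retry against a *different* partition.

    partitions -- list of partition names (e.g. AZ identifiers)
    send_fn    -- hypothetical callable that sends the request to one partition
                  and raises an exception on failure
    """
    first = random.choice(partitions)
    try:
        return send_fn(first)
    except Exception:
        # Deterministic retry: exclude the partition that just failed, so the
        # retry cannot land back on the same impaired AZ.
        second = random.choice([p for p in partitions if p != first])
        return send_fn(second)
```

When failures are isolated to a single AZ, the retry is guaranteed to hit a healthy partition, so each impacted request recovers in one round trip instead of waiting minutes for the AZ to be pulled from service.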
Maintain Sufficient Redundant Capacity
When an impaired application partition is removed from service, client workloads shift to the remaining in-service application partitions. In our example, we’re using three AZs as application partitions and we isolate problematic changes to a single AZ. When problems occur, we automatically remove a single impaired AZ from service. The remaining two AZs must have sufficient capacity to handle the subsequent increase in workload.
We can determine the expected workload increase by looking at the ratio of out of service to in-service AZs. In our example, we have 1:2 out of service to in-service AZs, which means we need 50% spare capacity in every AZ. If we had four partitions and removed one, we’ll have 1:3 out of service to in-service and will need 33% spare capacity. If we want enough spare capacity to remove two AZs, the ratio method still works. For example, with two out of service and five in-service (2:5), we’d need 40% additional capacity in each AZ.
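The ratio method above is a one-line calculation. This helper (name is illustrative) returns the spare-capacity fraction each in-service partition needs:

```python
def spare_capacity_fraction(out_of_service, in_service):
    """Spare capacity each in-service partition needs, as a fraction of its
    normal load, to absorb the workload of the removed partitions."""
    return out_of_service / in_service
```

For example, removing 1 of 3 AZs (1:2) requires 50% spare capacity per remaining AZ, 1 of 4 (1:3) requires 33%, and 2 of 7 (2:5) requires 40%, matching the worked examples above.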
Be careful when using autoscaling mechanisms to deal with the increase in workload. Autoscaling may or may not work, depending on your target recovery times and your tolerance for elevated error rates. If autoscaling reacts too slowly, your application may become overloaded, leading to elevated error rates. For the fastest recovery with little to no errors, pre-scale your spare capacity using the ratio method above.
If you don’t pre-scale and plan to use autoscaling, be aware that it can take ~10 minutes to scale up your fleet. AWS Auto Scaling warm pools can decrease scaling times to just a few minutes, but during this window your application may be overloaded immediately after you remove a partition from service.
Your default autoscaling step-up policy may limit you to adding just a few hosts at a time, which means it will take multiple iterations and tens of minutes more to scale up your remaining partitions to handle the removal of one partition. Increase your step-up values as needed to ensure your application autoscales with just one iteration to reduce recovery times. Check and test your scaling policies to ensure they can handle a scale-up percentage based on your out of service to in-service ratio calculation.
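To size a single-iteration step-up, you can apply the same out-of-service to in-service ratio to your current host count. The helper below (name is illustrative) computes how many hosts the remaining partitions must add in one scaling action:

```python
import math

def required_step_up(in_service_hosts, out_of_service, in_service):
    """Hosts to add across the remaining partitions in ONE scaling iteration.

    in_service_hosts -- total hosts currently serving traffic
    out_of_service / in_service -- the partition ratio from the section above
    """
    return math.ceil(in_service_hosts * out_of_service / in_service)
```

So a fleet with 20 in-service hosts losing 1 of 3 AZs (1:2) must add 10 hosts in a single step; if your policy only adds 2 hosts per iteration, recovery will take five iterations and far longer than 5 minutes.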
Prevent Failure Expansion
Failure isolation, redundant capacity, and automatic removal of impaired application partitions lead to faster, more reliable recovery than rollback mitigations. However, you must add further protections to preserve your redundant capacity and to prevent propagation of problematic changes, using “Blockers”. Blockers are pre-condition checks performed before an action.
Blocker to prevent propagation of a bad change: A deployment system should push changes only if the target application is healthy, regardless of your mitigation strategy. A deployment pipeline should check that there are no production alarms before each incremental step of deployment to prevent a problematic change from being pushed to more and more hosts. If you don’t have this type of blocker, a bad change could easily be pushed to your entire fleet, causing much larger impact than if just a few hosts had been updated. If the change is pushed to all hosts, rollback is the only viable mitigation.
Blocker to protect critical capacity from potentially problematic changes: A deployment system should also check that the target application has sufficient redundant capacity before beginning a deployment. If you are using three AZs as logical failure partitions, then all three AZs should be in service for a deployment to proceed. If one of your AZs is out of service, deployments should be blocked. If a deployment proceeds with one AZ out of service, you won’t have sufficient capacity to remove an AZ from service to mitigate change-related failures, and there’s a good chance you won’t have enough capacity to support client retries either. Without sufficient redundant capacity, your only mitigation option would be rollback.
Blocker to protect critical capacity from removal: Any system that removes hosts from service should block when there is insufficient redundant capacity. If you normally run in three redundant AZs and have enough capacity to lose one, the system should only be allowed to remove capacity when all three AZs are in service. This kind of blocker also prevents host-removal automation from running out of control and taking all of your application instances out of service.
Blocker to prevent capacity removal or deployments during capacity exhaustion: Your systems should prevent host removal during any capacity exhaustion event, whatever the cause. This ensures that if a flash crowd consumes all available capacity and causes application errors, the automated recovery system won’t remove capacity and make the failure worse. Deployments should block for the same reason, as deployment processes often take some capacity offline.
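The blockers above reduce to a small set of pre-condition checks evaluated before any deployment step proceeds. This sketch is illustrative; the function name, inputs, and messages are hypothetical stand-ins for signals your pipeline would read from its monitoring and fleet-state systems.

```python
def deployment_allowed(active_alarms, azs_in_service, required_azs, capacity_exhausted):
    """Evaluate the 'blocker' pre-conditions before a deployment step.

    active_alarms      -- list of firing production alarms (blocker 1)
    azs_in_service     -- partitions currently serving traffic (blocker 2)
    required_azs       -- partitions that must be healthy to proceed
    capacity_exhausted -- True during a capacity exhaustion event (blocker 4)
    Returns (allowed, reason).
    """
    if active_alarms:
        return (False, "production alarms active")
    if azs_in_service < required_azs:
        return (False, "insufficient redundant capacity")
    if capacity_exhausted:
        return (False, "capacity exhaustion event in progress")
    return (True, "ok")
```

The same checks, minus the alarm gate, would guard host-removal automation so it never takes the fleet below its minimum redundant capacity.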
Root Cause then Rollback
You may lose an opportunity to get to root cause when you use automatic rollback. When failures are mitigated without rolling back, you buy yourself time to troubleshoot the issue. Some failures only occur in production. With the impaired partition out of service, you can take the time you need to identify the root cause. Once you have a high-confidence root cause, you can roll back the change and put the repaired partition back into service.
Some changes cannot be scoped to a single partition of your application. In these cases, rollback may be your only mitigation option if the change goes badly. So even after you’ve implemented a faster and more reliable primary mitigation, you’ll want to keep your rollback processes in good working order.
Attain higher levels of resilience and faster recovery times by building fault isolation into your service and mitigating failures by removing impaired application instances from service. This technique can give you recovery times of less than 5 minutes while protecting you from a wide range of Code & Config and infrastructure failures. Failure mitigation times can drop to ~2 seconds or less with deterministic client retry logic. These mitigations are more reliable than rollback because they don’t depend on identifying the problematic change. You only need to identify unhealthy application instances and reroute clients to healthy ones.