Most organizations set some form of resilience objectives, often as service level agreements (SLAs) expressed in 9s notation. Some formalize recovery time objectives (RTOs) that define how long an application may take to recover, or recovery point objectives (RPOs) that bound how much data can be lost in the event of failure. A few write pages of detailed non-functional resilience requirements. But SLAs, RTOs, RPOs and requirements aren’t enough information to decide how much effort and money an organization should spend on resilience. They don’t explain the ‘why’ behind the requirements.
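As a quick reference, 9s notation translates directly into a downtime budget. A minimal sketch of that conversion (the availability tiers below are the standard SLA levels; the helper name is mine):

```python
# Convert an availability target in "nines" notation to allowed downtime.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} min/year")
# 99.999% allows roughly 5.3 minutes of downtime per year
```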
I’ve found most companies will tell you a story about why they set their resilience objectives the way they did. In my experience, those narratives are more useful than the objectives themselves. A good resilience narrative explains the ‘why’ and can help you align budgets and empower people to right-size the resilience of your application.
I’ve worked with organizations that claim an uptime goal of 99.999% (no more than ~5 minutes of downtime per year) only to balk at the idea of paying 2x more for pre-scaled redundant capacity for their applications. Without narratives that describe the impact of failure, it is difficult to assess whether something like pre-scaling capacity is right-sizing or overspending.
Understanding the impact of failure is a prerequisite to setting right-sized resilience objectives with properly aligned budgets. This article includes a set of questions I’ve found most useful when deciding how much to invest in resilience and how to go about meeting particular resilience objectives.
Tell a story about service impact
Resilience in distributed systems is a probability game, played without real data for by-the-numbers decisions. Distributed systems and cloud services are complex and change continuously. Even if the operating history of these systems were readily available (it is not), past performance isn’t a predictor of future performance. The would-be data would become outdated as soon as it was produced, because systems change continuously and with change come new ways for a system to fail. Because the data isn’t there, and won’t be, your resilience decisions will rely on human judgment: the judgment of people in your organization as well as vendors trying to sell you their services.
If you approach a cloud vendor and say, “My application needs to meet 99.999% availability” you will find many suitors. If instead you say, “My application needs to meet 99.999% availability, and if it is down for more than 5 minutes, the world economy will lose $10B per minute” cloud vendors may get a little more honest with you about the resilience capabilities that exist (or don’t) within their services. For everything in between, you might find cloud vendors are happy to sign you up and will later wash their hands of responsibility when their service causes your application to suffer a 4-hour outage. Resilience narratives can help you cut through the marketing fluff.
For your engineering teams, narratives provide more context to understand the nuance behind resilience objectives. These details help engineers design viable solutions to resilience risk. A good one page narrative can take the place of pages of detailed requirements.
For example, imagine an application that must meet an RTO of 15 minutes or less for a datacenter failure. Now add that the application is a data repository and central communication system for coordinating disaster response in the United States of America. The system is used to determine where to ship medical supplies, where to direct disaster response experts and how to triage individual patients during a national-scale disaster. National-scale disasters supported by this system include earthquakes, hurricanes, floods and even nuclear strikes. Some operations this application coordinates, like flight-for-life helicopter dispatch for patient relocation, are life-critical. Back-up coordination mechanisms exist, like radio and phone. However, without this application, triage becomes slower and more error-prone. For example, a helicopter may deliver a patient to a hospital with an over-capacity intensive care unit even while other nearby hospitals have spare capacity.
This resilience narrative snippet immediately tells engineers that a failure of the application could put lives at risk. There are some back-up systems, so application designers can reason that people aren’t likely to die immediately if the application has a 15 minute or longer outage, but the risk of death does increase. We can also see that the recovery-time requirement must hold even for events like nuclear strikes. With that context, redundant application infrastructure must be placed with nuclear strikes in mind: if one site is impacted by an event, the other sites should continue to operate.
Most businesses probably don’t need to be resilient to nuclear strike events. A good narrative makes clear who your customers are, how they use your service and what happens if your service experiences a prolonged disruption. For many businesses, the cost of an impact is calculable in terms of lost sales, SLA penalties or service credits. Other costs may include legal risk. For example, in Singapore, officers of some critical service infrastructure providers can be jailed for negligence if their service suffers a major availability impact.
All of these impact details can help your organization understand what is at stake when failures occur. This empowers people in your organization to make better decisions about how to design and support your application.
Questions to ask yourself
- Who are my direct customers? Another way of asking this is: Who are the people or organizations of people that I expect will use my application?
- Who are my indirect customers? Are there any indirect people or organizations that I expect will use my application? For example, if you are building a banking platform, banking organizations may be your direct customers, but their customers include individual consumers or business organizations.
- How do my direct and indirect customers use my application?
- What happens to my direct and indirect customers if my application is unavailable? Answers to these questions can help you align monetary and effort budgets. For example, learning that an outage costs your business $1M per minute can help you justify spending additional engineering effort or investment in additional redundant capacity.
- Will people get hurt?
- Will people lose money or livelihood?
- Will people lose their freedom by jail or inability to move freely?
- Will there be long term effects? For example, some organizations may have a hard time winning back customers after a big enough outage. They may suffer reputational damage that takes years to repair. Some banks may have legal limitations placed on what they can do with their business going forward.
- To what extent will people or organizations be hurt? How many could be hurt?
- Does my application have more critical and less critical functionality?
- What are the groups of less critical functions?
- What are the groups of most critical functions?
- When critical functions of my application fail, do my direct and indirect customers have access to reasonable back-up functionality by other means? In the example above, radio and phone provide redundant, reduced-functionality communication networks for coordinating a national medical response.
Align your investment expectations with the story of impact
No one wants a service outage, but failures will happen. Chasing the fastest recovery times, or even no-impact service designs, can consume a nearly unending amount of time: thinking up new failure modes, designing and implementing mitigations, and continuously testing to validate that the mitigations work as expected. If you need bespoke solutions and a team of distributed systems experts dedicated to keeping your application running, it will cost you more money. If you need the highest levels of resilience, you will pay more for redundant storage, network and compute capacity. These costs increase with application complexity, scale and resilience needs.
Not all services need maximum investment, but some do. This is where an impact narrative with costs and penalties can inform the level of investment an organization should make. Below I describe a few common tiers of investment based on recovery times. The impacts at particular time intervals can help you determine how much effort and money to spend.
Questions to ask yourself
What is the range of cost impact for failures of varying outage durations? Different customer impacts emerge at different severities and durations. For example, a web application with a <1% error rate for 6 hours probably won’t be noticed by many customers. We’ll dig into error rates and latencies a little later; for this section, assume we’re talking about a total outage. You can expect different impacts to emerge for a ~5 minute total outage versus a multi-hour total outage. Understanding these impacts can help you identify right-sized failure mitigation strategies for your application. The choice of strategies spans from relying on operators to respond to most failures to investing more to build an application with automated mitigations that protect against a wide range of failures.
- What impacts could occur if your application is down for more than 1 day? If an application takes longer than 1 day to recover, it could be because an experienced operations team isn’t readily available to respond to a failure. Staffing a more experienced team with 24x7 support can shift >1 day outages to something ranging from ~1 to ~4 hours. However, impacts that cause large scale data corruption could take multiple days of data processing to recover, even with experienced teams operating 24x7. Reducing mitigation times for data corruption type failures requires complex application engineering effort and could include ~2x or ~3x more storage and compute costs as compared to a minimally redundant design.
- What impacts could occur if your application is down for more than ~4 hours? A 24x7 operations team familiar with the application can often diagnose and mitigate tricky issues in this time frame. Tricky issues include failures which lack a predetermined mitigation procedure or failures that require less than ~1 hour of data repair time. This timing also assumes that teams have access to the resources they need to quickly mitigate an issue, including things like spare capacity and the ability to create and deploy a code patch or configuration change in ~15 minutes to ~1 hour.
- What impacts could occur if your application is down for more than ~1 hour? Keeping a wide range of failures to less than 1 hour in duration requires thoughtful application designs and predetermined operator-driven mitigations. Consistently meeting this target requires ongoing investment to ensure operations teams have fresh experience running all critical recovery procedures. The resilience features of an application and operator tooling require regular use to stay in working order. Teams will need a good set of monitoring and metrics to quickly identify and diagnose an application failure. Hitting a <1 hour recovery time includes ~15 minutes to engage an operator, ~5 to ~15 minutes to diagnose the failure and identify the appropriate mitigation and ~5 minutes to ~30 minutes to implement the mitigation.
- What impacts could occur if your application is down for ~5 minutes? Keeping a wide range of failures to ~5 minutes of impact requires built-in failure prevention and fully automated failure mitigations. The challenge is that automated mitigation designs are complex, and that complexity introduces its own failure modes. Still, some applications truly need this level of resilience. With the right people and the right level of investment, recovery times of ~5 minutes, and even no-impact 0-minute recovery times, are achievable.
Off-the-shelf tooling, like Kubernetes and load balancers, will let you meet ~5 minute recovery times for a narrow band of failures, mostly hard-down host infrastructure failures. It will not protect you from large-scale infrastructure failures like the loss of a datacenter, client-originated failures like flash crowds or poison pills, or failures that lead to low-level elevated error rates within your application. Getting ~5 minute recovery times for these types of failures requires bespoke application design and ongoing expert support.
- How are my customers impacted by multiple impacts per year? Failures are a probability game, which means it is possible for an application to experience the equivalent of a 100-year flood two or more times in a year. How does the impact change for multiple annual occurrences?
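The multiple-occurrence math is easy to sketch. If failures arrive independently, a Poisson model gives the chance of a “100-year flood” happening twice in one year; the 0.01/year rate and the independence assumption are both illustrative simplifications:

```python
import math

# Probability of k independent rare failures in a year, modeled as a
# Poisson process with an assumed rate of one event per 100 years.
def poisson_pmf(k: int, rate: float) -> float:
    return math.exp(-rate) * rate**k / math.factorial(k)

rate = 0.01  # expected "100-year flood" events per year (assumption)
p_two_or_more = 1 - poisson_pmf(0, rate) - poisson_pmf(1, rate)
print(f"P(>=2 events in one year) = {p_two_or_more:.2e}")
```

Small, but not zero: over enough years, and across enough applications, back-to-back rare failures do happen.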
If you don’t know the cost of impacts, you can come at this the other way and ask, “What am I willing to pay to recover in less than X?” In my experience, it is more difficult to produce an objective answer this way: it is easy to bias toward overspending out of fear of any outage, or toward underspending on things that don’t visibly add value for your customers. Still, this method can sometimes be a shortcut to identifying misalignments.
This set of questions helps a team choose resilience strategies that best match the cost of impact. However, no matter how much investment is made, impacts can still occur. I once heard a technical leader remark on a team’s proposed SLA, “100% SLA, what does that really mean? Even the earth can’t provide a 100% SLA.”
The answers to these questions can be turned into narratives that explain application resilience designs or operating procedures. For example, “In the event of a major disaster, a 1 hour application outage will delay deployment of critical medical supplies by 4 hours, and can result in loss of life if things like deployment of blood for transfusions are delayed. To ensure our operations teams and the failure mitigations in our applications are always ready, we practice all application recovery procedures on a quarterly basis.”
Narratives can be further extended to explain key investment decisions, “We maintain redundant network, storage and compute in three data centers. We run all three in active-active configuration so that we know immediately if any of the three stacks becomes impaired. We maintain 50% additional spare capacity in each datacenter so that if we lose one, we are guaranteed to have enough capacity to host the workload in the remaining two datacenters.”
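The 50% figure in a narrative like that falls out of simple arithmetic: each of N active sites normally carries 1/N of the load, and after losing one site each survivor must carry 1/(N-1). A sketch of that calculation (the helper name is mine):

```python
# Spare-capacity arithmetic for surviving the loss of one datacenter.
def headroom_per_site(n_sites: int) -> float:
    """Fraction of extra capacity each site needs for N-1 survival.

    Each site normally carries 1/n of the load; after losing one site,
    each survivor must carry 1/(n-1). The required headroom is the
    ratio between the two, minus 1.
    """
    return (1 / (n_sites - 1)) / (1 / n_sites) - 1

print(f"3 sites: {headroom_per_site(3):.0%} spare per site")  # 50%
print(f"2 sites: {headroom_per_site(2):.0%} spare per site")  # 100%
```

Note the trade-off this exposes: a two-site design needs 100% spare capacity per site, while adding a third active site drops the per-site headroom to 50%.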
Fine tune customer impact stories for higher quality of service
You have a wide range of choices in application design and failure mitigations. The better you understand your system’s tolerance for error rates, latencies and data inconsistency, the better you will be able to invent application-specific mitigations. This is admittedly more of an art than a science due to the complexity involved. Still, there are a few overall parameters that are very useful for understanding application constraints and the kinds of solution designs you can use to improve application resilience. As an example, an application that can tolerate some staleness in data can implement application-local data caches to keep read operations running even if an underlying database fails. An application with clients that implement retry logic can tolerate some level of error rate before the clients experience an availability impact. You might not know the answers to these questions; that’s ok. Some answers may become apparent over time as clients use your application.
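The application-local cache idea can be sketched in a few lines. This is a hypothetical illustration, assuming reads can tolerate up to 30 seconds of staleness; `fetch` stands in for whatever database call your application makes:

```python
import time

class StaleTolerantCache:
    """Serve a recent cached value when the backing store is unavailable."""

    def __init__(self, fetch, max_staleness_s: float = 30.0):
        self.fetch = fetch                # function: key -> value (may raise)
        self.max_staleness_s = max_staleness_s
        self._entries = {}                # key -> (value, fetched_at)

    def get(self, key):
        try:
            value = self.fetch(key)       # try the database first
            self._entries[key] = (value, time.monotonic())
            return value
        except Exception:
            # Database failed: fall back to a cached value if it is
            # within the staleness bound the application can tolerate.
            if key in self._entries:
                value, fetched_at = self._entries[key]
                if time.monotonic() - fetched_at <= self.max_staleness_s:
                    return value
            raise  # no tolerable fallback; surface the failure
```

In a real system you would also bound the cache size and consider pre-warming it, but even this shape keeps reads flowing through a short database outage.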
Questions to ask yourself
- What is the impact of processing client requests out of order? Is there any out-of-order processing that would cause customer impact? As an example, a payment processing application that processes debits and credits to an account out of order could overdraw accounts, or deny access to funds that accounts actually have.
- What is the impact of data loss? Is loss of a single transaction a must-not-happen event, or can you survive the loss of 30 seconds of data in a major (once in four years) failure? You may be tempted to say no single transaction can be lost. You can implement systems that meet this goal, but it comes at a great cost to transaction throughput and availability. Most applications can tolerate a little data loss in the event of a total, unrecoverable loss of a datacenter, and having that tolerance allows for higher availability and higher throughput application designs. Choosing a more relaxed 30 seconds of potential data loss lets you meet higher throughput and availability goals, but requires additional network, storage and compute costs to store live redundant data replicas.
- Is there any place where operating on stale data would cause customer impact? If your clients can tolerate a few seconds of delay in seeing updates, you can cache data and use eventually consistent datastores. Both of these types of stateful systems can achieve higher throughput and higher availability than systems that cannot tolerate any staleness. Most applications can tolerate some data staleness.
- At what point does service response latency change from an inconvenience to an availability impact? For many web applications, >5 second response times can turn into an availability impact. For high-throughput payment applications, latency increases of just hundreds of milliseconds can turn into an availability impact. Knowing this tolerance can help you set recovery time objectives for automatic mitigations.
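The retry-tolerance point above is also quantifiable. If request failures are independent (roughly true for transient errors, not true for correlated outages), a retrying client masks most of a low server-side error rate; the rates and attempt counts below are illustrative:

```python
# How much server-side error rate a retrying client can absorb,
# assuming each attempt fails independently with the same probability.
def client_observed_failure_rate(server_error_rate: float, attempts: int) -> float:
    """A request only fails if every attempt fails: error_rate ** attempts."""
    return server_error_rate ** attempts

# A 1% server-side error rate with up to 3 attempts per request:
print(f"{client_observed_failure_rate(0.01, 3):.6%}")  # 0.000100%
```

This is why a brief <1% error rate often goes unnoticed: clients with retries paper over it. The same math also warns that retries amplify load during a real outage, so they should be paired with backoff and retry budgets.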
Build an organization and a service around your resilience story
How does your application compare to most other distributed systems around the world? If your application is at or below typical complexity and scale, you’ll find readily available tools and won’t need as many experts dedicated to supporting it. You may even be able to use free software to build your application and free services to host it.
The more uniquely complex, high-scale or resilience-demanding your application is, the more you will need to invest in a team of dedicated experts to build and support it. Unique applications require bespoke solutions. You may also need to spend more money on additional redundant storage, network and compute capacity for your application.
Questions to ask yourself
- Is my application closer to a static web site with data that infrequently changes or is it more like a real-time bidding system where data is changed often and subsequent transactions depend on prior transactions?
- Can my application be built with a wide-range of 3rd party libraries, or must I develop a large amount of unique software?
- Is the functionality in my system simple and all operations similar or does my application provide a wide-range of functionality with some operations that are very simple and other operations that require a great deal of computation?
- Will my application see a few hundred requests per second, or hundreds of thousands of requests per second?
- Will my application see just a few update transactions per second or more than 10k transactions per second?
- Is my application data ephemeral or easy to recreate or must every data transaction be stored durably, even in the event of total loss of a datacenter?
- Is my application something like a daily batch processor where a few hours of outage won’t cause much or any impact or is it a real-time system where even a second of outage is problematic?
- Is my application a typical enterprise business application, or must I meet critical infrastructure resilience requirements, such as running across multiple redundant datacenters separated by not less than 1,609 km (1,000 miles)?
Answers to these questions can turn into helpful resilience narratives that keep your team aligned. As an example, “Our application is a securities exchange that must process 6k transactions per second, synchronously ordered, with zero data loss in the event of a total loss of one of two datacenters located on either side of the United States. The latencies involved and our strict data consistency requirement limit transaction throughput. None of the database applications we tested were able to hit our throughput targets. As a result, we decided to use a low-level software package to help us build our own multi-region cluster, and we wrote our own in-memory transaction processing solution.”
This snippet of a resilience narrative will help engineers come up to speed on your application and why you chose to implement complex clustering strategies rather than just using a cloud vendor database. It also helps everyone know when a design decision can be revisited: if at some point a 3rd party datastore hits the throughput targets between remote datacenters while maintaining strict transaction ordering, it could be swapped in, and you would no longer have to maintain your own one-off implementation.
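The throughput limit in that narrative comes straight from network latency. A back-of-envelope sketch, assuming a ~70 ms round trip between US East and West coast datacenters (an illustrative figure, not a measurement):

```python
# Why strict ordering plus synchronous coast-to-coast replication
# limits throughput, and why batching is the usual escape hatch.
RTT_S = 0.070  # assumed round trip between the two datacenters (seconds)

# If every transaction must commit in strict serial order, each commit
# waits a full round trip for the remote acknowledgment:
serial_tps = 1 / RTT_S
print(f"serial commits: ~{serial_tps:.0f} TPS")

# Reaching 6,000 ordered TPS therefore requires grouping many
# transactions into one replicated, ordered batch per round trip:
target_tps = 6000
txns_per_batch = target_tps * RTT_S
print(f"batch size needed: ~{txns_per_batch:.0f} transactions per round trip")
```

Arithmetic like this is often enough to explain to a new engineer why an off-the-shelf synchronous-replication database could not hit the target without application-level batching or pipelining.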
Focus on creating narratives around the most unusual aspects of your application. When you find you have unusual needs, make sure you’ve got the right experts in place to support those unique aspects. For the example above, you’ll want a few people with experience building applications on clustering technology, including things like Paxos leader election and distributed locks.
Investing for the highest level of resilience
The world’s most critical applications are shifting away from traditional mainframe systems and toward modern service-oriented distributed systems, often using cloud service offerings from Amazon AWS, Microsoft Azure or Google Cloud. The reasons are manifold. During COVID, some banks were unable to scale to meet the surge in demand for Internet banking, and there are hard limits to how much you can scale up legacy mainframe systems. Mainframe customers are also challenged to find people who can write legacy mainframe code, which limits their ability to apply security updates, fix bugs or build new capabilities. Many enterprises are addressing these concerns by modernizing their application stacks and looking toward the cloud to reduce costs and increase agility while maintaining very high levels of availability for their critical applications.
Today, a very small number of software-as-a-service solutions consistently meet 100% or even 99.999% uptime year over year. Even services that haven’t had a major outage are still susceptible to prolonged multi-hour outages. For example, Facebook suffered a 5.5 hour global outage that cost the company an estimated $60M in advertising revenue. Even though Facebook runs many redundant datacenters and networks world-wide, a bad change caused their system to propagate a routing change that made all of their services unavailable. An application that truly needs 99.999% uptime across a wide range of failures, including total loss of a datacenter or elevated error rates caused by network or service impairment, must manage its own redundancy and failure mitigation rather than consuming software as a service as-is in a non-redundant fashion. Resilience narratives are the first of many steps needed to design a right-sized resilience solution.