Risk Analysis

Five 9s isn't enough

Go beyond 9s service level objectives, recovery time objectives and recovery point objectives. Improve your resilience decisions with failure impact narratives.

Most organizations set some form of resilience objective using service level agreements (SLAs) expressed in 9s notation, such as 99.999% ("five 9s") availability. Some organizations formalize recovery time objectives (RTOs), which specify how long an application should take to recover, or recovery point objectives (RPOs), which specify how much data can be lost in the event of failure. A few organizations write pages of detailed non-functional resilience requirements. SLAs, RTOs, RPOs and requirements do not provide enough information to decide how much effort and money an organization should spend on resilience.
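
To make the 9s notation concrete, here is a minimal sketch that converts a number of 9s into an availability fraction and a yearly downtime budget. The function names are illustrative, and the calculation assumes a 365-day year and treats every minute of the year as in scope for the SLA; real agreements often exclude maintenance windows or measure over shorter periods.

```python
# Assumption: a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_from_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 1 - 10 ** -nines


def downtime_budget_minutes(nines: int) -> float:
    """Return the downtime allowed per year, in minutes, for a given number of 9s."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return MINUTES_PER_YEAR * 10 ** -nines


if __name__ == "__main__":
    for n in range(2, 6):
        pct = availability_from_nines(n) * 100
        budget = downtime_budget_minutes(n)
        print(f"{n} nines ({pct:.3f}%): about {budget:.2f} minutes of downtime per year")
```

Under these assumptions, five 9s allows roughly 5.26 minutes of downtime per year and three 9s allows roughly 8.76 hours. The number describes how much unavailability is tolerated, but it says nothing about what a given failure would cost, which is the gap this article is pointing at.
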
Five Categories of Failure

Failure categories are a simplified approach to understanding distributed systems failures and what to do about them. Without failure categories, failure mode analysis and system design tend to work from a list of thousands of specific failures, which leads to one-off approaches to failure detection and mitigation. Failure categories can help you design a few common approaches that detect and mitigate a wider range of failures.

Traditional Failure Analysis

Even simple distributed systems are extremely complex. A single transaction may use hundreds of computers and many networks. Distributed systems need DNS names, SSL certificates, a myriad of security credentials, layers of software and layers of networked devices connecting everything together. Any of these components can fail and impact an application. There are a great many ways for distributed systems to fail. Organizations that have been building and operating distributed systems for any length of time have long lists of failure modes and what to do about them.