Risk Analysis

Most organizations set some form of resilience objectives using service level agreements (SLA) in 9s notations. Some organizations formalize their recovery time objectives (RTO) to determine how long it should take an application to recover or take recovery point objectives (RPO) to determine how much data can be lost in the event of failure. A few organizations write pages of detailed non-functional resilience requirements. SLAs, RTOs, RPOs and requirements aren’t enough information to decide how much effort and money an organization should spend on resilience.

Traditional Failure Analysis Even simple distributed systems are extremely complex. A single transaction may use hundreds of computers and many networks. Distributed systems need DNS names, SSL certificates, a myriad of security credentials, layers of software and layers of networked devices connecting everything together. Any of these components can fail and impact an application. There are lots and lots and lots of ways for distributed systems to fail. Organizations that have been building and operating distributed systems for any period of time have long lists of failure modes and what to do about them.

Five 9s isn't enough

Five Categories of Failure