Theory

Five Categories of Failure

Failure categories are a simplified way to understand distributed system failures and what to do about them. Without failure categories, failure mode analysis and system design tend to work from a list of thousands of specific failures, which leads to one-off approaches to failure detection and mitigation. Failure categories can help you design a few common approaches that detect and mitigate a wide range of failures.

Traditional Failure Analysis

Even simple distributed systems are extremely complex. A single transaction may use hundreds of computers and many networks. Distributed systems need DNS names, SSL certificates, a myriad of security credentials, and layers of software and networked devices connecting everything together. Any of these components can fail and impact an application, so there are a great many ways for a distributed system to fail. Organizations that have built and operated distributed systems for any length of time have long lists of failure modes and what to do about them.