This site shares practical advice based on first principles to help organizations build resilient distributed systems. Distributed systems are complex. Banks, hospitals, supply chains and social communication all use interdependent distributed systems technology. Our world depends on them.
Today, we lack tools and methods to engineer distributed systems for best-possible resilience. Too often, teams build distributed systems assuming the networks, infrastructure and services they depend on will operate with little to no failure. Teams incorrectly assume that using cloud services or modern service platforms like Kubernetes will handle failures for them. Achieving right-sized resilience is even more challenging, because setting resilience objectives and aligning investments of time and money with them requires complex, high-judgment decisions.
A first-principles approach is one based on core underlying assumptions about the nature of distributed systems. Theorems like CAP, or my preferred expanded model PACELC, translate two core assumptions about distributed systems into trade-off decisions most distributed systems engineers will encounter, even if they don’t know it. The core assumptions of CAP are that distributed systems involve two or more computers communicating over networks, and that these communications can be no faster than the speed of light. Distributed systems engineers must choose two out of three from Consistency, Availability and Partition tolerance. That choice wouldn’t exist if not for communication latency between computers.
Theorems are good examples of first-principles approaches, but there isn’t much to help organizations apply these principles. How do varying amounts of latency affect their trade-off decisions? Can an application meet its throughput and resilience goals running in data centers separated by 2,000 miles? This site will help you answer these questions and more.
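As a rough illustration of the 2,000-mile question, the speed of light sets a floor on round-trip time that no amount of engineering can remove. The sketch below computes that floor in Python. The function names are hypothetical, and the two-thirds velocity factor for optical fiber is a common approximation rather than a measured value. Real networks are slower than this bound, because routes are not straight lines and switching equipment adds delay.

```python
# Standard physical constants.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum
KM_PER_MILE = 1.609344

# Assumption: light in optical fiber travels at roughly two thirds of its
# vacuum speed. The exact factor depends on the fiber.
FIBER_VELOCITY_FACTOR = 2 / 3


def min_round_trip_ms(distance_miles: float, velocity_factor: float = 1.0) -> float:
    """Lower bound on round-trip time, in milliseconds, over a straight-line path."""
    distance_km = distance_miles * KM_PER_MILE
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_PER_S * velocity_factor)
    return 2 * one_way_s * 1000


def max_sequential_round_trips_per_s(rtt_ms: float) -> float:
    """Upper bound on round trips per second for one client that waits for each reply."""
    return 1000 / rtt_ms


if __name__ == "__main__":
    vacuum_rtt = min_round_trip_ms(2000)
    fiber_rtt = min_round_trip_ms(2000, FIBER_VELOCITY_FACTOR)
    print(f"Vacuum lower bound: {vacuum_rtt:.1f} ms")
    print(f"Fiber lower bound:  {fiber_rtt:.1f} ms")
    print(f"Sequential round trips per second: {max_sequential_round_trips_per_s(fiber_rtt):.0f}")
```

Under these assumptions, the floor is roughly 21 ms in a vacuum and roughly 32 ms in fiber, so a single client that waits for a cross-site acknowledgment before each operation cannot exceed about 31 operations per second. Any design that replicates synchronously across that distance pays at least this cost on every write, which is the kind of trade-off PACELC describes.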
Nathan offers technical reviews and consulting for organizations, including those designing for 99.999% uptime, multi-Region, disaster recovery or active-active systems.
Nathan has more than 20 years of experience building and operating Internet-scale distributed systems, including 10 years at Amazon Web Services. While at AWS, he launched new services and major features critical to the resilience and security of customer applications. Nathan launched an AWS initiative to help customers engineer for 99.999% or greater uptime. Nathan and his team built new specialized AWS services and advised customers who were migrating their most critical applications to the cloud. Nathan specializes in application designs that require advanced fault isolation and automated failure recovery, including multi-Region and high-throughput applications, like those used in financial securities trading and settlement or large-scale multi-tenant systems. Nathan is an expert on design, root cause analysis, corrective actions and operations for distributed systems applications.