Linux has come a long way with tooling for simulating network failures. The traffic control (tc) tool has a wonderful bag of tricks for simulating total packet loss, intermittent packet loss, and delays. Tc even provides random distributions of packet loss for additional realism in testing. However, tc doesn’t support impairing a random percentage of network flows. As a reminder, a network flow is defined by a 5-tuple: source IP address, source port, destination IP address, destination port, and protocol.
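To make the gap concrete: netem can do things like `tc qdisc add dev eth0 root netem delay 100ms loss 10%`, but that impairs packets across the whole interface rather than a chosen subset of flows. Below is a minimal Python sketch (my own illustration, not part of tc) of how a packet-processing layer could deterministically select a percentage of flows by hashing the 5-tuple:

```python
import hashlib

def flow_is_impacted(src_ip: str, src_port: int,
                     dst_ip: str, dst_port: int,
                     proto: str, impact_pct: float) -> bool:
    """Deterministically select ~impact_pct% of flows by hashing the 5-tuple.

    Every packet in a flow hashes to the same bucket, so a selected flow
    stays impacted for its entire lifetime -- the per-flow behavior that
    netem's per-packet loss percentages can't reproduce on their own.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 10_000
    return bucket < impact_pct * 100  # e.g. 10.0 -> first 1,000 of 10,000 buckets

# Roughly 10% of distinct flows get selected; the rest pass untouched.
print(flow_is_impacted("10.0.0.1", 52341, "10.0.0.2", 5432, "tcp", 10.0))
```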
High availability clusters include systems like MySQL or PostgreSQL using synchronous replication. Any implementation of Raft or Paxos is a high availability cluster; these protocols power services like Aerospike, CockroachDB, Consul, etcd, and ZooKeeper. Clusters are often used behind the scenes, and there’s a good chance they are in your environment. For example, Kafka and Kubernetes both use cluster technology in their management layers. Even if you aren’t running high-throughput clusters, this article will help you understand how cloud networks and the Internet behave when failures occur.
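As a rough illustration of why these clusters come in odd sizes like 3 and 5, here is a small Python sketch of the majority-quorum arithmetic that Raft and Paxos are built on (the function names are mine, not from any product above):

```python
def quorum_size(nodes: int) -> int:
    """Majority quorum: the smallest group any two quorums must overlap in."""
    return nodes // 2 + 1

def tolerable_failures(nodes: int) -> int:
    """Nodes that can fail while the cluster can still form a quorum."""
    return nodes - quorum_size(nodes)

for n in (3, 5, 7):
    print(f"{n} nodes: quorum={quorum_size(n)}, "
          f"tolerates {tolerable_failures(n)} failure(s)")
# 3 nodes: quorum=2, tolerates 1 failure(s)
# 5 nodes: quorum=3, tolerates 2 failure(s)
# 7 nodes: quorum=4, tolerates 3 failure(s)
```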
Most organizations set some form of resilience objectives using service level agreements (SLAs) in 9s notation. Some organizations formalize recovery time objectives (RTOs) to determine how long it should take an application to recover, and recovery point objectives (RPOs) to determine how much data can be lost in the event of a failure. A few organizations write pages of detailed non-functional resilience requirements. But SLAs, RTOs, RPOs, and requirements alone aren’t enough information to decide how much effort and money an organization should spend on resilience.
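For concreteness, 9s notation converts directly into a downtime budget; a quick Python calculation:

```python
def downtime_budget_minutes(availability_pct: float,
                            period_hours: float = 365.25 * 24) -> float:
    """Minutes of allowed downtime for an availability target over a period."""
    return period_hours * 60 * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    print(f"{target}%: {downtime_budget_minutes(target):.1f} minutes/year")
# 99.9%:   ~526 minutes/year (~8.8 hours)
# 99.99%:  ~53 minutes/year
# 99.999%: ~5.3 minutes/year
```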
Traditional Failure Analysis

Even simple distributed systems are extremely complex. A single transaction may touch hundreds of computers and many networks. Distributed systems need DNS names, SSL certificates, a myriad of security credentials, layers of software, and layers of networked devices connecting everything together. Any of these components can fail and impact an application. There are lots and lots of ways for distributed systems to fail. Organizations that have been building and operating distributed systems for any length of time have long lists of failure modes and what to do about them.
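One lightweight way to keep such a list actionable is to store failure modes next to their mitigations. A hypothetical, heavily trimmed sketch (real catalogs run to hundreds of entries and usually live in runbooks, not code):

```python
# A hypothetical failure-mode catalog; every entry here is illustrative.
FAILURE_MODES = {
    "dns_resolution_failure": "fail over to secondary resolver; check TTLs",
    "ssl_cert_expired":       "deploy renewed certificate; alert on <30d expiry",
    "credential_expired":     "rotate credential; verify dependent services",
    "network_partition":      "shed traffic away from the isolated zone",
    "bad_deployment":         "roll forward with a fix or roll back",
}

def mitigation(failure_mode: str) -> str:
    return FAILURE_MODES.get(failure_mode, "page the on-call engineer")

print(mitigation("ssl_cert_expired"))
```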
Designing a monitoring solution for critical applications can be a little tricky, especially if your failure recovery automation depends on monitoring to detect and respond to failures. Critical applications are those that need to hit 99.999% or better uptime, are sensitive to error rates of 5% or lower, need to recover from a wide range of failures in less than 5 minutes, or must be resilient against unusual major failures like the total, unrecoverable loss of a datacenter.
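To sketch why detection speed matters at these targets: with a 5-minute total recovery budget, monitoring must notice a failure in a fraction of that time. Here is a minimal Python example of a sliding-window error-rate check; the 60-second window and 5% threshold are illustrative assumptions, not recommendations:

```python
from collections import deque
import time

class ErrorRateMonitor:
    """Track request outcomes over a sliding window and flag breaches."""

    def __init__(self, window_seconds: int = 60, threshold: float = 0.05):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, is_error) pairs

    def record(self, is_error: bool, now: float | None = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, is_error))
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

    def breached(self) -> bool:
        if not self.events:
            return False
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events) > self.threshold

monitor = ErrorRateMonitor()
for i in range(100):
    monitor.record(is_error=(i % 10 == 0), now=float(i) / 10)  # 10% errors
print(monitor.breached())  # True: 10% exceeds the 5% threshold
```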
Code & Config failures usually occur around a change. It might be a code deployment or a manual configuration change. Many organizations rely on rollback procedures to mitigate failures when problematic changes occur. If you are targeting 99.999% uptime or recovery times of less than 5 minutes, then a rollback isn’t an ideal mitigation. Rollbacks won’t reliably mitigate change related failures on <5 minute timelines. This article explains why rollbacks are too slow and unreliable to act as a primary mitigation mechanism for 99.