Linux has come a long way with tooling for simulating network failures. The traffic control (tc) tool has a wonderful bag of tricks for simulating total packet loss, intermittent packet loss and delays. tc even provides random distributions of packet loss for additional realism in testing. However, tc doesn’t support impacting a random percentage of network flows. As a reminder, a network flow is defined by a 5-tuple: source and destination IP addresses, source and destination ports, and protocol.
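To make the difference between packet-level and flow-level impact concrete, here is a minimal Python sketch of one way to pick a fixed percentage of flows. It is an illustration only, not something tc provides: the function name `flow_is_impacted`, the `seed` parameter and the choice of SHA-256 are all assumptions made for this example. The idea is to hash the 5-tuple so that every packet belonging to a given flow gets the same verdict.

```python
import hashlib
import ipaddress


def flow_is_impacted(src_ip, dst_ip, src_port, dst_port, protocol,
                     percent, seed=""):
    """Return True if this flow falls inside the impacted percentage.

    Hypothetical helper: the verdict depends only on the 5-tuple and the
    seed, so it is the same for every packet of the flow.
    """
    # Normalize the addresses so equivalent spellings hash identically.
    key = "|".join([
        seed,
        str(ipaddress.ip_address(src_ip)),
        str(ipaddress.ip_address(dst_ip)),
        str(src_port),
        str(dst_port),
        protocol.lower(),
    ])
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=5 impacts buckets 0..499, i.e. roughly 5% of all flows.
    return bucket < percent * 100


if __name__ == "__main__":
    flow = ("10.0.0.5", "10.0.0.9", 49152, 5432, "tcp")
    print(flow_is_impacted(*flow, percent=5))
```

The sketch only decides which flows are selected. Applying loss or delay to those flows is still up to whatever impairment tooling you use, and changing the seed selects a different subset of flows.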
High availability clusters include things like MySQL or PostgreSQL using synchronous replication. Any implementation of Raft or Paxos is a high availability cluster, and such implementations sit inside services like Aerospike, CockroachDB, Consul, etcd and ZooKeeper. Clusters are often used behind the scenes, and there’s a good chance they are in your environment. For example, Kafka and Kubernetes both use cluster technology in their management layers. Even if you aren’t running high-throughput clusters, this article will help you understand how cloud networks and the Internet behave when failures occur.
Designing a monitoring solution for critical applications can be tricky, especially if your failure recovery automation depends on monitoring to detect and respond to failures. Critical applications are those that need to hit 99.999% or better uptime, are sensitive to error rates of 5% or lower, need to recover from a wide range of failures in less than 5 minutes, or must be resilient against unusual major failures like the total unrecoverable loss of a datacenter.
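To put the 99.999% figure in perspective, the short Python sketch below converts an availability target into a yearly downtime budget. The helper name `downtime_budget_minutes` and the use of a 365.25-day year are assumptions made for this illustration.

```python
# Minutes in an average year, assuming 365.25 days (an assumption of this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_percent,
                            period_minutes=MINUTES_PER_YEAR):
    """Return the downtime allowed in the period for a given availability."""
    return period_minutes * (1 - availability_percent / 100)


# Compare a few common availability targets.
for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime -> {budget:.2f} minutes of downtime per year")
```

At 99.999% the budget works out to roughly 5.26 minutes per year, so a single incident that takes more than about 5 minutes to detect and recover from can use up the whole year's allowance.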
Code & Config failures usually occur around a change. It might be a code deployment or a manual configuration change. Many organizations rely on rollback procedures to mitigate failures when problematic changes occur. If you are targeting 99.999% uptime or recovery times of less than 5 minutes, then a rollback isn’t an ideal mitigation, because rollbacks won’t reliably mitigate change-related failures on timelines under 5 minutes. This article explains why rollbacks are too slow and unreliable to act as a primary mitigation mechanism for 99.