How Cloud Networks Fail and What to Do About It

Massively scaled cloud networks are composed of thousands of network devices. Network failures (usually) show up in tricky ways that your monitoring won't detect. These failures (usually) lead to minor client impacts. However, some critical applications, like database clusters that require strict data consistency, are sensitive to even these minor disruptions. This article explains the why behind the funny behaviors you might have noticed in your applications running in the cloud or over the Internet, and offers recommendations for improving application resilience so you can detect and mitigate these (usually) minor failures in the massive networks your applications depend on.


High-availability clusters include things like MySQL or PostgreSQL using synchronous replication. Any implementation of Raft or Paxos is also a high-availability cluster; these protocols sit at the core of services like Aerospike, CockroachDB, Consul, etcd, and ZooKeeper. Clusters are often used behind the scenes, so there's a good chance they are in your environment. For example, Kafka and Kubernetes both use cluster technology in their management layers.

Even if you aren’t running high-throughput clusters, this article will help you understand how cloud networks and the Internet behave when failures occur. You will learn how different types of network failures in massively scaled networks can affect your applications. While many of the recommendations in this article are focused on cluster systems, these recommendations can be adapted to any distributed system.

tl;dr - Too Long and Too Technical, Didn't Read

If you don't care about the technical details and just want the "so what do I do about this?" part, then this section is for you. For the rest of you, the remainder of this article explains the why behind these recommendations.

How do massive networks fail? (short version)

Massive networks are composed of thousands of devices. They fail all the time, every day. Some cloud providers have invested in automated failure detection and recovery, so client traffic is routed around failures in as little as ~10 ms to under a minute for hard-down device and link failures, and in 3 to 7 minutes for more complex problems, like network congestion. Other providers rely on operator intervention, which can take 15 minutes to an hour or more, to mitigate failures.

Even though recovery time varies from provider to provider, the effects of failure are similar. The connection between any two hosts, like a client and a service, is likely to be partially impacted rather than fully impacted when failures occur in massively scaled cloud networks. The partial impact will affect some network flows, like individual TCP connections, but not others.

Health checks that use their own connection, separate from the application connections, can succeed while your application connections fail. Not only that, health checks only respond to total loss of connectivity, not intermittent packet loss. Any automated failover system that relies on health checks, like load balancers, DNS, and your database clusters, will not automatically recover from the partial network failures common in cloud environments.

The good news is that most of the time, and especially when you use a leading cloud provider, those network failures will recover quickly on their own. Sometimes the scope of failure is so small and recovery is so quick that the network protocols your applications use will retry and recover with only minor latency increases. The bad news is that if you operate a truly critical application, or one that runs very close to its throughput limits, even these minor disruptions can turn into a big impact. For example, database clusters using no-data-loss replication (zero RPO), with nodes separated by thousands of miles and processing thousands of writes per second, can experience availability impacts from any network disruption that lasts longer than tens of milliseconds. Paxos/Raft clusters found in the newest database software are a little more resilient, but they will also experience performance and availability impacts during network disruptions.

What to do about it (short version)

Don’t rely on health checks alone. Health checks won’t detect partial network failures, including failures that affect a subset of connections or network congestion that causes intermittent packet loss. Use client-side error metrics and latency metrics to detect partial network failures. This article has more details on the full set of metrics you should have.

Use new connections on retry or many in parallel. Configure your client applications to use a new connection on retry or leverage something like AWS’s Scalable Reliable Datagram (SRD) detailed in this article. Using multiple network paths in parallel can route application connections around network failures and congestion.

Directly monitor critical dataplane connections. Some high-availability cluster technology implements host-to-host health checks on connections separate from the data replication connections. These implementations assume that a connectivity impact affecting the health check will impact all other connections between the hosts. Cloud network failures affect a subset of connections between hosts, so this health check strategy is less likely to detect partial network impacts in cloud environments. Make sure your cluster failover directly monitors data replication health between cluster nodes. Inline health checks that use the actual client and cluster connections can work. However, you must also prevent all of your capacity from being taken offline when reachability issues occur but nodes remain healthy.

Test your critical applications by simulating partial network failures. Test your application behavior during partial connection impairments. Note this isn't just about testing packet loss. Cloud network failure simulation should impact between ~5% and ~33% of your network connections. I recommend you test first by randomly impacting a subset of your application connections, and then by targeting specific ports to disrupt any specialized communication channels your application uses. For example, target node-to-node communication separately from client-to-node communication and vice versa. Check back on this site for a future blog post detailing how to simulate partial cloud failures.

How do massive networks fail? (long version)

In the long ago, high availability meant paying a lot of money for a hardware device with lots and lots of built-in redundancy. Redundant power, network links, backplanes, control planes, the works. These core routers were big and bulky, and you only had one or two of them in a datacenter. These days, cloud providers and large-scale network providers are taking a different approach.

The reason is simple: many service providers have grown beyond the scale of even the largest network devices. Not only that, but even if there were one router that could handle the scale of a whole AWS Region, a failure of that router would have too large a blast radius. Cloud providers and some large-scale Internet service providers are shifting to many smaller commodity network devices and away from large, mainframe-like network devices. AWS talked about their move to commodity hardware back in 2016 in this presentation.

Core Routers

Photo supplied by Cisco Systems Inc. of a CRS-1 Router. Shared here under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Massive networks use ECMP to scale

In order to join many smaller network devices together for scale, most providers use equal-cost multi-path (ECMP) routing to load balance network traffic. AWS briefly talks about their use of ECMP in their Deep dive on networking infrastructure re:Invent presentation. Google talks about their use of ECMP for their Maglev network load balancer. These are just two examples; you'll find that ECMP is a popular load-balancing mechanism for just about any network-layer service.

Though ECMP hashing can be done a few ways, AWS, Google, and many other massive-scale network providers use the 5-tuple of source IP address, source port, destination IP address, destination port, and protocol as the hash key. The 5-tuple uniquely defines a flow. For the TCP protocol, a flow is akin to a TCP connection. UDP is connectionless, but as long as the 5-tuple in each packet is the same, the packets are considered part of the same flow.

If any value in the 5-tuple differs, the flow will hash to a different, effectively random path through a different set of network devices as packets traverse each tier of the cloud network. All of the packets in a flow will follow a consistent path through a consistent set of devices. This is necessary for certain kinds of network devices, like firewalls and load balancers, which may need to track connection state to perform their function. Also note that ping requests use ICMP, which is a different protocol than TCP or UDP, so pings will always hash as a separate flow from your application traffic. You might be able to ping between two hosts, yet still have a connectivity problem between those same hosts.

Flow Table

Source IP       Source Port   Destination IP   Destination Port   Protocol
172.31.39.218   2837          172.31.103.96    80                 TCP
172.31.39.218   10945         172.31.103.96    80                 TCP
172.31.39.218   1608          172.31.103.96    80                 TCP

The table above shows a specific example of a client connecting to a web server. In this case, the client creates three connections to the server. The server listens on the same IP address and port for every connection, 172.31.103.96:80. The client uses the same IP address (172.31.39.218) but a different port number from the ephemeral port range (2837, 10945, and 1608) for each connection. Each of these unique flows will be hashed at many tiers in a cloud network to determine which link on which router that flow will use.

In a cloud environment, many tiers of the network are 100s to 1,000s of devices wide. ECMP flow hashing occurs at each of these tiers to determine which of those 100s or 1,000s of devices your connection will use. It is very likely that the three flows in our example will each use a different set of network devices, even though the client and server IP addresses are the same. This means that a failure of one device, or network congestion on one network interface, can impact one of your connections but not the others.
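To make flow hashing concrete, here is a minimal Python sketch of the idea. It is illustrative only: real routers use vendor-specific hardware hash functions, not MD5, and the eight-link tier width is made up. The point it demonstrates is that the three example connections above, which differ only by source port, can each land on a different link.

```python
import hashlib

def ecmp_next_hop(src_ip, src_port, dst_ip, dst_port, protocol, num_links):
    """Illustrative only: pick one of num_links equal-cost links from a 5-tuple.
    Identical 5-tuples always pick the same link; changing any field
    (here, the source port) can select a different link."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{protocol}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# The three connections from the flow table above, hashed across a
# hypothetical tier of 8 equal-cost links.
for src_port in (2837, 10945, 1608):
    link = ecmp_next_hop("172.31.39.218", src_port, "172.31.103.96", 80, "tcp", 8)
    print(f"source port {src_port} -> link {link}")
```

In a real cloud network this selection repeats at every tier, so each flow ends up on its own end-to-end set of devices.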

The diagram below gives you a visual idea of ECMP flow hashing across many network tiers. Connection A flows through an unimpaired set of network devices, whereas connection B flows through a device that is experiencing network congestion. Note that the health check agent, perhaps running on a load balancer, uses a separate source IP address and doesn't traverse all of the network tiers the client uses. This approach is fine for detecting server failures, but it won't detect reachability issues caused by failures in network tier 1 or 2.

Multi-Connection Multi-Path

There are benefits and trade-offs to this approach. Scaling out is great for minimizing impact: instead of a total outage, network device failures lead to partial failures. Sometimes these failures are so small and so brief that protocols like TCP automatically protect your application connections. However, these kinds of partial failures are really difficult for you to detect. Traditional health check approaches won't work, and certain high-availability applications aren't designed for partial connectivity impacts. If you want to know that connection A is healthy and connection B is unhealthy, you need to monitor both of those connections directly. This isn't easy to do.

Using single-connection HTTP/2 instead of HTTP/1.0 or 1.1 helps, because HTTP/2 multiplexes many requests through a single TCP connection. Any health checking through the HTTP/2 connection will follow the same network flow path as your actual requests. Any strategy that uses just one connection instead of many simplifies error detection, but it also means the affected client is more likely to experience a total failure instead of a partial failure.
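As a hedged illustration of that point, the sketch below uses the third-party httpx library (my choice of client, installed with its HTTP/2 extra, pip install "httpx[http2]"); the hostname and the /health and /orders/123 paths are hypothetical. Because HTTP/2 multiplexes requests over one connection, the health probe and the application request share the same network flow and therefore the same ECMP path.

```python
import httpx

# One client, one multiplexed HTTP/2 connection for the same origin.
with httpx.Client(http2=True, base_url="https://service.example.com") as client:
    app_response = client.get("/orders/123")   # real application request
    health_response = client.get("/health")    # probe rides the same connection
    print(app_response.http_version, health_response.http_version)  # "HTTP/2"
```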

Just about any client application can be written to create a new connection when certain kinds of failures occur. As we’ve seen from the example, a new connection will get a new source port, which means it’ll get a new unique path through a cloud provider network. That new path may not be impacted by whatever failure has occurred.

With a better understanding of how cloud networks scale, you can dig into your application and assess its resilience posture in the event of partial network failures. Is your application using many connections or a single connection? How does your application measure the health of the connection between the client and the service? Does that method use the same connection path as actual client traffic?

The massive scale of cloud networks and their use of ECMP flow hashing means that a network failure usually impacts a subset of flows between hosts rather than all of them. Now let's dig into the timing, error rates, and probability of failure in massively scaled networks.

Network failure timing and effects across cloud vendors

All of the major cloud providers invest heavily in automation. We don't get to know all of the specific details about the automation they use; some providers may talk about it only under strict NDA. AWS talks generally about what they do in their Deep dive on networking infrastructure re:Invent presentation. I cannot publicly disclose what AWS was doing while I worked there, but I will describe what is possible and how various types of failures can impact your application. If you are an operator of truly critical infrastructure, ask your cloud provider about the kind of network automation they are using. This section can help you come up with some pointed questions to find out what kind of capabilities your vendor has under the covers.

We will use the Five Categories of Failure as a way to understand the timing and effects of cloud network failures. For networks, the five categories look like this:

  1. Infrastructure failures: hardware failures of a whole device, failures of individual ports, or cable cuts.
  2. Code & Config failures: misconfigurations of a network device or firmware bugs.
  3. Data & State failures: things like corrupt data in route announcements. For stateful network devices, like firewalls and load balancers, this includes overflow of the connection-tracking state for the flows they process.
  4. Client-sourced failures: network traffic surges and DDoS attacks. Less common, but not without precedent, are poison-pill packets that cause network device failures.
  5. Dependency failures: less common for network devices, but these include things like DNS or the monitoring services administrators use to detect failures and reach the administrative functions of network devices. Network devices minimize their use of critical dependencies, so dependencies usually aren't the primary source of failure, but their failure can extend recovery times.

Infrastructure Failures

Physical fiber cuts and other failures that result in link-down signals can recover in as little as ~10 ms, though recovery can take 3 to 5 minutes if higher-level routing protocols are involved and route convergence is required. On cloud networks, this will usually cause a subset of connections to lose connectivity until automated systems shift traffic to healthy paths.

Cloud providers have lots and lots of network devices and links, so these kinds of failures are common, but the scope is usually small and recovery is usually fast. Cloud providers design assuming devices will fail and plan capacity accordingly, so secondary impacts like network congestion are less common for this type of failure. They can still happen if the failed device was carrying a disproportionately large amount of traffic, as was the case for this GCP failure.

Compared to intermittent packet loss, infrastructure failures are easy to detect with simple health checks. Remember that the failure will impact a subset of connections, so you need to monitor each connection directly. Failures with 10 ms to 1 second recovery times pair very nicely with default TCP retransmit configurations, so application connections will likely recover on their own with a couple of additional seconds of latency. You don't need to do anything to set this up; your OS defaults will work. However, application-layer request timeout and retry is necessary to recover connections when network recovery takes longer than a few seconds. Note that this retry logic will need to use a different connection to increase the odds that the retry bypasses the impaired path, and don't forget to throttle or back off retry requests to prevent retry storms.
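As a sketch of that retry pattern, here is one way to do it in Python with the requests library. The endpoint URL, attempt count, and delays are illustrative assumptions, not recommendations; the key ideas are the fresh connection per attempt and the jittered backoff.

```python
import random
import time
import requests

def get_with_fresh_connections(url, attempts=3, base_delay=0.5):
    """Each attempt uses a brand-new Session (a new TCP connection with a new
    source port), so the retry hashes onto a different ECMP path.
    Exponential backoff with jitter guards against retry storms."""
    for attempt in range(attempts):
        try:
            with requests.Session() as session:   # fresh connection pool per attempt
                return session.get(url, timeout=2.0)
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Hypothetical endpoint, shown for illustration only:
# response = get_with_fresh_connections("https://service.example.com/orders/123")
```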

Code & Config Failures

Recovery times from failures in code and configuration vary. Some providers have very good automation for deploying small-scope changes like device firmware updates. With good device-level monitoring, problems are detected and resolved in ~5 minutes. Large-scope changes are a different story. Sometimes route configurations have very large or even global scopes and may be difficult to detect and roll back, causing prolonged multi-hour outages. One of the biggest recent examples comes from Facebook: a change that deployed globally to their network took their DNS infrastructure offline and prevented administrator access to their systems, leading to a very large multi-hour outage.

Small-scope change-related failures are most likely to show up as a loss of connectivity for a subset of your connections. Large-scope failures may cause full loss of connectivity and/or packet loss.

Ask your vendor about the scope of their changes. You can place redundant application instances in different change domains to ensure that any change-related failures caused by the vendor will only impact a subset of your instances. Cloud providers should avoid globally-scoped changes and implement strict fault-domain partitioning and small-scoped changes as much as possible. If you find your vendor is making many globally-scoped changes or does a poor job of isolating failure, you can place redundant application instances into another vendor's infrastructure or switch to a vendor with better automation and smaller scopes of change.

Data & State Failures

When state-related failures occur, resolution almost always involves operator intervention, so recovery takes 15 minutes to many hours. Complex routing tables can hit limits in network devices, though this usually limits impact to new routes. Stateful network devices, like load balancers and firewalls, must manage connection state and can hit limits if they aren't carefully managed. Again, this often impacts new connections rather than established connections, but these kinds of failures have the potential to bubble up and cause secondary effects that impact existing connections too.

If a vendor uses a data sharding strategy and you understand the scope of data and state for the network services your application depends on, you can use redundant app instances and service instances to protect your application. For example, each instance of a network load balancer is likely to be assigned to different underlying service layers with separate connection-tracking state. When you use two network load balancers redundantly, a state-related failure may impact one but not the other. However, be aware that if your load balancer configuration data, or state related to your use of the load balancer, is causing the problem, it will affect all of your load balancer instances at the same time. Load balancer redundancy won't protect you here, but it can sometimes protect you from problems related to vendor-managed state, or state related to processing other customers' workloads.

Similar to change domains, understanding the scope of data and state shared across vendor network devices can help you determine where to place redundant instances of your application. Place your redundant application instances into diverse and distinct scopes of shared state when possible. For AWS, this means using multi-Region or multi-AZ deployments, and sometimes means using multiple redundant instances of a service.

Client-Sourced Failures (Network Congestion)

Cloud networks are multi-tenant: lots of customers share network links. The Internet is also a multi-tenant network. Congestion can be caused by surges in traffic from legitimate client usage and by malicious events like DDoS attacks. Congestion can also occur as a secondary effect of link- or device-level failures if redundant links are improperly scaled. In all cases, impacts show up as increasing levels of packet loss for a subset of connections.

Automated resolution of network congestion is very hard to implement, though a few providers do a good job here. The nature of congestion and route rebalancing means network congestion events can take 3 to 15 minutes to fully resolve. Providers with little or no automation must rebalance traffic manually, which can take 15 minutes to an hour or more. Don't assume your vendor has automation here; ask. There are no off-the-shelf tools for this, and automation requires a lot of ongoing investment from a vendor.

Client-sourced failures also include rare poison-pill requests, which can cause routers to crash or otherwise stop processing packets. These events are triggered when client-sourced packets interact with a bug in the network device. They are very challenging to diagnose and mitigate. Often mitigation requires the operator to identify the particular traffic causing the problem, then stop that traffic with filter rules or route updates that tarpit the problematic packets. The trouble with this class of failure is that shifting traffic to redundant devices will only cause the problem to spread. These events often take hours to mitigate. Some automation is possible, but it is even harder to implement than congestion mitigation. Understanding the client-traffic isolation or sharding schemes used by your vendor can help you assess this risk.

Using multiple connections in parallel with maximum path diversity can mitigate congestion-type failures. Path diversity can also protect you from poison-pill requests by increasing the odds your traffic won't be co-mingled with the problematic traffic. AWS SRD is an example of a turn-key solution you can use in AWS. Network congestion is difficult for you to detect, but monitoring the retransmit statistics on your hosts can clue you into network congestion events.
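As one hedged example of retransmit monitoring on Linux hosts, the sketch below samples the cumulative TCP counters in /proc/net/snmp and computes a retransmit ratio over a short window. The sampling interval and any alert threshold you pair with it are assumptions you should tune against your own workload's baseline.

```python
import time

def tcp_counters():
    """Read cumulative TCP OutSegs / RetransSegs from /proc/net/snmp (Linux only)."""
    with open("/proc/net/snmp") as f:
        rows = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = rows[0], rows[1]
    stats = dict(zip(header[1:], (int(v) for v in values[1:])))
    return stats["OutSegs"], stats["RetransSegs"]

def retransmit_ratio(interval=10.0):
    """Fraction of segments retransmitted during the sampling interval."""
    out1, ret1 = tcp_counters()
    time.sleep(interval)
    out2, ret2 = tcp_counters()
    return (ret2 - ret1) / max(out2 - out1, 1)

if __name__ == "__main__":
    # A sustained ratio above ~1-2% (illustrative threshold) hints that one
    # or more paths your host uses are congested.
    print(f"retransmit ratio: {retransmit_ratio():.2%}")
```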

Dependencies

Network devices usually have lots of protection for their packet processing layers, following data plane design patterns. When providers follow strict data plane design strategies, failures of dependencies rarely impact packet processing. More often, dependency failures extend the time to recover from one of the above types of failures. For example, if monitoring fails, automated congestion recovery may break, extending the time to recover from ~7 minutes to tens of minutes or hours.

Probability of failure in Cloud environments

More devices means a higher probability of failure. In a traditional datacenter, a host-to-host connection might traverse as few as two devices for intra-datacenter communication and four devices for inter-datacenter communication. In a cloud environment, each connection can easily involve ten times the number of devices, but the same concept applies: local point-to-point connections involve fewer devices than remote ones. The massive scale of cloud provider networks requires many more devices than a traditional datacenter uses. Though this increases the probability of failure, failures are smaller in scope and may not cause any impact at all to your application's connections.

There are two dimensions that affect the probability of network failure. The first dimension is depth, the number of network device tiers through which packets must flow. The second is the width of each tier, the number of redundant devices across which network traffic is load balanced. First we'll talk about the factors that increase or decrease depth.

Probability of failure increases with depth of tiers of devices

The first way the probability of network failure increases is with depth, the number of device tiers between two hosts. In AWS, the most local point-to-point connection is between two EC2 instances in the same Availability Zone (AZ), launched using a cluster placement group. Placing the two hosts in different AZs within the same AWS Region adds tiers of network devices. Hosts in two different Regions add still more tiers. A host in a private datacenter connected via Direct Connect to a host in an AWS Region similarly adds more tiers of network devices.

You also increase the depth of network device tiers as you use more services and features within a cloud environment. Things like PrivateLink, Transit Gateway, NAT Gateway, Network Load Balancer, Network Firewall, etc., all introduce additional network tiers between your hosts. Just as in a traditional datacenter, adding a firewall and a load balancer introduces additional points of failure in your network flows.

You can decrease network disruptions by colocating your hosts in a single AZ; this minimizes the network tiers between your hosts. You will see more network disruptions when you separate hosts across Regions or AZs, because doing so adds more network tiers. However, running across multiple AZs or Regions is necessary to mitigate AZ or Regional failures. For the vast majority of applications, running in three AZs in a single Region is the right balance of trade-offs.

Second, you can decrease the probability of failure by using fewer network-layer services. However, this also comes with a trade-off. Security devices protect your systems from malicious bad actors. Services like NLB are designed to load balance and automatically recover from application failures. By not using these services, you risk security compromise or may not be set up to recover from failure.

Depth of network tiers increases probability of failure

For the most critical cluster services, use the minimum number of services for node-to-node communication, and be intentional about placing your cluster nodes. Place nodes in multiple AZs or Regions for maximum redundancy, but plan for more network disruptions. Place all nodes within the same AZ or Region for less redundancy but fewer network disruptions. Client-to-node communication can and should flow through critical security layers; well-designed retry logic can overcome network disruptions for these flows.

Probability of individual connection failure decreases with wider tiers

The second way the probability of failure increases is as more redundant devices are added to a network tier. This effect applies to any distributed system and presents a trade-off. If you are load balancing network traffic across three redundant devices, you have three times the probability of failure compared to a network tier with only one device. However, when failures occur, each of your network flows has a 33% chance of impact, compared to the single-device tier where every flow has a 100% chance of impact. Multiple redundant devices also mean the system can shift traffic to other healthy devices to recover from a device failure.

Width of network tiers decreases probability of failure

In a cloud environment, most providers have no fewer than three redundant network devices, and as many as thousands, at each tier. With one thousand devices in a tier, device failures are likely to occur; however, each of your flows has a one-in-one-thousand chance of being impacted. With one thousand redundant devices, your application is unlikely to see any impact when a network device failure occurs.
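A rough back-of-the-envelope calculation illustrates this trade-off. The per-device failure probability below is made up for illustration; only the relationship matters: wider tiers make some failure in the tier more likely, but shrink the fraction of flows each failure touches.

```python
# Illustrative only: p is an assumed chance that any given device fails
# during some observation window.
p = 0.001

for width in (1, 3, 1000):
    prob_any_failure = 1 - (1 - p) ** width   # some device in the tier fails
    flows_hit_per_failure = 1 / width         # fraction of flows on the failed device
    print(f"width={width:4d}  "
          f"P(any device fails)={prob_any_failure:.3f}  "
          f"flows impacted per failure={flows_hit_per_failure:.1%}")
```

With a width of 3 you get roughly three times the failure probability of a single device, but each failure touches about a third of the flows; with a width of 1,000, failures are near-certain over time, yet each one touches only 0.1% of flows.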

You have little to no control over the number of redundant devices a cloud provider uses in their underlay networks. Underlay networks are the actual physical network devices that process packets. The exceptions are services like AWS Direct Connect, which you can use to directly connect your network devices to cloud networks. Sometimes you may be able to control the number of redundant devices via capacity controls, where bigger scale may increase the number of redundant network devices and smaller scale reduces it.

In some cases, cloud vendors may provide you with information about the number of devices that exist within a given tier of their network. With that information, you can make some predictions about the probable scope of impact when failures occur. This information can help you decide whether or not you should invest engineering effort to build redundant connections between your application instances or any of the other recommendations below.

What to do about it (long version)

Do nothing. If you've been running in the cloud for a number of years, then you already know what kind of availability you are getting from the network and network services you use. For most applications, what you get out of the box works pretty well. You can do a lot to improve resilience by following the best practices offered by your cloud provider. If you aren't happy with the baseline level of resilience, start with a good three-AZ, single-Region architecture and consider using one or more of the techniques below to achieve your resilience objectives.

Validate health on a per-connection basis to detect network failures that impact a subset of network flows. This is not easy, but it is possible. If you're using HTTP/2 with a single connection, generating your own health check alongside real requests means you are monitoring the actual network flow your application depends on. This gets a lot more challenging if you're using third-party software with its own health check strategy. You can directly monitor TCP retransmits on the host, then force health check failures when partial network impacts are detected. Direct connection monitoring also mitigates another issue with health checks: typical health check configurations can't reliably detect anything less than a 70% failure rate, so network congestion has to approach something close to a total outage before a health check will fire. You should, at minimum, analyze the health check and failure detection strategy used by the cluster software your application depends on.
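As a minimal sketch of the "force health check failures" idea, assuming a Linux host and a hypothetical /health endpoint polled by your load balancer, the server below reports unhealthy when the host-wide TCP retransmit ratio (read from /proc/net/snmp, as in the earlier congestion sketch) crosses an illustrative threshold. A real implementation would sample in the background rather than blocking each health request.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

RETRANSMIT_THRESHOLD = 0.02  # illustrative; tune against your own baseline

def _tcp_counters():
    """Cumulative OutSegs / RetransSegs from /proc/net/snmp (Linux only)."""
    with open("/proc/net/snmp") as f:
        rows = [line.split() for line in f if line.startswith("Tcp:")]
    stats = dict(zip(rows[0][1:], map(int, rows[1][1:])))
    return stats["OutSegs"], stats["RetransSegs"]

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404); self.end_headers(); return
        out1, ret1 = _tcp_counters()
        time.sleep(1.0)                               # short sampling window
        out2, ret2 = _tcp_counters()
        ratio = (ret2 - ret1) / max(out2 - out1, 1)
        self.send_response(200 if ratio < RETRANSMIT_THRESHOLD else 503)
        self.end_headers()
        self.wfile.write(f"retransmit ratio {ratio:.2%}\n".encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```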

Use a new connection on retry. I've had luck getting gRPC on Python to retry on new connections, but it is a bit tricky, and other client software may not support this approach. Please do be careful: if there's one thing worse than a retry storm, it's a retry storm that creates new connections on each and every attempt. Retry storms can easily lead to a self-DDoS if you don't implement a good backoff strategy.
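Here is a minimal sketch of the idea, not the exact implementation referenced above: only the grpc channel calls are real API, while the OrderServiceStub, request message, and target address are hypothetical placeholders. Each attempt builds a brand-new channel (new TCP connection, new source port, new ECMP path), with jittered backoff to avoid the retry storm problem just described.

```python
import random
import time
import grpc

def call_with_fresh_channels(make_request, target, attempts=3, base_delay=0.5):
    """Run make_request(channel) on a brand-new channel per attempt,
    backing off with jitter between attempts."""
    for attempt in range(attempts):
        channel = grpc.insecure_channel(target)
        try:
            return make_request(channel)
        except grpc.RpcError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
        finally:
            channel.close()

# Usage with a hypothetical generated stub:
# result = call_with_fresh_channels(
#     lambda ch: OrderServiceStub(ch).GetOrder(GetOrderRequest(id="123"), timeout=2.0),
#     "service.example.com:50051")
```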

Use parallel requests on different connections, or use AWS SRD. You can take full advantage of a cloud network's massive scale and multiple parallel paths when you issue parallel redundant requests between hosts. You may also use AWS features like SRD to gain the same advantage.
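Here is one hedged way to sketch that pattern in Python for idempotent requests; the URL is hypothetical and the parallelism of two is an illustrative choice. Each attempt uses its own Session, so each gets its own connection, source port, and likely a distinct ECMP path, and the first success wins. Note that this multiplies request volume, so weigh the extra load against the resilience benefit.

```python
import concurrent.futures
import requests

def hedged_get(url, parallelism=2, timeout=2.0):
    """Send the same idempotent GET over independent connections and
    return the first successful response."""
    def one_attempt():
        with requests.Session() as session:   # independent connection per attempt
            response = session.get(url, timeout=timeout)
            response.raise_for_status()
            return response

    with concurrent.futures.ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(one_attempt) for _ in range(parallelism)]
        errors = []
        for future in concurrent.futures.as_completed(futures):
            try:
                return future.result()
            except Exception as exc:
                errors.append(exc)
        raise errors[-1]   # every attempt failed

# response = hedged_get("https://service.example.com/orders/123")  # hypothetical URL
```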

Distribute service endpoints across vendor fault domains, including physical infrastructure, code and configuration change scopes, and shared state and data. In AWS, Availability Zones and Regions are change scopes, even for most network-related changes. Be intentional about how you deploy redundant capacity across these fault domains. Most applications should run in three AZs in a single Region. If you must run redundantly across far-away datacenters, like multiple AWS Regions, do your best to leverage eventual consistency strategies or asynchronous standby data replicas. If you need strict data consistency, your application will run more reliably and faster with a Paxos/Raft cluster than with a database using two-Region synchronous replication. With Paxos/Raft, you will also run more reliably with three Regions than with two; using three instead of two also minimizes the chances of a split brain.

Test your application with simulated partial network failures. You won't know how your application behaves when network failures occur without testing. At a minimum, use testing to see how network failures show up in your monitoring. Ideally, you can use testing to refine your network configuration and application retry logic to automatically recover from network failure events. If you are running data clusters, generate representative read and write transactions while you run through the scenarios below; some impacts only show up as increased write latency and reduced transaction throughput, so you'll need that traffic flowing to detect impact.

  1. Simulate device failures:
    1. Random 5% of connections, increasing to 33% of connections, with total packet loss
  2. Simulate network congestion:
    1. Random 5% of connections with 5% to 50% packet loss
    2. Random 33% of connections with 5% to 50% packet loss
  3. Simulate data/state failures:
    1. Prevent the ability to establish new connections

Conclusion

Cloud networks are different. Most cloud applications will benefit from the smaller-scale network failures and automated recovery built into the cloud. However, some applications may be sensitive to these more frequent but smaller-scale failures. You can improve your application's resilience by changing the way clients connect to service layers.

I’d like to hear what you think. Would you like a follow-up showing examples of the techniques described in this article? Drop me a comment with your thoughts or send me an email. Thanks for reading!