Four Ways to Avoid Cascading Failures on a CDN
By Dave Andrews, CDN Architect & Evangelist
It’s the worst-case scenario for any content delivery network (CDN). First, one server experiences a surge of traffic that puts it over capacity, and it goes down. Now, other servers on the CDN — their number cut by one — must compensate for the crashed server. They must take on more of the load (traffic), but not all of them can handle it. As servers topple like dominoes, there’s a chance the entire point of presence (PoP) on the CDN will go dark, and that the whole failure pattern will repeat at the PoP level, with entire PoPs getting overloaded instead of just servers. The risk of regional or even global service outages increases with every passing moment.
The situation I just described is called a cascading failure, and it’s the stuff of nightmares for any CDN provider.
It’s also more likely to occur now than it ever has been before.
As a recent Cisco study revealed, CDN usage is growing faster than internet use overall, likely driven by the increased consumption of OTT video worldwide. In developing markets, more and more people have access to broadband internet, allowing them to stream movies and TV shows for the first time. In established markets, more people are cutting the cord, cancelling cable and choosing to watch OTT video content instead. For CDNs, the combination of these two trends can create strain if network providers don’t plan ahead for increased traffic.
CDNs have become non-negotiable components of the modern web, which means their continued uptime is vital to an enormous number of customers. Here are some of the solutions we use at Verizon Digital Media Services to avoid the worst-case scenario of a cascading failure.
Load Testing
Every component in your system has its limits. Load testing is about understanding exactly where those limits are, and how your performance profile changes as different components max out. What you want to see is a performance profile that plateaus at a constant rate; what you want to avoid is a system where performance tanks when a component hits its limit.
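The plateau-versus-collapse distinction can be made concrete with a sketch like the following. The `healthy` and `fragile` functions are toy stand-ins for a real system under test, and the 50%-of-peak cutoff is an illustrative threshold, not a standard one:

```python
# Hypothetical load-test sweep: drive a service at increasing offered load
# and classify the shape of the resulting throughput curve.

def sweep(service, levels):
    """Measure throughput at each offered load level."""
    return [service(offered) for offered in levels]

def classify(profile):
    """A healthy profile plateaus near its peak; a collapsing one falls
    sharply past the peak -- the signature of a component tipping over."""
    peak = max(profile)
    return "collapse" if profile[-1] < 0.5 * peak else "plateau"

# Toy stand-ins: one server saturates gracefully, the other falls over.
healthy = lambda load: min(load, 100)                       # caps at 100 req/s
fragile = lambda load: load if load <= 100 else max(0, 200 - load)

print(classify(sweep(healthy, range(10, 200, 10))))  # plateau
print(classify(sweep(fragile, range(10, 200, 10))))  # collapse
```

The point of the sweep is to keep increasing load past the peak: both toy servers look identical below capacity, and only the overload region reveals which shape you have.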
We’ve found that a best practice is rigorous load testing in a controlled lab environment, coupled with very careful monitoring of performance during deployments to production. After all, there is no load test quite like production. This approach gives us the visibility and confidence at every step to know that performance is either being maintained or improved. Because we are always looking at production, we also get constant feedback on how our load testing results are changing over time.
To safely facilitate the 20 or so deployments we do on the network each week, we use a program called CoalMine. CoalMine allows us to canary our changes through several levels of staging environments — from dedicated testing PoPs that are identical to production, to shadow PoPs receiving replayed real production traffic (using another internal system called Ghostfish), and slowly out to actual production. CoalMine also gives us immediate visibility into how new code or configuration impacts any of the hundreds of metrics we collect for each server in our network. Every deployment we do goes through this critical process, which allows us to catch performance issues early and keep their impact negligible.
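The core check a canary stage performs can be sketched as a metric comparison between the canary and the fleet baseline. This is a minimal illustration of the idea, not CoalMine itself; the metric names and the 10% tolerance are assumptions for the example:

```python
# Sketch of a canary gate: block promotion if any metric on the canary
# regresses past a tolerance relative to the fleet baseline.

def canary_ok(baseline, canary, tolerance=0.10):
    """Return (ok, regressions). A metric regresses if the canary value
    exceeds the baseline by more than `tolerance`; all metrics here are
    lower-is-better (e.g. latency, error rate)."""
    regressions = {
        name: (baseline[name], value)
        for name, value in canary.items()
        if value > baseline[name] * (1 + tolerance)
    }
    return not regressions, regressions

baseline = {"p99_latency_ms": 40.0, "error_rate": 0.002}
good = {"p99_latency_ms": 41.0, "error_rate": 0.002}
bad = {"p99_latency_ms": 55.0, "error_rate": 0.002}

print(canary_ok(baseline, good)[0])  # True  -- safe to widen the rollout
print(canary_ok(baseline, bad)[0])   # False -- hold and investigate
```

Running the same gate at each staging level (test PoP, shadow PoP, first production PoP) is what keeps a bad change's blast radius small.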
Monitoring and Alerting
Once you understand the limits of your system and have confidence in a lower-bound on performance when any part of that system runs into its limits, the next step is to ensure you actually know when it’s happening. Having monitoring and alerting in place allows you to understand when anything approaches its limit. This advanced warning affords you the opportunity to take corrective action before any performance issues or failures occur.
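The "advanced warning" idea reduces to alerting at a fraction of each known limit rather than at the limit itself. The limits, metric names, and 80% warning fraction below are illustrative assumptions:

```python
# Sketch of alert-before-failure logic: warn when any tracked metric
# crosses a fraction of its known limit (the limits come from load testing).

LIMITS = {"cpu_pct": 100, "conns": 50_000, "egress_gbps": 40}
WARN_FRACTION = 0.8  # fire with headroom left, before anything fails

def alerts(sample):
    """Return the metrics that are within 20% of their limit."""
    return [
        name for name, value in sample.items()
        if value >= WARN_FRACTION * LIMITS[name]
    ]

print(alerts({"cpu_pct": 55, "conns": 44_000, "egress_gbps": 12}))
# ['conns']  (44,000 >= 0.8 * 50,000)
```

The warning fraction is a trade-off: too high and you lose reaction time, too low and operators drown in noise.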
Traffic Shaping
Often the first corrective action we take is to ensure that traffic is being optimally distributed to avoid overloading any one component. For example, if only some servers in a PoP are hot, traffic can be shaped to be more evenly distributed and to bring performance back up to normal.
The same concept applies at larger scales as well. Within a PoP, load can be balanced among network circuits; at a regional scale, traffic can be balanced between multiple PoPs; and at a global scale, we can adjust the balance between regions.
In almost every single case, these shaping actions are enough to mitigate any risk of poor performance or failure.
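One simple way to express the shaping idea is to weight each server by its remaining headroom, so new traffic naturally drifts away from hot servers. The capacities, loads, and the proportional-to-headroom policy below are illustrative; a production system would shape traffic far more gradually and with many more signals:

```python
# Sketch of rebalancing by headroom: route new traffic in proportion to
# each server's spare capacity rather than uniformly.

def shape_weights(capacity, load):
    """New routing weights proportional to each server's spare capacity."""
    headroom = {s: max(capacity[s] - load[s], 0) for s in capacity}
    total = sum(headroom.values())
    return {s: h / total for s, h in headroom.items()}

capacity = {"a": 100, "b": 100, "c": 100}
load = {"a": 95, "b": 40, "c": 45}  # server "a" is running hot

weights = shape_weights(capacity, load)
print(weights["a"] < weights["b"])  # True: new traffic favors cooler servers
```

The same weighting can be applied one level up, with PoPs in place of servers, which is what makes the technique scale from a single rack to a region.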
Planning for Failure
Proper planning for disaster is a pretty good way to avoid it. Backed by the significant capacity of the Edgecast CDN (30+ Tbps at the time of writing), we have never experienced a cascading failure scenario at scale, but we still make it a point to plan for one.
We think through different component failures at a variety of scales — within a single system, in a PoP, and globally — to ensure that we have strategies in place to contain a failure to the smallest possible area.
In a complex system of many components, this can be achieved by limiting or reducing the interconnectedness of the components, and ensuring that the impact that one component can have on another has a controlled boundary.
For example, in a single system, using Linux control groups allows us to logically wall off critical processes and guarantee that any issues from less important, internal-facing utilities don’t interfere with delivering the best possible performance for our customers.
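A hedged sketch of that walling-off idea, using the cgroup v2 filesystem's `cpu.max` interface: cap a low-priority internal utility so it cannot starve the delivery path. The group name, the 20% cap, and the helper functions are assumptions for illustration, and actually writing these files requires root:

```python
# Illustrative cgroup v2 sketch: compute the file writes that would cap a
# process group's CPU, separated from the (privileged) act of applying them.

from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")

def cpu_max_settings(group, quota_pct, period_us=100_000):
    """Return (file, contents) pairs capping `group` at `quota_pct` percent
    of one CPU via cgroup v2's cpu.max ("<quota_us> <period_us>") format."""
    quota_us = period_us * quota_pct // 100
    return [(CGROUP_ROOT / group / "cpu.max", f"{quota_us} {period_us}")]

def apply(settings):
    """Write the settings out; needs root and an existing cgroup."""
    for path, contents in settings:
        path.write_text(contents)

# Cap a hypothetical internal utility group at 20% of a CPU.
print(cpu_max_settings("internal-utils", 20))
```

Separating the pure "what to write" step from the privileged "write it" step also makes the policy itself easy to test without root.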
At a global level, we use DNS to select regional collections of PoPs to serve a given request, and then anycast to select the closest PoP. Initially, we did this for performance reasons — to address some of the inherent limitations of purely anycast-based request routing and to ensure that requests are routed to a PoP with an acceptable round-trip time. This approach also keeps traffic regionalized — ensuring that if PoPs start to enter a degraded state in Europe, PoPs in Africa don’t jump in and try to compensate.
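The containment property can be made concrete with a toy region map: degraded PoPs are only ever replaced by healthy PoPs in the same region, so a brownout in one region cannot pull traffic onto another. The region names, PoP codes, and health states here are invented for the example:

```python
# Illustrative sketch of regionalized DNS selection: DNS narrows to one
# region's PoPs, and failover stays inside that region.

REGIONS = {
    "europe": ["lhr", "fra", "ams"],
    "africa": ["jnb", "los"],
}

def eligible_pops(region, health):
    """PoPs DNS may hand out for `region`: healthy ones in that region only.
    (In production, anycast then picks the closest of these by RTT.)"""
    return [pop for pop in REGIONS[region] if health.get(pop) == "ok"]

health = {"lhr": "degraded", "fra": "ok", "ams": "ok", "jnb": "ok", "los": "ok"}

print(eligible_pops("europe", health))  # ['fra', 'ams'] -- Africa untouched
```

The hard boundary in `eligible_pops` is the containment: even with every European PoP degraded, the function would return an empty list rather than reach across regions, forcing an explicit operator decision instead of an automatic cascade.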
In reality, many containment techniques, including the ones just mentioned, are layered together to make it exceedingly unlikely that a cascading failure will spread.
The goal of every CDN provider should be to create resilient systems that maximize performance and minimize liabilities. Fortunately, implementing established best practices while also constantly looking for ways to develop new ones can help even the largest CDNs like ours stay strong.
Want to discover how our Edgecast CDN’s new agile management toolset puts the power and control at your fingertips? Come see us at Velocity Booth #801 and schedule a meeting with us.
Dave Andrews will also speak all about avoiding cascading failures at Velocity in San Jose, Thursday, June 22, 2017, 9:20 a.m.