March 23, 2020

How to use RTT metrics to improve streaming performance

Colin Rasor, Director of Traffic and Performance Management, Verizon Media

Quality streaming of video depends on millions of things going right, such as managing a constantly fluctuating workload or dealing with "flash crowds" when large numbers of viewers enter a stream at the same time. It’s why delivering a reliable, high-quality video stream as part of a paid service – where viewers expect a TV-like experience – requires tools and metrics to finely articulate performance challenges so you can know where and how to fix issues.

CDNs are indispensable in video streaming because they provide low-latency scalability on demand around the world. In addition to the optimizations that enhance the way the CDN balances the chaotic audience growth that can accompany a live stream, delivering great performance to the end user requires an additional layer of visibility, metrics, tools, and automation.

In this post, we'll review examples of recent performance optimizations we made for a large North American vMVPD streaming service, including:

  1. Metrics we use to improve/fix performance problems
  2. How to define performance and how to measure it
  3. Ongoing optimizations we take to improve video performance

Autonomous System Numbers: The complexity behind the curtain

The modern web is built around multiple interconnected layers of networks known as autonomous systems, each identified by an ASN (autonomous system number) and consisting of large groups of IP addresses, coupled with CDNs (content delivery networks) that reduce congestion by making content available at the edge. Essentially, as shown below, the Internet consists of a network of networks, each with its unique business model and priorities.

Source: Research Gate

Compounding the inherent complexity of ASNs interacting with each other is their sheer scope and scale. The vizAS tool attempts to visualize the interconnections among the many ASNs on a country-by-country basis. For example, in the U.S. alone, there are 16,979 ASNs and 24,905 peering relationships across the networks. Globally, there are more than 90,000 ASNs.

https://stats.apnic.net/vizas/index.html

From our perspective as a CDN provider, the complexity involved with connecting to thousands of ASNs is compounded by the need to accommodate each customer’s performance requirements, usage profile, type of traffic, and many other factors. For example, a service like Twitter is going to have a much different usage profile or footprint than a gaming company pushing out a major update. Similarly, a broadcaster streaming a live sporting event has a different profile than an on-demand streaming service. Even two separate live streaming services are likely to have different profiles, with each requiring tailored performance optimization.

Behind the scenes, we have a large number of configuration settings that we can adjust and tune to help customers achieve their performance goals and requirements. For some, the performance may be what they expected (or better) from the get-go, and we don’t need to change anything. Others may have specific goals, such as faster TTFB (Time To First Byte), an important metric for streaming services, that need to be addressed.

Given the complexity and size of the Internet, it’s impossible to deliver useful and consistent performance improvements through “whack-a-mole” or scattershot approaches. Real gains come through comprehensive data gathering and intensive data analysis to identify the root cause of a problem or to proactively gain insight into the network configuration changes that will benefit a customer the most.

Delivering RTT insights with Stargazer

One of the most important metrics in determining network latency and overall performance health is RTT (round-trip time). RTT is the time, measured in milliseconds, that it takes a packet to travel from source to destination plus the time for a response to return to the source. It can be used to diagnose and improve network performance across a number of vectors, including congestion, hardware issues, misconfigurations, and routing issues.
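To make the definition concrete, here is a minimal sketch of one common way to approximate RTT from a client: timing a TCP handshake, which takes roughly one round trip. This is not how Stargazer collects its data; the function name and the default timeout are assumptions chosen for illustration.

```python
import socket
import time


def tcp_connect_rtt_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Approximate one RTT as the time needed to complete a TCP handshake.

    connect() returns once the SYN-ACK arrives, so the elapsed time is
    roughly one round trip plus a small amount of local processing.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0


# Hypothetical usage (the host name is a placeholder, not a real endpoint):
#   samples = [tcp_connect_rtt_ms("edge.example.com", 443) for _ in range(5)]
#   print(min(samples))  # the minimum is the sample least affected by queuing
```

Taking several samples and looking at the minimum and the spread, rather than a single reading, gives a steadier picture, because any one handshake can be delayed by transient queuing.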

Given the importance of this metric, we have built an internal system called Stargazer that we use to aggregate RTT data from a variety of sources, including our sensors, data we import from customers, and third parties who also monitor RTT information. Stargazer monitors outbound response times going to the client. Other data sources can include BGP (Border Gateway Protocol) tables from routers, mapping of routers to geographic locations, RTT logs for connections from clients, and traffic volume information. Additionally, the system can perform active probes such as traceroutes and pings when necessary.

Behind the monitoring activity, Stargazer, in conjunction with other tools we’ve developed, enables us to query all the data we’ve collected and to perform in-depth analysis that opens the door to continual improvements. Our network administrators can use this information to isolate problems and identify network routes or configurations to optimize performance for specific customer goals and requirements. And, as will be discussed later, it’s also useful for understanding the effect new technologies such as the BBR (Bottleneck Bandwidth and Round-trip propagation time) congestion control algorithm have on performance.

Optimizing an origin server

To provide more insight into how performance optimization works in practice, let’s take a look at an example involving a recently added live video streaming customer who needed to optimize for a multi-CDN architecture. In a traditional CDN client architecture, the client makes a request to one of our PoPs, and the PoP cache fills from the origin, as shown below.

However, this customer chose to take advantage of a multi-CDN architecture to increase redundancy and resiliency and potentially increase performance. This is enabled by our Origin Shield Architecture shown in Figure 4. Origin Shield offers more control over how various clients’ traffic can be routed for best performance.

Unlike a traditional CDN relationship, all traffic, including that served by the second CDN, flows back to one of our PoPs (AGA) located in Atlanta for cache fill. The AGA PoP then serves as the origin, or more specifically, what’s known as the origin shield, relieving considerable burden from the customer’s origin server. The AGA PoP was chosen as the origin shield location due to its overall higher cache-hit ratio and performance compared to the second CDN. A major concern, however, was optimizing performance between the two CDNs.

The first step in the process was to look into optimizing the routes taken by the second CDN to AGA with it acting as the origin shield. One issue that became immediately apparent was poor performance between the CDNs and a high number of connection timeouts impacting TTFB. To dig into the performance issues, we used Stargazer to send a traceroute from AGA to the intended destination and capture the ASNs used for every hop along the way.

As shown in the summary below, a traceroute was sent from AGA to an IP address at the second CDN, simulating the path a client would use.

We noticed that several of the hops were within ASN 7018, which was not the preferred route because it involved more hops and had more congestion. Using Stargazer, we quickly confirmed the problem and made the appropriate network changes. As the traceroute summary below shows, after making the network change, we decreased the hop count and improved RTT by switching over to ASN 7922, which also resolved the issue with TTFB timeouts.
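The kind of comparison described above can be sketched in a few lines of Python. The sketch assumes the traceroute output has already been parsed and each hop annotated with its ASN; the `Hop` record and `summarize_path` function are hypothetical names, and the hop data is invented for illustration rather than taken from our traces. ASNs 64500 and 64501 are reserved documentation numbers standing in for the two CDNs.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Hop:
    ttl: int
    asn: Optional[int]  # None when the hop's address maps to no known ASN
    rtt_ms: float


def summarize_path(hops: List[Hop]) -> dict:
    """Collapse a traceroute into hop count, last-hop RTT, and per-ASN runs."""
    if not hops:
        raise ValueError("traceroute returned no hops")
    # Consecutive hops inside the same ASN are merged into (asn, hop_count).
    segments: List[Tuple[Optional[int], int]] = [
        (asn, sum(1 for _ in run))
        for asn, run in groupby(hops, key=lambda h: h.asn)
    ]
    return {
        "hop_count": len(hops),
        "last_hop_rtt_ms": hops[-1].rtt_ms,
        "asn_segments": segments,
    }


# Invented hops, for illustration only.
before = [
    Hop(1, 64500, 0.5),
    Hop(2, 7018, 4.0),
    Hop(3, 7018, 11.0),
    Hop(4, 7018, 19.0),
    Hop(5, 64501, 24.0),
]
after = [Hop(1, 64500, 0.5), Hop(2, 7922, 6.0), Hop(3, 64501, 12.0)]

print(summarize_path(before))
print(summarize_path(after))
```

Grouping consecutive hops by ASN makes it easy to see at a glance how much of a path sits inside a transit network that is not the preferred route.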

In a different example, we used the Stargazer tool to determine the best origin shield path to a customer’s origin server located on the East Coast. To be effective at reducing the load on a customer’s origin and improving delivery performance, the origin shield PoP should be close to the origin. In some cases, it’s not necessarily the closest physical PoP that works best. What matters is a combination of the fewest ASNs, the least amount of congestion, and low RTT. As you can see in the before and after chart below, a Stargazer traceroute showed that moving from the NYA (New York) PoP to the DCC (Washington, D.C.) PoP reduced the hop count to three, resulting in an overall improvement in RTT performance.

Deeper analysis with Sweeper Fish

With more than 5,000 interconnects and more than 150 PoPs across our CDN globally, there is plenty of ongoing optimization work. To keep from spinning our wheels on tasks that may not make much difference, we needed an efficient way to prioritize the actions taken by our operational teams. This need led to the development of a companion tool to Stargazer called Sweeper Fish that scores the places where problems exist and allows us to bubble them up in a prioritized way.

Sweeper Fish is also useful for determining whether a route is congested and likely to cause problems. Going back to the multi-CDN example, we used Sweeper Fish to discover which routes had congestion. To do this, Sweeper Fish measured the delta between the 25th and 75th percentiles of RUM (Real User Measurement) data for all clients over all paths to the AGA PoP, focusing on the second CDN’s path to us, ASN 7922. The standard deviation for all traffic over this ASN is shown below.

We also confirmed that the previous configuration through ASN 7018 was not the way to go. Sweeper Fish analysis showed that the IQR (interquartile range) spiked to 20-60 milliseconds (see Figure 9) due to congestion on this route. IQR, also called the midspread or middle 50%, provides a useful way to quickly analyze a route and flag problems. The lower the IQR, the better.

In contrast, the AGA PoP stayed consistently between 10 and 22 milliseconds on the route we ended up using, ASN 7922, as shown below. By comparing different ASNs, Sweeper Fish enables us to choose the best, least congested route and/or identify issues for remediation.
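For readers who want to try the IQR comparison on their own RTT data, here is a minimal sketch using Python's standard library. It is not Sweeper Fish's implementation. The function name, the route labels, and the sample values are all assumptions made up for illustration, and the choice of the "inclusive" quantile method is arbitrary.

```python
import statistics
from typing import Dict, Sequence


def iqr_ms(rtt_samples_ms: Sequence[float]) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _median, q3 = statistics.quantiles(
        rtt_samples_ms, n=4, method="inclusive"
    )
    return q3 - q1


# Invented RTT samples (milliseconds) per route, for illustration only.
routes: Dict[str, Sequence[float]] = {
    "AS7018": [30, 40, 55, 70, 90],
    "AS7922": [30, 34, 38, 42, 46],
}

iqr_by_route = {name: iqr_ms(samples) for name, samples in routes.items()}

# A lower IQR means a tighter RTT distribution, which suggests less congestion.
least_congested = min(iqr_by_route, key=iqr_by_route.get)
print(iqr_by_route, least_congested)
```

The IQR is attractive here because it ignores the tails of the distribution, so a handful of extreme outliers does not dominate the score the way it can with a mean or a maximum.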

TCP tuning

Congestion also causes packets to be dropped and retransmitted. This can occur when the paths between PoPs are long and the TCP congestion control algorithm in use is not tuned for them. Since AGA was serving as the origin in our example, distant PoPs that reached AGA could have this issue. Although the data was spread across many PoPs, the aggregated CDN logs shown below enabled us to identify problem areas, as indicated by the boxes.
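As a rough illustration of this kind of aggregation, the sketch below sums per-PoP segment counts and flags PoPs whose retransmit rate exceeds a threshold. The record layout, the PoP labels, the counts, and the 2% threshold are all assumptions for the example; our actual log format and alerting thresholds are not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Each record: (pop, segments_sent, segments_retransmitted). Hypothetical layout.
Record = Tuple[str, int, int]


def retransmit_rates(records: Iterable[Record]) -> Dict[str, float]:
    """Aggregate counts per PoP and return retransmitted / sent for each."""
    sent: Dict[str, int] = defaultdict(int)
    retransmitted: Dict[str, int] = defaultdict(int)
    for pop, n_sent, n_retx in records:
        sent[pop] += n_sent
        retransmitted[pop] += n_retx
    return {pop: retransmitted[pop] / sent[pop] for pop in sent if sent[pop] > 0}


def flag_pops(rates: Dict[str, float], threshold: float = 0.02) -> List[str]:
    """Return PoPs above the threshold, worst first."""
    flagged = [pop for pop, rate in rates.items() if rate > threshold]
    return sorted(flagged, key=lambda pop: rates[pop], reverse=True)


# Invented log records, for illustration only.
records = [
    ("POP_A", 1000, 5),
    ("POP_A", 1000, 15),
    ("POP_B", 2000, 100),
    ("POP_C", 500, 20),
]

rates = retransmit_rates(records)
print(flag_pops(rates))
```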

Figure 11. Aggregated server logs quickly identify problem areas where packets are being dropped and retransmitted.

Evaluating BBR

Bottleneck Bandwidth and Round-trip propagation time (BBR) is an alternative TCP congestion control algorithm developed by Google that has started to gain some traction. It differs from loss-based congestion control algorithms, such as the ubiquitous TCP-Cubic, and introduces a different operation paradigm. In essence, it continuously updates how much data can be in flight based on its estimates of the bottleneck bandwidth and the minimum RTT the session has recently seen, with the aim of avoiding bufferbloat.
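The quantity at the heart of this approach is the bandwidth-delay product (BDP): the estimated bottleneck bandwidth multiplied by the minimum RTT. The sketch below is only a worked calculation of that product, not an implementation of BBR. The function names are hypothetical, and the gain of 2 follows the published BBR v1 description and should be treated as an assumption here.

```python
def bdp_bytes(bottleneck_bw_bps: float, min_rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes.

    bits/s * ms / (8 bits per byte * 1000 ms per s) = bytes
    """
    return bottleneck_bw_bps * min_rtt_ms / 8000.0


def inflight_cap_bytes(
    bottleneck_bw_bps: float, min_rtt_ms: float, cwnd_gain: float = 2.0
) -> float:
    """Upper bound on data in flight, as a multiple of the BDP."""
    return cwnd_gain * bdp_bytes(bottleneck_bw_bps, min_rtt_ms)


# Example: a 100 Mbit/s bottleneck with a 20 ms minimum RTT.
print(bdp_bytes(100e6, 20.0))           # 250000.0 bytes
print(inflight_cap_bytes(100e6, 20.0))  # 500000.0 bytes
```

Keeping the data in flight close to the BDP fills the path without building a standing queue at the bottleneck, which is why an accurate minimum RTT estimate matters so much to this style of congestion control.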

In our analysis of BBR, detailed in a separate blog post, we found that BBR is a useful tool for improving RTT performance but falls short of a universal solution. It helps on some paths and not on others. Stargazer helps us decide where to use it by tracking the RTT profile to each destination over time. This allows us to determine the best places to apply BBR to help reduce retransmits and improve flow control.

Based on the analysis shown in the charts below, we concluded that switching to BBR would slightly improve performance for the AGA PoP to the second CDN and customer origin. The first set of graphs shows that changing from TCP-Cubic to TCP BBR resulted in a decrease in retransmits, while the second set of graphs indicates that the change to BBR provided a slight increase in average throughput.

Figure 12. In this example, changing from TCP-Cubic to TCP BBR resulted in a decrease in retransmits

Figure 13. In this example, switching to BBR for flow control from TCP-Cubic for the AGA PoP reduced retransmits and slightly improved average throughput.

Conclusion

The Internet is both vast and complex – it is essentially a network of networks. For Verizon Media, optimizing performance for customer use cases would be next to impossible without in-depth analytics to gain insight into problem areas and to test possible configuration changes. To improve performance for our customers, we have developed a robust set of tools to continually monitor RTTs so we can quickly and efficiently make improvements across our network. For a live video streaming service, we found ways to optimize performance for their unique requirements, while also evaluating the use of BBR for their application. The result was a high-performance streaming solution leveraging two CDNs. And we’re not done yet. As network congestion continues to increase, we will never stop optimizing our network to ensure our customers and their clients have the best online experience possible.
