Measuring the Internet: TCP Time Machine

Measuring the Internet is a blog series about what we learned while measuring, scaling, and fine-tuning the performance of one of the Internet’s largest networks. Verizon’s content delivery network (CDN) serves billions of objects per hour on behalf of its customers and carries a significant percentage of global Internet traffic. We collect over a hundred terabytes of data points from thousands of servers daily to ensure the speed and reliability of the world’s busiest websites.

At Verizon Digital Media Services (formerly EdgeCast), we are firm believers in the power of data. We monitor all of our systems both internally and externally and collect mountains of performance metrics from them. We also sample and monitor through independent third parties. These external measurements are key to understanding the state of the Internet at large and, more importantly, to capturing the interplay between our own systems and the public Internet, which lets us optimize and perform to a world-class standard.

So we naturally work very closely with many synthetic and real user monitoring (RUM) platforms, and we maintain excellent professional relationships with them. These systems have been extremely useful for detecting, isolating, and troubleshooting problems in the past and have become important parts of our monitoring workflow.

But every once in a while we get a complaint from customers using these third-party monitoring platforms, pointing to a performance problem that does not make much sense to us.

In order to understand how third-party monitoring platforms measure performance, you have to understand some of their technical limitations.

Let’s take a look at this chart for a data transfer from Tokyo to San Jose:

[Chart: download time for a 1 MB object served from Tokyo, measured from two test nodes in San Jose]

This chart shows how long it takes to download a 1 MB file hosted on one of our servers in Tokyo from one of our testing partners’ nodes in San Jose (we reconstructed the graph rather than publish the original, out of respect for our professional relationship with the third-party platform used in this case).

Identical tests, different results

Both tests were IPv4 tests and otherwise identical. We were surprised to see two nodes on the same provider (AT&T) and in the same location (San Jose) differ by about 1 second on average (just under 20%) in their download time for this object.

When we see behavior like this, one of the first tools we reach for is a packet capture. We ran more tests while capturing packets, and one of the first things we noticed was that even though there was no packet loss, the transfer progressed almost linearly over time.

[Chart: time-sequence graph from the packet capture, showing the transfer progressing almost linearly with no packet loss]

This is odd, because with no loss we should be in TCP slow start, where the congestion window ramps up and we expect more packets in flight with every round trip. Let’s compare by taking a closer look at an ideal packet capture:

[Chart: time-sequence graph of an ideal transfer, with the congestion window ramping up during slow start]

That top line is the receiver’s window: the amount of un-‘acked’ data we are allowed to send towards the client. In the Node 2 capture we are constantly bumping up against this window, which means that the client (the third-party monitoring platform’s Node 2) does not allow us to send more data.
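To see why a transfer pinned against a fixed receive window comes out linear instead of exponential, here is a rough, idealized simulation. The initial window, the per-round-trip doubling, and the 64 KB cap are simplifying assumptions for illustration, not values taken from this capture:

```python
# Idealized sketch (not our production tooling): simulate how much data a sender
# can keep in flight per round trip when TCP slow start runs into a fixed,
# non-growing advertised receive window.

MSS = 1460                 # bytes per segment (typical Ethernet MSS)
RWND = 65_535              # a receive window that never grows (assumed cap)
FILE_SIZE = 1_000_000      # the 1 MB test object

cwnd = 10 * MSS            # a common initial congestion window (RFC 6928)
sent = 0
rtt = 0

while sent < FILE_SIZE:
    window = min(cwnd, RWND)   # the sender is limited by the smaller window
    sent += window
    cwnd *= 2                  # slow start: roughly double per round trip, no loss
    rtt += 1
    print(f"RTT {rtt:>2}: in flight {window:>6} bytes, "
          f"total sent {min(sent, FILE_SIZE):>7}")
```

In this idealized model the exponential ramp stops after only a few round trips; from then on every round trip moves the same fixed amount of data, which is exactly the straight-line pattern in the capture above.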

Why is that? Why is Node 2 not accepting more data?

The answer is in the very first packet of this connection. This client does not support TCP Window Scaling:

[Screenshot: the client’s SYN packet, with no TCP Window Scale option among its TCP options]
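The negotiation is easy to spot in a capture yourself. Below is a small diagnostic sketch using scapy; it is illustrative rather than the exact tooling we used, and capture.pcap is a placeholder filename:

```python
# Report whether each client SYN in a capture advertises the Window Scale option.
from scapy.all import rdpcap, IP, IPv6, TCP

for pkt in rdpcap("capture.pcap"):           # placeholder path to a test trace
    if not pkt.haslayer(TCP):
        continue
    flags = int(pkt[TCP].flags)
    if flags & 0x02 and not flags & 0x10:    # SYN set, ACK clear -> client SYN
        src = pkt[IP].src if pkt.haslayer(IP) else pkt[IPv6].src
        opts = dict(pkt[TCP].options)        # e.g. {'MSS': 1460, 'WScale': 7, ...}
        if "WScale" in opts:
            print(f"{src}: window scaling enabled, shift count {opts['WScale']}")
        else:
            print(f"{src}: window scaling NOT advertised")
```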

What is window scaling?

Window scaling is a pretty old concept (it was standardized in RFC 1323 back in 1992), so one would expect to see it implemented just about everywhere. The TCP header’s window field is only 16 bits, so without scaling a receiver can never advertise a window larger than 65,535 bytes. The window scale option, negotiated in the SYN packets at the start of a connection, multiplies the advertised window by a power of two, which lets the sender keep more and more data in flight with each round trip, assuming the receiver-side buffer keeps up and all the packets are ‘acked’ properly.

This is not to be confused with the congestion window, which is the limit the sender imposes on itself; at any moment, the sender can only have the smaller of the two windows’ worth of data in flight.
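To see how hard this cap bites on a long path, here is a back-of-the-envelope calculation. The ~105 ms Tokyo-to-San Jose round-trip time is an assumed, round-number figure for illustration, not a value measured in the tests above:

```python
# Throughput ceiling for a window-limited connection: at most one full
# receive window of data can be delivered per round trip.

rwnd = 65_535            # bytes: largest window advertisable without scaling
rtt = 0.105              # seconds: assumed Tokyo-to-San Jose round-trip time
file_size = 1_000_000    # the 1 MB test object

ceiling = rwnd / rtt     # bytes per second
print(f"Throughput ceiling: {ceiling / 1e6:.2f} MB/s "
      f"({ceiling * 8 / 1e6:.1f} Mbit/s)")
print(f"Best-case transfer time for 1 MB: {file_size / ceiling:.2f} s")
```

In other words, no matter how much capacity the path has, a client that cannot scale its window limits itself to roughly 5 Mbit/s on a trans-Pacific round trip, so a window-limited 1 MB download spends well over a second just waiting on ACKs.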

Once we saw this odd and unexpected behavior, we asked ourselves: what percentage of today’s Internet has this feature disabled? We ran a large-scale sampling process, collecting data from 5% of all of our servers worldwide, and found that about 11.8% of today’s web clients have window scaling disabled.

[Chart: share of sampled web clients with window scaling disabled (about 11.8%)]
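The aggregation itself does not need to be fancy. The sketch below is purely illustrative of the kind of tally involved; the sample directory and file layout are hypothetical, and our actual pipeline runs on far more data than a single machine would handle:

```python
# Tally what fraction of client SYNs across sampled captures omit the
# Window Scale option. "samples/*.pcap" is a hypothetical layout.
import glob
from scapy.all import rdpcap, TCP

with_scaling = without_scaling = 0

for path in glob.glob("samples/*.pcap"):
    for pkt in rdpcap(path):
        if not pkt.haslayer(TCP):
            continue
        flags = int(pkt[TCP].flags)
        if flags & 0x02 and not flags & 0x10:          # client SYNs only
            if any(name == "WScale" for name, _ in pkt[TCP].options):
                with_scaling += 1
            else:
                without_scaling += 1

total = with_scaling + without_scaling
if total:
    print(f"{without_scaling / total:.1%} of sampled clients "
          "did not advertise window scaling")
```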

Tomorrow’s performance, measured with yesterday’s tools

This makes us wonder why a third-party monitoring platform is stuck in time, acting like a TCP time machine. If you have ever wondered how Internet performance looked 12 years ago, all you need to do is run a few tests from these third-party nodes with window scaling disabled.

We will work with the monitoring platform referenced in this case to resolve the issue; we believe the impacted nodes are dual-stack (IPv4 and IPv6) nodes. The purpose of this article is to explain some of the challenges we face when prospects and customers use these tools to evaluate and measure our performance: because of third-party monitoring issues like this one, the results do not give an accurate picture of what our platform actually delivers.
