Release judgment call: Is the new code production ready?

By Alejandro Proaño, Core Engineering Intern

At Verizon Digital Media Services, our team of engineers is continuously working on research, development, maintenance and deployment of software that allows us to meet the demands of our customers and the CDN market. This process is closely monitored to ensure that it produces the expected results and does not affect the normal operations of our customers.

Once a new version of the code is developed, it goes through multiple layers of unit and load tests. Furthermore, we test the new code on a small sample of production traffic. We collect and monitor several metrics that help us evaluate the performance of the new code at each stage of testing and deployment.

Moreover, we carefully compare the performance of the new code with the old one, looking for anomalies in the behavior of our network that could negatively affect our clients' and customers' operations.

However, the comparison between the two versions of the code presents some challenges. For instance, consider the following figure, which shows the behavior of the performance metric “Cache Mbits” over a two-day span. Here, we divide our servers in Miami into two groups, and both groups run the same version of the code.

[Figure: “Cache Mbits” for two groups of Miami servers running the same code over a two-day span]

We make two important observations: (a) the behavior of the metric is cyclic with a period of one day, and (b) even though the two curves follow a similar trend, at certain times of the day there is a significant offset between the two groups of servers.

In both cases, these differences are due to the dependence of traffic on the time of day, the different traffic loads that each group of servers handles, and the hardware differences between the two groups. Therefore, if we compare the versions of the code in real time, it becomes difficult to distinguish these natural variations from those caused by the new code.

To perform the comparison, we choose a control and a test group of servers. The control set keeps the old version of the code (which we represent with A), while the test set runs the new version (represented by B). We divide time into at least two complete traffic cycles (e.g., one day each). During the first cycle both groups run A, and during the second cycle the control set runs A and the test set runs B. This is illustrated in the following figure.

[Figure: control and test sets over two traffic cycles; both run A during the first cycle, while in the second cycle the control set runs A and the test set runs B]
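As a quick illustration of this setup (the server names, counts, and dates below are hypothetical, chosen only for the example), the control/test split and the two-cycle schedule can be sketched in Python as follows:

    from datetime import datetime, timedelta

    # Hypothetical Miami server pool, split into a control set (stays on
    # version A) and a test set (switches to version B in the second cycle).
    servers = [f"mia-edge-{i:02d}" for i in range(1, 21)]
    control_set = servers[:10]   # runs A during both cycles
    test_set = servers[10:]      # runs A in cycle 1, B in cycle 2

    CYCLE = timedelta(days=1)    # one complete traffic cycle
    start = datetime(2013, 8, 5) # illustrative start time

    schedule = {
        "cycle_1": {"window": (start, start + CYCLE), "control": "A", "test": "A"},
        "cycle_2": {"window": (start + CYCLE, start + 2 * CYCLE), "control": "A", "test": "B"},
    }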

In order to detect the impact of the new code on the system, we divide time into small periods (e.g., one hour) and compare the difference between the two sets for each of these periods. The differences estimated during the first traffic cycle give us the offset between the two groups of servers, so that we can subtract it when we compare the two versions of the code. For instance, in the figure above we estimate the offset during period T11, and during period T12 we measure the change (if any) caused by the new version.
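A minimal sketch of this offset-correction step, assuming we already have per-period (e.g., hourly) averages of the metric for each group spanning the two cycles (the function name and array layout are ours, purely for illustration):

    import numpy as np

    def offset_corrected_differences(control, test, periods_per_cycle=24):
        # control, test: per-period averages of the metric (e.g., Cache Mbits)
        # covering two full traffic cycles: cycle 1 is A/A, cycle 2 is A/B.
        control = np.asarray(control, dtype=float)
        test = np.asarray(test, dtype=float)

        # Raw per-period differences between the two groups.
        diff = test - control

        # Cycle 1 (both groups on A) gives the natural offset for each period
        # of the day (the role of T11 in the figure above).
        offset = diff[:periods_per_cycle]

        # Cycle 2 (control on A, test on B): subtracting the matching cycle-1
        # offset isolates the change introduced by B (the role of T12).
        return diff[periods_per_cycle:2 * periods_per_cycle] - offset

If B has no effect, these corrected differences should oscillate around zero; a sustained departure from zero is what we look for next.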

Finally, we use probabilistic and information-theoretic techniques to detect and quantify the anomalies introduced by the new code. For example, the following figure shows how we estimate the change between two sets of servers in real time.

[Figure: real-time difference between the two groups of servers over six days, with the release at 5 pm on day four; the red line marks the new tendency of the difference]

Here we have two groups of servers in Miami, for which we collected the metric for six consecutive days. We released the new code to the test set at 5 pm on the fourth day.

We can see that the difference between the two groups of servers oscillates around zero prior to the release. However, as soon as the new code is deployed, this tendency changes and the difference drops below the zero threshold. The red line shows the new tendency of the difference.
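The post does not name the exact detector behind this plot, but as an illustrative stand-in, a simple CUSUM-style test on the offset-corrected differences captures the idea: the series hovers around zero until a sustained shift accumulates past a threshold.

    import numpy as np

    def cusum_alarms(series, drift=0.0, threshold=5.0):
        # Two-sided CUSUM on a (roughly zero-mean) difference series.
        # Returns the indices where the accumulated deviation from zero
        # exceeds `threshold`, signaling a sustained shift such as the one
        # introduced by the release on day four.
        g_pos, g_neg = 0.0, 0.0
        alarms = []
        for t, x in enumerate(np.asarray(series, dtype=float)):
            g_pos = max(0.0, g_pos + x - drift)   # accumulates upward shifts
            g_neg = max(0.0, g_neg - x - drift)   # accumulates downward shifts
            if g_pos > threshold or g_neg > threshold:
                alarms.append(t)
                g_pos, g_neg = 0.0, 0.0           # restart after an alarm
        return alarms

In practice, the drift and threshold would be calibrated from the pre-release period, for example as a small multiple of the standard deviation of the corrected differences.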

Even though the plot helps us detect a change due to the new code release, it does not indicate how well or poorly the metric is performing. In this case, when a change is detected by the system, we also obtain the change in the metric's statistics. For example, for the plot above we obtain the following statistics.

Mean-Pre (C-A),(T-A) = 117.49, 129.50

Mean-Post (C-A),(T-B) = 111.88, 135.47

Std-Pre (C-A),(T-A) = 48.73, 86.30

Std-Post (C-A),(T-B) = 58.91, 89.31

Change-Mean (A/B) = 10.35%

In the example, C and T represent the control and test sets, and Pre and Post the period before and after the release of B, respectively. From this data, we see that the difference detected in the plot is equivalent to an improvement of 10.35% when we upgrade from version A to version B.
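The post does not spell out how the 10.35% is computed, but the statistics above are consistent with taking the shift in the control-to-test offset and normalizing it by the post-release control mean; the short computation below is only a plausible reconstruction:

    # Plausible reconstruction of the 10.35% figure; the exact normalization
    # used internally is not stated in the post.
    mean_pre_control, mean_pre_test = 117.49, 129.50
    mean_post_control, mean_post_test = 111.88, 135.47

    offset_pre = mean_pre_test - mean_pre_control     # 12.01, natural offset
    offset_post = mean_post_test - mean_post_control  # 23.59, offset after B
    shift = offset_post - offset_pre                  # 11.58, attributable to B

    change_pct = 100 * shift / mean_post_control      # ~10.35%
    print(f"Change-Mean (A/B) = {change_pct:.2f}%")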

At the end of this process, we are able to determine the impact of the new code on our system. This is important because it helps us decide whether a full deployment of the code is warranted, and it makes it easy to keep a record of the performance advantages of each of our new code releases.

About Alejandro Proaño

Alejandro Proaño is a Ph.D. candidate in the Electrical and Computer Engineering department at the University of Arizona. He received the M.S. degree from the same department in 2010 and the B.S. degrees in Electrical Engineering and Mathematics from Universidad San Francisco de Quito, Ecuador, in 2008. During the summer of 2013, Alejandro was a Core Engineering Intern at Edgecast. His research interests include denial-of-service attacks in wireless networks, privacy protection in wireless communications, and performance analysis in computer networks.
