How to Analyze the Results of a Large-Scale Load Test

The following is another good article on load testing by Alon Girmonsky.

"Running a large-scale load test involves multiple challenges. The first challenge is to create the right testing environment that can realistically simulate thousands of concurrent users.

Simulating such a load realistically requires tens of dedicated servers residing on a high-bandwidth network.

Any compromise in the testing environment can make the test appear to have excellent results, providing false confidence when in fact the environment was not powerful enough to provide a realistic simulation.

Two days ago, Tescom Singapore, one of PerformanceXpert's valued resellers in the Far East, ran a 12,000-user load test.

The large-scale test consisted of 45 dedicated servers scattered all over the world, together simulating a load of 12,000 concurrent users.

With PerformanceXpert, each dedicated server enjoys a dual CPU, 1.7GB of memory and a bandwidth connection of up to 100Mbps.

Tescom created a load script that simulated several groups of users, each executing a different business process. For example, 10% of the users log in and view some pages, 30% search and read articles, and the remaining 60% do general browsing. All groups operate in parallel, creating the most realistic simulation of the website under test.
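
As a rough illustration of the user mix described above, the following Python sketch splits a total user count across groups by percentage. The group names and the `split_users` helper are hypothetical, chosen only for this example; they are not taken from Tescom's actual script.

```python
def split_users(total_users, percentages):
    """Allocate users to groups by integer percentage.

    The last group receives whatever is left, so the counts always
    add up to total_users.
    """
    names = list(percentages)
    counts = {}
    assigned = 0
    for name in names[:-1]:
        counts[name] = total_users * percentages[name] // 100
        assigned += counts[name]
    counts[names[-1]] = total_users - assigned
    return counts


# Percentages from the example in the text; the group names are made up.
MIX = {"login_and_view": 10, "search_and_read": 30, "general_browsing": 60}
GROUPS = split_users(12_000, MIX)
print(GROUPS)
```

With 12,000 users this gives 1,200, 3,600 and 7,200 users for the three groups.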

The script included a gradual load scenario, which is always good to use: because the load increases slowly, it is easy to see the load level at which each problem first appears.
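
A gradual load scenario can be thought of as a simple linear ramp. The sketch below is only an illustration: the `users_at` helper and the two-hour ramp duration are assumptions, since the exact ramp used in this test is not stated here.

```python
def users_at(elapsed_s, ramp_s, target_users):
    """Number of active users after elapsed_s seconds of a linear ramp."""
    if elapsed_s <= 0:
        return 0
    if elapsed_s >= ramp_s:
        return target_users
    return target_users * elapsed_s // ramp_s


RAMP_SECONDS = 2 * 60 * 60   # assumed ramp duration, for illustration only
TARGET_USERS = 12_000        # full load of the test

# Print the planned load at 30-minute intervals.
for minutes in (0, 30, 60, 90, 120):
    print(minutes, users_at(minutes * 60, RAMP_SECONDS, TARGET_USERS))
```

Because the user count at any moment is known, a timestamp in a report can be mapped back to the load at that time.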

The test results were quite surprising and also educational. It is sufficient to have a quick look at the reports to see that the graphs yield very interesting conclusions.

I will use this article to describe these conclusions and illustrate a suggested approach to analyzing such reports.

First Step - Look at the Overall Average Report
My first step would be to look at the Response Time Vs Users for All transactions involved in the test. By "All" I mean the average response time of all requests including pages, images, CSS and JS files etc.

A quick look at the report made it obvious that there was a problem. Up to almost 2,000 concurrent users, the average response time held steady at about 600ms. At about 07:42:00 GMT, it started to increase.

Looking at the above report, several points become obvious:

A - The average response time while the website under test is not yet sensitive to load. We call it Idle Time. It is the average response time from when only a few users are visiting the website up to the point where the website begins to be sensitive to the load. In this case it's about 600ms.

B - The point where the website under test becomes sensitive to load. We call it the Load Sensitivity Point. From that point the response time started to increase as the load increased.

C - The absolute time of the Load Sensitivity Point. In this case, it was 07:42:00 GMT. This point enables one to identify the number of users that were accessing the website under test at the point where the website under test became sensitive to load.

D - The number of users accessing the website under test at the Load Sensitivity Point.
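
Points A to D can also be read off programmatically. The following is a minimal sketch, assuming the report can be exported as time-ordered (time, users, average response time) samples. The sample values are invented to resemble the numbers above, not the measured data, and `find_sensitivity_point` and its 25% tolerance are hypothetical choices.

```python
def find_sensitivity_point(samples, baseline_count=3, tolerance=0.25):
    """Estimate Idle Time and the Load Sensitivity Point.

    samples: time-ordered list of (time, users, avg_response_ms).
    Idle Time is the mean of the first baseline_count samples. The
    Load Sensitivity Point is the first later sample whose response
    time exceeds Idle Time by more than the tolerance.
    """
    baseline = [ms for _, _, ms in samples[:baseline_count]]
    idle_ms = sum(baseline) / len(baseline)
    limit = idle_ms * (1 + tolerance)
    for sample in samples[baseline_count:]:
        if sample[2] > limit:
            return idle_ms, sample
    return idle_ms, None


# Invented samples for illustration: (time GMT, users, avg response ms).
SAMPLES = [
    ("07:30", 1400, 600),
    ("07:34", 1600, 610),
    ("07:38", 1800, 590),
    ("07:42", 2000, 800),
    ("07:46", 2200, 1200),
]

IDLE_MS, POINT = find_sensitivity_point(SAMPLES)
print(IDLE_MS, POINT)
```

With real data a single noisy sample could trigger the check, so it may be safer to require several consecutive samples above the limit.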

Step Two - Look for a Bandwidth Bottleneck
Not every problem results from a bandwidth bottleneck, but a bandwidth bottleneck is very easy to find. For this we need to look at the throughput report.

Looking at the reports, there is an obvious bottleneck that is most likely related to bandwidth. In a normal test (a test without bottlenecks), the throughput would have kept increasing and reached its limit only when the test reached its full capacity. In this case the throughput should have continued to increase until the full 12,000 users were accessing the website under test. The full load capacity was reached at 09:26:00 GMT, while the bandwidth consumption reached its limit at 07:40:00 GMT. The probable reason is a bottleneck.

Looking at the above report, several points become obvious:

A - The potential throughput limitation. In this case it's close to 1.4GB per minute, which works out to ~187Mbps. This is only a potential bottleneck. It should still be verified.

B - The point in time when the bandwidth reached its limit. This point will help us to identify the number of users that were accessing the website under test at that point. It's no coincidence that this is the Load Sensitivity Point we mentioned earlier.

C - The plateau in the graph, which shows that the bandwidth in the test hit an actual limit.
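
The reading of the throughput report can be sketched with two small helpers: one that converts GB per minute to Mbps (assuming decimal units, 1GB = 1,000,000,000 bytes), and one that finds when a series first gets close to its peak. The helper names, the 95% threshold and the sample series are assumptions made for illustration; only the 1.4GB per minute figure and the two timestamps come from the test.

```python
def gb_per_minute_to_mbps(gb_per_minute):
    """Convert gigabytes per minute to megabits per second (decimal units)."""
    bits_per_second = gb_per_minute * 1_000_000_000 * 8 / 60
    return bits_per_second / 1_000_000


def plateau_start(series, fraction=0.95):
    """Return the first timestamp whose value reaches fraction of the peak."""
    peak = max(value for _, value in series)
    for timestamp, value in series:
        if value >= fraction * peak:
            return timestamp
    return None


# Invented samples for illustration: (HH:MM GMT, value), same day.
THROUGHPUT_GB_PER_MIN = [("07:20", 0.70), ("07:30", 1.05), ("07:40", 1.38),
                         ("08:00", 1.40), ("09:26", 1.39)]
USERS = [("07:20", 1000), ("07:30", 1500), ("07:40", 1950),
         ("08:00", 4000), ("09:26", 12000)]

LIMIT_MBPS = gb_per_minute_to_mbps(1.4)
THROUGHPUT_FLAT_AT = plateau_start(THROUGHPUT_GB_PER_MIN)
FULL_LOAD_AT = plateau_start(USERS)
# Throughput flattening well before full load suggests a bottleneck.
BOTTLENECK_SUSPECTED = THROUGHPUT_FLAT_AT < FULL_LOAD_AT

print(round(LIMIT_MBPS), THROUGHPUT_FLAT_AT, FULL_LOAD_AT, BOTTLENECK_SUSPECTED)
```

The conversion confirms the figure in point A: 1.4GB per minute rounds to 187Mbps.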

Although it is obvious there is a bottleneck, it is not obvious that the bottleneck is related to bandwidth. There is an easy two-step process to determine if the bottleneck is related to bandwidth.

1 - Test with a browser from an external location during the load. With a bottleneck related to bandwidth, the perceived behavior should resemble the one in the report (i.e. very high response time).

2 - Test with a browser within the same LAN as the website under test. If the results are better, it means that the limit of the LAN's connection to the WAN was probably reached, in other words a bandwidth bottleneck. If the result is the same as indicated in the report, it can mean a different bottleneck, probably not related to bandwidth.

During this test, Tescom tested with a browser within the LAN and the response time was very good. Testing from an external location produced poor results, matching what appeared in the real-time report. This confirmed that the bottleneck was related to bandwidth.
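
The decision logic of the two-step check can be written down as a small function. This is a sketch only: `classify_bottleneck`, the response times passed to it and the 2,000ms threshold for an acceptable response are hypothetical.

```python
def classify_bottleneck(external_ms, lan_ms, acceptable_ms):
    """Interpret the two browser checks made while the load is running.

    external_ms: response time seen from an external location.
    lan_ms: response time seen from inside the website's LAN.
    acceptable_ms: assumed threshold for a good response time.
    """
    external_slow = external_ms > acceptable_ms
    lan_slow = lan_ms > acceptable_ms
    if not external_slow:
        return "problem not reproduced externally"
    if lan_slow:
        return "bottleneck probably not bandwidth"
    return "bandwidth bottleneck"


# Hypothetical measurements: slow from outside, fast inside the LAN.
VERDICT = classify_bottleneck(external_ms=8000, lan_ms=500, acceptable_ms=2000)
print(VERDICT)
```

The pattern Tescom observed, slow externally and fast within the LAN, falls into the last branch.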

Step Three - Looking for Errors
Reported errors can teach us a lot about the performance of the website under test. The most educational errors are 5XX errors, which actually tell us about the system status. Usually, however, the website under test stops responding before it generates any such errors. In that case we see many timeout or disconnection errors, as was the case in this test. At a certain point the website under test stopped responding altogether. Apparently it crashed at about 9,500 users.

At 08:58:00 GMT, numerous connection timed out and socket errors appeared because the website under test was no longer responding. At that time about 9,500 users were accessing the website.
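
Finding the moment at which errors start to pile up can be sketched as a per-minute count. The error records, the `first_error_spike` helper and the threshold of 50 errors per minute are invented for illustration; a real error log would have to be parsed into this shape first.

```python
from collections import Counter


def first_error_spike(errors, threshold):
    """Return (minute, count) for the first minute with at least threshold errors.

    errors: iterable of (minute, error_type) pairs, with minute as HH:MM.
    """
    per_minute = Counter(minute for minute, _ in errors)
    for minute in sorted(per_minute):
        if per_minute[minute] >= threshold:
            return minute, per_minute[minute]
    return None


# Invented error records for illustration only.
ERRORS = (
    [("08:55", "connection timed out")]
    + [("08:58", "connection timed out")] * 40
    + [("08:58", "socket error")] * 25
    + [("08:59", "socket error")] * 70
)

SPIKE = first_error_spike(ERRORS, threshold=50)
print(SPIKE)
```

The minute returned can then be matched against the load report to find how many users were active when the website stopped responding.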

Step Four - Correlating the Load Report with the User Experience Report
It is very important to correlate the load report with the user experience report. PerformanceXpert automates two different systems to get comprehensive reports. The first system is for the load. Based on JMeter, PerformanceXpert launches numerous servers that generate a load according to a load script. In parallel, PerformanceXpert uses a different system based on Selenium to automate the launch of real browsers during the load period to measure render times and other KPIs as they are perceived by a real browser. The two systems are not connected but work in parallel to complement one another."

A good article on load testing.