Sherlock Holmes: Improving server loading time
You want to improve your server’s loading time. It matters both for your users’ experience and for your SEO ranking in search engines. Specifically, the metrics are the total loading time and the TTFB (time to first byte). The loading time of the HTML page itself (not including the other assets) depends on the size of the HTML, but the TTFB does not, since it measures only how long the first byte takes to arrive.
The story: I was asked to help improve the loading speed of a website hosted on AWS (us-east-1 region, N. Virginia). Let’s call this website slow.com. As a reference, I used another website hosted in the same AWS region that loaded faster: fast.com.
Doing a simple test from a laptop, on a random Wi-Fi network in a random location in the world, yields these results:
| (seconds) | slow.com | fast.com |
|---|---|---|
| Lookup time | 0.126 | 0.125 |
| Connect time | 0.284 | 0.130 |
| Pretransfer | 0.872 | 0.199 |
| Start transfer | 1.551 | 0.431 |
| Total | 1.960 | 0.813 |
Times are cumulative and can be obtained using:
curl -s -w '\nLookup time:\t%{time_namelookup}\nConnect time:\t%{time_connect}\nPreXfer time:\t%{time_pretransfer}\nStartXfer time:\t%{time_starttransfer}\n\nTotal time:\t%{time_total}\n' -o /dev/null https://fast.com
So there’s a 1-second difference. How do you go about investigating the root cause of the difference? Naturally, the first thing to check is the actual server processing time of the request: every additional millisecond of server processing time adds directly to the total loading time.
For a better and more stable comparison, the following tests were run from a London-based server (AWS region eu-west-2), ~6,000km from N. Virginia.
Since both sites use an Nginx+Apache web server stack, I used Apache’s mod_headers (with the %D format, which reports the request processing time in microseconds) to measure the server processing time. On fast.com it took 16ms and on slow.com 52ms, a difference of 36ms, which unfortunately cannot explain the ~1000ms difference in total time. The culprit is probably elsewhere. Looking deeper, I saw that %D does not include some operating-system time, so maybe something happens on the node (i.e., the server) after Apache has done its job? Since slow.com’s HTML was twice the size of fast.com’s (18KB vs 8KB), I wanted to eliminate this factor, so I ran the same check on another slow.com page that was also around 8KB, but the results didn’t change. The HTML size is not the root cause here.
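If you want to read that number from the outside, here is a minimal sketch in Python, assuming the origin exposes %D in a response header (e.g. Header set X-Response-Time "%D" in the Apache config; the header name here is my own invention):

import requests

# assumes the Apache config contains: Header set X-Response-Time "%D"
# (hypothetical header name; mod_headers prefixes the value with "D="
# and reports microseconds)
response = requests.get("https://slow.com")
raw = response.headers.get("X-Response-Time", "")
if raw.startswith("D="):
    print(f"server processing time: {int(raw[2:]) / 1000:.1f} ms")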
I knew that fast.com (the faster site) had Cloudflare (a reverse proxy & CDN) in front of it, and slow.com didn’t. So I wanted to compare the time it takes to reach fast.com with and without CF in front of it, but there was a catch: the fast.com origin server was configured to serve plain http, with SSL support (for https) supplied by CF. So I had to compare http://fast.com (directly to the origin server) vs https://fast.com (to CF’s servers). The first result was 941ms and the second 975ms, so no real difference. Note that the comparison is not quite legitimate, since we’re comparing http and https against different locations.
Can it be that server strength affects network speed? Can it be that AWS throttles or limits bandwidth or latency according to the instance type? I saw that there is a packets-per-second limit that depends on the instance type, but that can’t explain the latency (it only caps throughput, which is not the bottleneck here). Still, I had to rule it out.
So I created 3 more server instances of stronger types (more vCPUs, RAM, etc.) and benchmarked the results. In order to send the request to a different server, you have to “spoof” the host name so the server can respond correctly, and also bypass the SSL certificate check. You can do it by adding a record to /etc/hosts (e.g. 111.111.111.111 slow.com), which is not very convenient since it cannot easily be done programmatically, or with a curl command that overrides the DNS lookup:
curl -o /dev/null -s -w 'Total: %{time_total}s\n' --resolve slow.com:443:111.111.111.111 https://slow.com
Adding -k to the curl command makes it ignore the SSL certificate. Or, in Python, use the requests_toolbelt package, like this:
import requests
from requests_toolbelt.adapters import host_header_ssl

s = requests.Session()
# verify the SSL certificate against the Host header instead of the URL,
# so we can hit the server by IP address (the equivalent of curl --resolve)
s.mount('https://', host_header_ssl.HostHeaderSSLAdapter())
response = s.get("https://111.111.111.111", headers={"Host": "slow.com"})
print(response.text)
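And to benchmark rather than just fetch, you can time the exchange too; a small addition to the snippet above (response.elapsed measures from sending the request until the response headers arrive, i.e. roughly the TTFB):

# continuing the snippet above; elapsed covers request-sent to headers-parsed
print(f"elapsed: {response.elapsed.total_seconds() * 1000:.0f} ms")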
In order to do so, you need to invoke the requests from the right place; in my case, London. It would have been nice to use an existing service like KeyCDN or GTmetrix, but I couldn’t find an option to override the DNS lookup in these services, and just entering https://111.111.111.111 there won’t help, because the SSL host-name matching would fail (supporting this would be a nice feature).
I couldn’t find a noticeable change in response times when changing the server instance type. What else could it be? Maybe an SSL issue? So I checked the time from London to http://slow.com (400ms) and to https://slow.com (600ms). So removing SSL would save 200ms. That’s a lot. Then I called https://fast.com from London, and it took 200ms (remember, this goes through Cloudflare), and http://fast.com (by IP address, not domain name) took 250ms! That means the CF request, even though it used SSL, took 50ms less than a direct request to the origin server without https. What the heck?
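To see where the https overhead actually goes, you can time the TCP and TLS handshakes separately. A minimal sketch using only Python’s standard library (curl reports the same split via %{time_connect} and %{time_appconnect}):

import socket, ssl, time

host = "slow.com"
ctx = ssl.create_default_context()

t0 = time.perf_counter()
sock = socket.create_connection((host, 443))       # TCP handshake: 1 round trip
t1 = time.perf_counter()
tls = ctx.wrap_socket(sock, server_hostname=host)  # TLS handshake: 1-2 round trips
t2 = time.perf_counter()

print(f"TCP handshake: {(t1 - t0) * 1000:.0f} ms")
print(f"TLS handshake: {(t2 - t1) * 1000:.0f} ms")
tls.close()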
When observing the requests, I noticed that slow.com was served over HTTP/1.1 while fast.com was served over HTTP/2. Also, slow.com used TLS 1.2, not 1.3. Maybe that’s the issue?
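You can verify both from Python by checking the negotiated TLS version and the ALPN-selected HTTP version; a quick sketch:

import socket, ssl

def probe(host):
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2", "http/1.1"])  # offer HTTP/2, fall back to 1.1
    with socket.create_connection((host, 443)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # e.g. ('TLSv1.2', 'http/1.1') vs ('TLSv1.3', 'h2')
            return tls.version(), tls.selected_alpn_protocol()

for host in ("slow.com", "fast.com"):
    print(host, probe(host))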
I decided to make a (quite dramatic) change and put slow.com behind the CF service. Now https://slow.com took 340ms instead of 600ms! That’s it. That’s the issue. Half the time, by adding another server on the way. How can that be? The reason is that the communication between the client (London) and CF’s servers was https, while the communication from CF to the origin server (N. Virginia) was plain http (called Flexible SSL on CF). CF has plenty of servers around the world, so the round trips of the SSL handshake only had to reach a nearby server (probably in London as well), and only the pure unencrypted http communication traveled to N. Virginia.

By the way, TLS 1.3 saves one round trip compared to TLS 1.2: if plain http is one round trip, TLS 1.2 is 3 round trips and TLS 1.3 is 2. Since the server used TLS 1.2, those 2 additional round trips were the root cause. 2 round trips mean 4 legs, and if every 50 kilometers costs 1ms of latency, each ~6,000km leg means around ~120ms. Since CF’s servers were (probably) also in London, the additional SSL (TLS 1.3) handshake from London to London was negligible, and the rest of the communication was plain http to N. Virginia. That’s why https://slow.com (with CF) took 340ms while http://slow.com (without CF) took 400ms. I can’t explain the 60ms difference in favor of CF; it is probably a measurement error or, maybe, some additional optimization made by CF. It could be, for example, that the CF servers maintained an open TCP connection (keep-alive) with the origin server, which saved successive requests a TCP handshake.
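If that keep-alive guess is right, a second request over the same connection should skip the TCP (and TLS) handshakes entirely. A quick way to observe the effect with a reused session (a sketch, not a rigorous benchmark):

import time
import requests

s = requests.Session()  # reuses the underlying TCP/TLS connection (keep-alive)
for i in range(2):
    t0 = time.perf_counter()
    s.get("https://slow.com")
    print(f"request {i + 1}: {(time.perf_counter() - t0) * 1000:.0f} ms")
# the second request typically runs faster: no TCP or TLS handshake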
This solved the mystery. Changing the CF SSL mode from Flexible to Strict (https from CF to the origin as well) changed the response time from 340ms to 530ms, which is further proof that the findings are true. For this specific site, unencrypted communication between CF and the origin server is not a major security issue, so the solution is valid. Further investigation should look at what happens if, instead of serving plain http from the origin server, we serve https from the origin, but this time with HTTP/2 or HTTP/3 and TLS 1.3. Will HTTP/2, which allows pushing resources, combined with TLS 1.3 be better than or equal to plain http?