Why don’t load tests work?
“Why did your website crash? You must have known it would be busy!”
You hear this a lot when a website falls over at a crucial moment. The same questions are asked over social media, in the same accusatory tone. “They must have known it was going to be busy. Why couldn’t they throw a few extra giga-whatevers at it?”
To the disappointed users, it seems obvious that a company should have put more preventative measures in place before a big traffic spike, and that the only thing that could possibly have gone wrong was bad planning.
But what are those preventative measures? Load testing? A ‘bigger server’? More servers perhaps? How about one of those magic, scaling serverless clouds…?!
Often, the first time we meet a team’s web developers, they are at a CrowdHandler onboarding session with their tails between their legs. They feel like they’ve failed. They did everything by the book and the website still didn’t stay up - so now, due to time constraints, they are having to duct-tape it with a waiting room. The first ten minutes of the meeting is spent getting the excuses out of the way.
We’ve heard it all. You’re currently rebuilding the platform and the new site won’t have these problems. You inherited a legacy code-base you can’t change. You’re stuck on an old version of Framework X, which is notorious for performance issues. You could fix all the issues, of course; it’s just a timing thing.
The reason for the embarrassment is that most of the bottlenecks - the problems preventing websites from handling thousands and thousands of users - are based in the code, for which the developers feel responsible.
These bottlenecks are often in the backend: something behind the storefront - a system dealing with ticketing or appointments, or credit-card processing for example. But just as frequently, the bottlenecks are in the storefront itself. In the web code. Code over which the developers have full control.
Listening to the developers’ excuses doesn’t make us feel smug or superior. What we mostly feel is recognition. Here’s the truth: no-one’s code scales first time out of the box, and most of what you’ve been taught about writing scalable code is wrong.
I’ll explain why it’s not your fault in a moment, but first I want to introduce you to a new way of thinking about the Waiting Room.
Installing CrowdHandler is not admitting failure. In fact, it could be the first step toward fixing your scalability problem. Instead of a sticking plaster, I’d like you to think of CrowdHandler as a diagnostic tool that will support your work and help you to improve.
It’s not your fault
So why isn’t it the fault of the developers when the website can’t handle the traffic?
The reason mainstream code is riddled with bottlenecks is that mainstream web developers do not typically work on incredibly popular websites. When web devs train and learn, their understanding of optimisation and of how to write performant code is - usually - theoretical.
Learning how to write scalable code in theory and then seeing a hundred thousand users hit your site in practice reminds me of that Mike Tyson quote: ‘Everyone has a plan till they get hit in the mouth’. Now you’ve been hit in the mouth. You should be proud! The vast majority of web developers never work on anything popular enough to find out how that feels.
So to you - and your boss - I would like to say: it’s entirely normal that your website can’t magically scale to handle any level of traffic. You shouldn’t expect it to.
So what’s the answer?
Of course, you want to make your site as efficient and scalable as possible. However, a lot of the presumed solutions for scalability, for the sort of numbers we’re talking about here, won’t actually work.
You might focus on frameworks and languages that are theoretically faster, but that makes little difference in the grand scheme of things when you’re serving thousands or millions of users.
You know the thing that does make a difference? Caching. Not the kind of precise object caching you want to do near the server with Memcached, but that dirty edge HTTP caching close to the user that feels like it’s outside your control. That really does improve performance, but it’s hard to get right, and it’s hard to retro-fit to frameworks and content management systems that don’t properly account for it.
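To make that concrete, here is a minimal sketch of the kind of edge HTTP caching we mean, expressed as cache-control headers that a CDN or edge cache can act on. It assumes a small Flask app sitting behind an edge cache that honours s-maxage; the routes, cache lifetimes and stub renderers are purely illustrative, not a prescription for your stack.

```python
# A minimal sketch of edge HTTP caching, assuming a Flask app sitting
# behind a CDN or edge cache that honours Cache-Control / s-maxage.
# Routes, cache lifetimes and stub renderers are illustrative only.
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/products/<int:product_id>")
def product(product_id):
    resp = make_response(render_product(product_id))
    # Let the edge serve one shared, rendered copy to thousands of users
    # for 60 seconds, and a slightly stale copy while it revalidates.
    resp.headers["Cache-Control"] = "public, s-maxage=60, stale-while-revalidate=30"
    return resp

@app.route("/cart")
def cart():
    resp = make_response(render_cart())
    # Personalised pages must never be cached at the edge.
    resp.headers["Cache-Control"] = "private, no-store"
    return resp

def render_product(product_id):
    return f"<h1>Product {product_id}</h1>"  # stand-in for real templating

def render_cart():
    return "<h1>Your basket</h1>"            # stand-in for real templating
```

The hard part, as noted above, is rarely setting the header; it is making sure your framework or CMS doesn’t emit cookies or per-user markup on the pages you intend the edge to cache.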
So I guess you could start with load testing. But, somewhat controversially, I don’t believe routine load testing is likely to help you much. In my experience, it’s often an expensive waste of time.
Why so? Because real user behaviour is complicated. You may set up basic user journeys for your load tests that involve users browsing a page, selecting a product, adding it to a cart and checking out - but that isn’t what happens in the real world. There, customers go back and forth, looking at all their options, changing their minds, pausing for long periods before reloading pages, then opening four tabs onto different products to compare. There could be thousands of these complicated journeys happening at the same time, cross-referencing the same database rows in unpredictable ways.
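For illustration, here is roughly what one of those tidy, scripted journeys looks like as a Locust script. The URLs and payloads are hypothetical; the point is how linear and obedient this virtual shopper is compared with the real customers described above.

```python
# A hypothetical "happy path" load-test journey, written for Locust
# (locust.io). The URLs and payloads are invented for illustration -
# note how linear and obedient this virtual shopper is.
from locust import HttpUser, task, between

class ScriptedShopper(HttpUser):
    wait_time = between(1, 3)  # polite, predictable pauses between steps

    @task
    def browse_and_buy(self):
        self.client.get("/")                                  # land on the home page
        self.client.get("/products/123")                      # view exactly one product
        self.client.post("/cart", json={"product_id": 123,
                                        "qty": 1})            # add it to the cart
        self.client.post("/checkout", json={"card": "test"})  # check out, every time
```

Real shoppers, of course, hit the back button, abandon carts and hammer the same product page from four tabs at once - behaviour no tidy script like this reproduces.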
So you may run a series of load tests on the staging environment that break the website at two million users and declare two million to be the magic number. But how realistic are those two million user journeys? Do they really demonstrate two million authentic, usable experiences on the production environment? Or would the live website have crashed much earlier when users started behaving realistically? (Never mind the moment when someone decides to run an intensive sales report during the actual on-sale… oh you’re not testing for that?)
Let’s not forget that load testing is also incredibly time consuming. Even with a simplified user journey script, you will need to re-run and re-run the tests, fixing configuration issues until it works. And then, when it works, you will need to run it until it breaks the website. It can be an endless cycle. Many of our clients simply can’t get their load testing tools to generate the amount of traffic they expect to see in the real world, or find it prohibitively expensive to do so.
Finally: load testing of the kind we’re describing is a big project. I know plenty of companies that run a big load test once per year. But they are pushing new code into production many times per week. It’s entirely possible for one line of code in the wrong place to invalidate all of that load-test effort and totally change the numbers you are working to. You’re pushing code into production using continuous integration now. If only there were some way to run a continuous load test…
The continuous load test
I’d like you to consider CrowdHandler as a much more productive, affordable alternative to traditional load testing. Why? Because CrowdHandler gives real-world, continuous performance information, based on the website it is protecting, and allows you to work more efficiently and iteratively.
During an on-sale, your dashboard shows how long each page takes to load and summarises overall page performance. In real time, CrowdHandler’s auto-tune feature will analyse the page speed, track the number of users and find the optimum rate of new users that your site can handle.
In other words, it is running a continuous load test against your production environment, using real-world users, and adjusting the outcome in real time.
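To make the idea of a continuous, self-adjusting test concrete, here is a toy sketch of the kind of feedback loop this implies. To be clear, this is not CrowdHandler’s actual algorithm - the thresholds, step sizes and function name are invented purely to illustrate how observed response times can drive an admission rate.

```python
# Toy illustration only - NOT CrowdHandler's implementation.
# A feedback loop that nudges a waiting-room admission rate up or down
# based on observed page response times. All thresholds are invented.

def adjust_admission_rate(current_rate: float,
                          avg_response_ms: float,
                          target_ms: float = 800.0,
                          step: float = 0.1,
                          min_rate: float = 1.0,
                          max_rate: float = 500.0) -> float:
    """Return a new users-per-minute admission rate.

    If pages are comfortably fast, admit more users; if they are slowing
    down, back off before the site falls over.
    """
    if avg_response_ms < 0.75 * target_ms:
        new_rate = current_rate * (1 + step)      # plenty of headroom: open the tap
    elif avg_response_ms > target_ms:
        new_rate = current_rate * (1 - 2 * step)  # slowing down: close it quickly
    else:
        new_rate = current_rate                   # near the target: hold steady
    return max(min_rate, min(max_rate, new_rate))
```

With these invented numbers, for example, an average response time of 1200 ms would cut an admission rate of 100 users per minute back to 80, before the site ever gets close to falling over.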
But - unlike a classic load test - it has an in-built safety valve: the waiting room itself. If the numbers turn out to be way off, the worst thing that can happen is not a dead website, but a longer queue.
And then, even if people are queuing a little longer than you’d like, you can watch and learn from the real-world behaviour of the system. You can see which pages are particularly slow, enabling you to find the bottlenecks in your application, while auto-tune manages the queue and keeps the user experience positive. You can see which parts of the site you will need to concentrate on optimising for next time by observing data from a real-life on-sale.
Without a waiting room in place, if your application was suffering from an unknown bottleneck, it’s very unlikely you’d be able to diagnose or even observe it, because your website would be dead, and you’d be frantically trying to deploy emergency mitigation measures. Even if you were able to recover diagnostic data later, a spiralling load emergency, compounded by uncontrolled traffic, is unlikely to paint you a clear picture of the root cause.
And that’s why I talk about CrowdHandler as an essential part of your tool kit, rather than a sticking plaster. By choosing our waiting room, you’re not giving up and applying a quick fix: you’re incorporating an additional diagnostic tool that can help you keep a clear head, and make continuous improvements to your performance, whilst reducing social media flak.
So if you find yourself considering CrowdHandler, please, don’t come to us with your tail between your legs. Hold your head up high! You’re a web developer. You’re working on wildly popular projects and you’re committed to making things even better. You’re doing everything right.