Could it be magic? Auto-scaling with CrowdHandler

Every so often, we see comments from developers on social media wondering why anyone would use a waiting room solution. To them, installing CrowdHandler would be an admission of defeat rather than an astute addition to their toolkit.

After all, why would you need a waiting room solution when you can just use a magical, infinitely scaling cloud?

“Decent load balancing strategy with unlimited expansion vs customer annoying waiting room hmm hard choice”

Unlimited expansion? Doesn't exist

For some reason, a lot of developers seem to assume that their cloud solutions – such as auto-scaling EC2 groups, AWS Lambda, or the Azure or Google equivalents – are able to scale infinitely.

Perhaps they believe they are made of magic.

The truth is that cloud solutions can only scale within certain parameters and, regardless of how many servers you can magically add, you will eventually hit limits.

Some of those limits are ones you set and pay for yourself: in other words, the bounds within which your cloud provider allows you to operate at your price point. But there will probably also be an ultimate limit based on your database technology. (After all, no ACID storage layer can scale infinitely, because by its very nature, no ACID transaction can be infinite.)

So, when I see a comment like the one above, I know the person hasn't actually done a lot of scaling in the wild.

Because, once you have worked with an application attempting to scale to huge levels of traffic in a very short space of time, you will hit those parameters and you will understand that there's no such thing as "unlimited expansion".

Auto-scaling works . . . until it doesn't

Sure, you have a really nice scalable application that can auto-scale throughout the course of a day, or maybe over a month. You can tweak it to meet varying traffic levels without any problems, even during the busy seasons. Your auto-scaling architecture will notice you're a bit busier and automatically add servers to deal with it... then it will notice when things quieten down, and take them out of the loop. It works well.

This is all great, 99% of the time. But the traffic patterns that would require your auto-scaling architecture to be infinitely scalable are not like this at all. We are talking about the scenario where your application is suddenly ten times busier, a hundred times busier - maybe even a thousand times busier . . . in under a minute.

Suddenly a thousand times busier

A ticket onsale is the most obvious example of this kind of sudden traffic peak that we observe here at CrowdHandler. When tickets for the latest megastar tour go on sale, for example, a seller's site can go from zero to a million on the dot of 10am.

But it’s not just megastar ticket sales. A sudden traffic peak can happen to anyone. Any kind of sale with a drop dynamic; sudden scarcity of a popular line; a mention of your product by a social media influencer (predictably or not!): these can all bring on a surge of traffic.

What tends to happen in these cases is that, yes, the auto-scaler recognises the traffic increase - perhaps it notices that page load times are getting longer, or that CPU usage on the servers is higher - and reacts by bringing up more servers.

However, this process isn't instant. Containers need to boot and caches need to warm up.

It's ironic really: as auto-scaling behaviours kick in, we usually see a marked decrease in performance. This lull, until systems stabilise again, lasts up to five minutes.

If a waiting room has been set up, this is the point at which it kicks in. When performance slows down to the point where end users would otherwise be in danger of experiencing a crashed application, the waiting room is there to hold users at bay for those few minutes while everything warms up again.
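To make that warm-up gap concrete, here is a minimal sketch of a surge hitting a system whose extra capacity only arrives after a delay. Everything in it is an assumption made for illustration: the function name simulate_surge, the steady arrival rate, and the idea that capacity jumps in a single step once warm-up finishes. It is a toy model, not a description of how any particular auto-scaler or CrowdHandler itself behaves.

```python
def simulate_surge(arrivals_per_sec, base_capacity, scaled_capacity,
                   warmup_secs, duration_secs):
    """Toy model of a waiting room in front of a slowly scaling backend.

    Returns a list of (second, admitted, queued) tuples. Capacity stays at
    base_capacity until warmup_secs have passed, then jumps to
    scaled_capacity (a simplification: real scaling is more gradual).
    """
    queue = 0
    timeline = []
    for t in range(duration_secs):
        capacity = base_capacity if t < warmup_secs else scaled_capacity
        queue += arrivals_per_sec       # new visitors join the queue
        admitted = min(queue, capacity)  # never admit more than the site can serve
        queue -= admitted
        timeline.append((t, admitted, queue))
    return timeline


if __name__ == "__main__":
    # Hypothetical numbers: 100 visitors/sec, a five-minute warm-up.
    run = simulate_surge(100, 50, 200, warmup_secs=300, duration_secs=600)
    peak = max(queued for _, _, queued in run)
    print(f"peak queue length: {peak}")
```

In this model the backend is never asked to serve more than it can handle; the excess waits in the queue during warm-up and drains once the new capacity is online.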

Triple 9 uptime

By the way: what does your SLA say about overall uptime? Because, over a month, those five minutes it took to scale back up will hardly register, despite potentially doing a lot of damage. Do the math: five minutes out of the roughly 43,920 minutes in a month is about 0.011% downtime, which means you're still seeing roughly 99.99% uptime, comfortably inside a triple-nine (99.9%) target! But those five minutes are probably the most critical you'll have all month in terms of user experience and reputation.
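If you want to check that arithmetic against your own SLA, a few lines of Python will do it. This is only a sketch: the 43,920-minute month (30.5 days) is the figure used above, and the function names are made up for this example.

```python
MINUTES_PER_MONTH = 43_920  # 30.5 days * 24 hours * 60 minutes


def uptime_percent(downtime_minutes, total_minutes=MINUTES_PER_MONTH):
    """Uptime as a percentage, given minutes of downtime in the period."""
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


def downtime_budget_minutes(sla_percent, total_minutes=MINUTES_PER_MONTH):
    """Minutes of downtime an SLA of sla_percent tolerates in the period."""
    return total_minutes * (1.0 - sla_percent / 100.0)


if __name__ == "__main__":
    print(f"uptime after a 5-minute wobble: {uptime_percent(5):.3f}%")
    print(f"triple-nine budget: {downtime_budget_minutes(99.9):.1f} minutes")
```

A triple-nine SLA tolerates nearly 44 minutes of downtime in a month of this length, so a five-minute wobble at the worst possible moment barely dents the number.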

So what can you do to prevent problems?

Mitigating traffic overload - 1: When it's predictable

Often, you know when a traffic peak is going to hit. You might do your big ticket onsales on Friday mornings. You might be planning a Black Friday sale. So our advice is: don't wait for the auto-scaler to react to the traffic! Schedule your magic cloud to scale up in advance. (This is what our friends at Made Media do: they know that a bunch of their clients are going to go on sale, they know that this typically happens on Friday mornings, so their infrastructure is scheduled to scale up every Friday morning.)
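As a rough illustration of scheduled scaling, the sketch below builds the settings for a recurring Friday-morning scale-up on an AWS Auto Scaling group. The group name, action name, capacities and 30-minute lead time are all hypothetical, and the 10:00 onsale is assumed to be in UTC. The helper only builds a dictionary; the commented-out lines show how it might be handed to boto3's put_scheduled_update_group_action, which needs AWS credentials and a real group to run. This is not a description of Made Media's actual setup.

```python
def friday_prescale_action(group_name, onsale_hour_utc=10, lead_minutes=30,
                           desired=20, max_size=40):
    """Build settings for a weekly scale-up ahead of a Friday onsale.

    All defaults are illustrative assumptions, not recommendations.
    """
    start_minutes = onsale_hour_utc * 60 - lead_minutes
    if start_minutes < 0:
        raise ValueError("lead time would cross midnight into Thursday")
    hour, minute = divmod(start_minutes, 60)
    return {
        "AutoScalingGroupName": group_name,
        "ScheduledActionName": "friday-onsale-prescale",  # hypothetical name
        # Cron fields: minute hour day-of-month month day-of-week (5 = Friday)
        "Recurrence": f"{minute} {hour} * * 5",
        "MinSize": desired,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
    }


if __name__ == "__main__":
    action = friday_prescale_action("web-asg")  # "web-asg" is a made-up group
    print(action["Recurrence"])
    # To apply it for real (requires boto3 and AWS credentials):
    # import boto3
    # boto3.client("autoscaling").put_scheduled_update_group_action(**action)
```

You would pair this with a second scheduled action that scales back down once the onsale is over, so you are not paying for peak capacity all weekend.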

And … you guessed it: install CrowdHandler. Because even if you are extremely well versed in the parameters of your scaling, and extremely confident that you understand the level of traffic you're going to receive, there's still a massive margin of error. A waiting room, however, will stand up to almost anything (and remember, it won't actually engage if it's not necessary).

We can also say with confidence that surprises happen. Often, a customer will query why a waiting room has engaged and tell us that their CrowdHandler dashboard is reporting the "wrong" traffic, or "traffic that isn't there". But the traffic is there - it’s just unexpected. Later, they find out that they got a shout-out from an influencer, causing high numbers to hit their site at a time they weren't expecting. In the ticketing world, we are seeing more and more artists jump the gun and share a link as part of a tour announcement drop without consulting the carefully prepared PR schedule.

Mitigating traffic overload - 2: When it's unpredictable

If you don't know when the traffic peaks are going to hit (or you are coordinating with unpredictable influencers!) then you should definitely install CrowdHandler as a catch-all, running all the time.

Even better, use it with Autotune enabled. In this scenario:

1. Traffic increases
2. The queue engages and CrowdHandler's Autotune manages the rate
3. The auto-scaling cloud responds to the increased load and starts scaling, which slows performance down
4. CrowdHandler's Autotune senses the slowdown, holds some users in the queue and gives them a positive user experience with appropriate messaging
5. Performance levels come back to normal and Autotune increases the rate to rapidly empty the queue
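
The feedback loop in those steps can be sketched in a few lines. To be clear, this is not CrowdHandler's actual Autotune algorithm, which isn't described here; it is a generic "back off quickly, recover gradually" controller, and every name, threshold and step size in it is an assumption chosen for illustration.

```python
def next_admit_rate(current_rate, response_ms, target_ms=500.0,
                    min_rate=10.0, max_rate=1000.0, step=25.0, backoff=0.5):
    """Pick the next admission rate (users per minute) from observed latency.

    If the site is responding slower than target_ms, cut the rate sharply;
    otherwise raise it by a fixed step. The result is clamped to
    [min_rate, max_rate]. All parameter values are hypothetical.
    """
    if response_ms > target_ms:
        proposed = current_rate * backoff  # site is struggling: hold more users
    else:
        proposed = current_rate + step     # site is healthy: drain the queue faster
    return max(min_rate, min(max_rate, proposed))


if __name__ == "__main__":
    rate = 200.0
    # Made-up response times: a slowdown during scaling, then recovery.
    for observed_ms in [300, 900, 1200, 700, 400, 350, 300]:
        rate = next_admit_rate(rate, observed_ms)
        print(f"response {observed_ms} ms -> admit {rate:.0f} users/min")
```

The asymmetry is deliberate: when the site slows down, holding a few more users in the queue costs much less than letting the application tip over.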

A final word on user experience

Here's a funny thing. Even if your magic auto-scaling cloud worked without a hitch and your product sold out in less than a second, your users would have a less than optimal experience (and may share their hard feelings). That's because, when something happens very quickly - for example, if there's seemingly no time at all between the product becoming available and being sold out - it can feel very unfair to the end user. We have had customers tell us that the CrowdHandler process and its associated messaging keep users on side, by giving them time to breathe and keeping them informed.

So, when the waiting room kicks in, trust the process. Don't panic (but do use the message fields wisely). Your users will be much more comfortable waiting in line for a few seconds or minutes than they would be looking at a crashed site or an immediate "sold out" page.