Making gamers happier, one nudge at a time Queueing Theory: A Brief introduction to Nudge
I've always been intrigued by queueing theory. We are intuitively familiar with queueing in our day-to-day life, like waiting in line at the bank or in the car waiting for our turn to pass a four-way stop.
I had the opportunity to focus on Queueing for Luna, Amazon's Cloud Gaming Platform. If you've ever played online games, you're familiar with waiting in a queue to find a match or for the game to launch.
Waiting can be frustrating and leads to disengaged gamers. That's why reducing the time it takes to get to the fun stuff is essential.
I ran into this video from SIGMETRICS 2021 about Nudge while exploring “Time in Queue” optimizations. Nudge is a job scheduling policy that is an optimization on top of First-Come-First-Serve (FCFS) or First-In-First-Out (FIFO). I was immediately intrigued.
FIFO/FCFS is a widely used scheduling policy in the industry due to its performance and simplicity. It is available in most clouds, like AWS SQS, Azure Queues , and RabbitMQ.
FIFO does not require knowledge about the size or complexity of processing the items in the queue. It has minimal overhead while ensuring that no job is left waiting indefinitely, making it a good choice for ensuring fairness among all jobs.
Nudge
Here’s how Nudge works:
When a small job arrives in the queue, the scheduler checks for the job immediately ahead of it. If that job is bigger, we swap their position in the queue and prioritize the smaller job.
There’s a limit. A bigger job can only be “jumped” once.
This is a simplistic Nudge Scheduling Queue animation:
It has only two types of jobs, small and big. Each job shows its “arrival timestamp” (in mm:ss.ms), making it easier to confirm when a job is older than another job. Finally, jobs that arrive closely together tend to have a similar color. That way, when a job “jumps” over another, it looks out of place.
Here is the interactive version of the animation, and the source code.
Comparing Nudge and FIFO
I wanted to see how Nudge compares to FIFO in terms of reducing wait times for most users while still being fair. I made a simulation that models a more complex gaming scenario to find out. The goal is to see if Nudge’s benefits in getting gamers to play the game faster (the fun part) are significant enough to justify a complete implementation.
In the simulation, we have four games Tunic, Hollow Knight, Lost Ark, and World of Warcraft. It models traffic with imaginary values where each “Game Activity” represents someone that plays the game on some imaginary platform, so they download, install and launch the game.
Nudge uses “job size” to decide when to nudge. I used the size on disk for each game as a proxy metric for the time it’ll take to finish each one of the steps until we have the game running.
In queueing theory, the size usually represents the time it takes to dequeue and successfully process a job.
Rules of the Game
In the simulation, small games are 10GB, and big games are 50GB or more. There are four games:
- Tunic is small and very popular.
- Hollow Knight (HK) is small but not so popular.
- Lost Ark (LA) is medium size and somewhat popular.
- World of Warcraft (WoW) is big and just as popular as Lost Ark.
I used the concept of “Game Runners” that download , install and launch the game (ignoring launch time to keep it simple). A typical run with a thousand gamers, would get a distribution similar to this:
Game Distribution:
"Tunic": 556,
"World of Warcraft": 186,
"Lost Ark": 168,
"Hollow Knight": 90
This means gamers played Tunic about 6x more than Hollow Knight, and Lost Ark and World of Warcraft were approximately 2x more popular than Hollow Knight.
The “Game Activity” keeps track of the activity’s metadata:
- When the user “clicks play” to launch the game.
- The time it took to find a runner for the game.
- The time it took to download and install the game.
- If the activity was nudged or caused a nudge.
Each activity launches simultaneously on a FIFO scheduler and the Nudge scheduler. Both use the same list of games and the same configuration values.
Initial Results
I ran the simulation multiple times with 15,000 gamers using the same distribution as above. Here are the summary results of one of the runs:
Scheduler | Average | Best | Worst |
---|---|---|---|
Hollow Knight (1324 players) | |||
Nudge | 1s 753ms | 1s 556ms | 1s 954ms |
FIFO | 1s 749ms | 1s 558ms | 1s 947ms |
Lost Ark (2756 players) | |||
Nudge | 6s 553ms | 6s 357ms | 6s 751ms |
FIFO | 6s 553ms | 6s 357ms | 6s 749ms |
Tunic (8140 players) | |||
Nudge | 1s 751ms | 1s 554ms | 1s 952ms |
FIFO | 1s 752ms | 1s 554ms | 1s 956ms |
World of Warcraft (2780 players) | |||
Nudge | 30s 554ms | 30s 363ms | 30s 754ms |
FIFO | 30s 554ms | 30s 358ms | 30s 751ms |
Summary Results — Nudge and FIFO. Run with 15,000 different gamers.
It didn't show a big difference for all games’ average, best or worst time in queue. Even when the Nudge scheduler "nudged" 2566 times.
The average time in queue for Tunic (the smallest and most popular game) is 1.751s (Nudge) and 1.752 (FIFO). World of Warcraft (biggest game) is tied in both at 30.554s.
Different runs showed slightly different results but nothing widely different. Two things that I was expecting to see:
- Much longer times for WoW since it’s technically de-prioritizing it.
- Slightly shorter time for Tunic on the worst time in queue.
Plotting Game Activity
I wanted to visualize the Game Activity metadata to understand what was happening, and things got interesting again:
The chart above reveals that Nudge (blue line) generally results in shorter wait times (or ‘time to play’) than FIFO ( green line).
The scatter plots above compare the performance of Nudge and FIFO per game. Each game is represented by a different color, and the circles’ width corresponds to the game’s size. It shows the relationship between each player’s arrival time and their time to play.
Nudge effectively reduces wait times for smaller games, with little to no impact on larger games. In fact, the data shows that Nudge has a better ‘tail’ (lower wait times) for all our gamers. How exciting!
Time to play distributions
This was the best part for me, it clearly shows what I was expecting to see in the summary results:
This paints a better picture of how Nudge works and gives us a better understanding of how it affects our scheduler’s “long tail” behavior.
Nudge generally leads to shorter wait times for gamers.
Understanding Nudge
Let’s break it down only with Tunic (red) and World of Warcraft (green). The bars represent the frequency of gamers’ wait times. As we run out of resources, the wait time increases.
For Tunic, Nudge significantly increases the number of players who waited less than 200ms to play.
In contrast, for World of Warcraft, the number of players waiting less than 100ms is lower and decreases further between the 100–200ms range. It's pushing the bulk of waiting time to the 200–400ms range but reducing the total number of players who needed more than 800ms.
The data shows that the distribution tails (representing the gamers with the worst wait times, like the one on the initial tweet) are thinner with Nudge. Lost Ark has a slightly longer tail, but that looks like an outlier.
Nudge generally performs better, reducing the worst wait times for all players.
Outro
I’m confident that Nudge generally improves waiting times for this use case and is especially effective at reducing “tail latency” (the longest waiting times). From that perspective, it makes sense to experiment further.
However, implementing Nudge at scale might be a different feat altogether. Existing queue services like SQS do not provide a way to “bring your own scheduler”. Implementing Nudge would require adding an abstraction layer on top of the queueing system or replacing it entirely.
I think there are two abstractions that are not terribly complicated:
Implementing a database-backed queue like this one.
A simplified version of a database-backed queue, using a table to track the state of the jobs. Query the table to implement “peeking”, then implement Nudging by “tomb-stoning” the nudged entry on the database so it’s a no-op, and insert it again with a new id after the item it just arrived to represent the state after nudging.
Then there’s RabbitMQ. It has support for plugins, and there are existing plugins that implement prioritization, building a nudge plugin might be another solution.
The next steps for me are exploring these abstractions.
Stay tuned, and thanks for reading! Now, coffee time. Enjoy!
Coffee time
Coffee Time: Espresso Martini
Method: Espresso
Preparation Time: 10 minutes
Steps
- Pour 45ml of Vodka into a shaker
- Add 20ml of coffee liqueur
- Add 10ml of vanilla or simple syrup
- Add 1 shot of Espresso
- Shake well
- Serve into a chilled glass, strain the ice
- Once served, spritz a lemon twist over the glass to get a bit of citrus oils a nice contrasting aroma
Here's the full Espresso Martini explanation.
Learning More 📚
Dive deeper into Nudge with the video introduction, the full paper, and proof that even with weaker conditions Nudge is still better than FIFO for light-tailed job distributions (like the one shown above).
Explore how distributed systems rely on queueing to handle high loads and spikes in traffic on Amazon Builder’s Library.
Learn about the Queue-Based Load Leveling pattern on Microsoft’s Cloud Design Patterns.
Read about load balancing, queueing and the power of two random choices on Nginx’s blog.
Join the Conversation via Github and Giscus