1.0 Launch Retrospective

EHG_Kain · March 20, 2024, 9:59pm

Hello Travelers!

Today’s blog is a little different from our usual fare. As most of you know, Last Epoch launched on February 21, and the reception has been amazing. In the first week after launch, over 1.4 million of you logged in to play Last Epoch. At our peak, we had just under 265,000 players all roaming Eterra simultaneously. That’s good enough for the 39th highest all-time concurrent player count recorded on Steam, and we’re humbled by your support and enthusiasm for the game.

There’s plenty of cause for celebration, but let’s not ignore the obvious: Last Epoch’s launch was pretty rough for the majority of you who play online. Your patience and positivity have been amazing, but it was obviously not the launch experience that you or we had hoped for.

Now that the initial launch woes are behind us, it’s time to reflect on the experience. What happened? We put a strong emphasis on testing our servers and infrastructure ahead of time, so what did we get wrong? Last Epoch’s backend team is here to give you a bit of a recap of what went on during the launch.

How Our Game Works

Let’s begin with a quick explanation of how our game works when played online. When you boot up Last Epoch and enter the game, what you see as a player is relatively straightforward; first you log in, then you select your character, and then you join a game server. Behind the scenes, though, what you’ve just done is communicate with and move through half a dozen online services. These services log you in, provide you with game data, and give you a server to connect to so you can play the game. You connect to this game server, but then the server itself goes out and talks to even more services to authenticate you, load your character data, and check things like your party membership.

These supporting services are the “backend” of our game, and without them our game doesn’t work. Some services are more important than others, but as a general rule, most services are required for the game to function. The good news is these services are all pretty resilient. Behind the scenes, each service is not one program but many copies of the same program. If one of the copies breaks down, the other copies keep working. Crucially, if a service is overloaded, we can fix that by just deploying more copies of the program. If our services are designed properly, we can handle any number of players; all we need to do is throw more copies at the problem.

(This design applies to our game servers, too; Last Epoch has never once “run out” of game servers for players because they’re all interchangeable, and we can create as many new ones as we need. The closest we ever come to running out of servers is when we get a spike in players and need to wait a short amount of time for extra ones to boot up).

Preparing For Launch

The “designing our services properly” bit is the hard part. For some of our services, it’s very challenging to design them in a way that actually lets us scale just by throwing more copies at the problem. If you were around for the launch of patch 0.9.0, our first multiplayer release, you saw our game go down as soon as we crossed 40,000 players. Why? The service that matches players to servers had a design flaw where it started slowing down when there were many servers available. Once about 40,000 players joined servers, the performance of the matcher got so bad that it didn’t matter how many copies we threw at the problem - every copy would crash under the strain.

The backend team spent much of the time between 0.9.0 and 1.0.0 hunting down and fixing these sorts of design flaws. Some flaws could be fixed easily, others required entirely new services to handle our players’ data in a way that could actually scale. We were given a blank check to do whatever was needed to support our launch, so the only obstacle was time.

By the week before launch, we seemed to be in a reasonably comfortable spot. Everything was built, and we’d performed multiple rounds of load testing on our entire backend to ensure it could handle the volume of requests we expected at launch. The results were promising, and we hadn’t pulled any punches on the testing either. We were as ready as we were ever going to be for launch.

The Morning Of Launch

For something like a game launch, “readiness” is mostly about having a plan. You can have confidence that you’ve tested and prepared, you can have confidence that you know how your services work, but you can’t have confidence that nothing will go wrong. Instead, you plan for what to do if something unanticipated happens.

On the morning of launch day, we went to scale up our server matching service to the numbers we used in our tests, and to our great surprise, it refused to spin up more than half the copies we asked for. Server matching is a critical service, and in our testing, it needed a high number of copies to handle all the players we expected, so getting stuck at half capacity was a serious problem.

This wasn’t even the only pre-launch hiccup. In a case of unfortunate timing, our service host had an incident the night before - still ongoing at launch time - that affected us in a way that prevented us from deploying changes to any of our backend services using the tools we had relied on for months. Our ability to fix our services was killed at the same time one of our services needed fixing.

We had workarounds for these problems, but they were not quick fixes. We were going to need to break apart our deployment tools and move our services around manually, but this was not something we could sneak in before the doors opened to all our players. Minutes before launch, we estimated that we could handle maybe 120,000 - 150,000 players before things started to fail, and we crossed our fingers that we’d be able to resolve our issues before the player count crept too high.

Well, you know how that went.

The Next 5 Days

What unfolded over the next five days was a blur of emergency fixes and risk management. As it so happened, our first two pre-launch problems were only the tip of the iceberg.

In software, you sometimes run into a problem called “cascading failure.” When different parts of a software system rely on each other, an error in one part can cause errors in all the other parts as well. This can make it look like the entire system is failing even though only one part actually is. Finding the root cause of the failure is very difficult when everything is failing all at once.

The server matcher problem had caused a cascading failure in our systems. When players fail to connect to a server, they usually just try again, meaning our struggling server matcher had to deal with 2-5x the number of requests it would normally need to deal with. In many cases, players would get through the first half of server matching but would fail the second half, meaning servers would bring themselves online and then shut down again because a player never connected. Servers booting up and shutting down put pressure on our other services, and so some of those also started to fail.

When we fixed the server matcher, some other services continued to fail because they had trouble recovering from the chaos. Our deployment tools still required attention, so fixing these other services was a slow and manual process. To clear up the backlog, we needed to scale some of our services way past what would have been needed had the game been working smoothly. This brought with it new challenges since cloud services have some built-in soft caps that we would never have hit under normal circumstances, and working around those caps took time and, in some cases, code changes. We identified and cleared away many of these caps before launch, but we hit new ones as we scrambled to rearrange our backend.

You may be wondering, if the problem was recovering from too many players, why did we not simply have some downtime, or at least turn on player queues, to alleviate the pressure? The answer is that we did, but the problems ran deeper than the server matcher. At various points during the launch, we brought down our services, and many of you found yourselves in long queues as we struggled to keep up. Inevitably, once we started letting you all back in, we would run into problems again, and we could not clearly see what those problems were until we scaled everything up so high that the services stayed online and operational even though they were strained from other failures in other areas.

Sprinkled in with our deployment woes, we had a couple of genuine code problems in our services. One of them - one of the few examples where we straight up overestimated our ability to scale - was a bottleneck in how quickly we could process requests for a single town in a single region of the world. In Last Epoch, once you reach the “End of Time” town, you will always load into that town when you enter the game from character select. We knew ahead of time that this bottleneck existed, but we underestimated what would happen when we suddenly fixed a broken game and hundreds of thousands of players all tried to access the same town at the same time, in the same part of the world. We thought that the server matcher would be a little slow at first, and then it would start to fix itself as more and more people got into the game. What happened instead was that it took so long to get into the game that almost no one actually succeeded, so they quit and tried again, over and over, and the problem never tapered off. This was not the kind of problem we could fix just by adding more copies of the service, so it took some emergency problem solving.

Each time we found a problem and fixed it, we immediately saw improvement, but this allowed even more players to enter the game and play, which would uncover the next problem, and so on.

The Final Fix

By Sunday, we had managed to fix, deploy, and scale our services to the point that most of our backend was handling over 200,000 players just fine, even through flurries of retries and errors. Yet amidst all the chaos, there was still some strange behavior happening on the game servers that was causing problems. During our periods of stability, when the game was up, players were able to connect to game servers, but their connections would often time-out once they got there.

Every time a player joins a game server, the game server checks to see if the player is in a party. This is a simple operation, and in all of our testing, we saw that this check completed very quickly, even under heavy load. Yet our logs were telling us that checks for a player’s party were taking up to a minute, sometimes even longer.

Over the first four days, we made a number of changes to the party service to alleviate pressure. Each fix helped for a while, but inevitably, it always slowed down again until players could no longer join servers. On day five, with all the other backend problems solved, we were able to get a more precise look at the party errors, and the culprit was a single, innocent-looking line of code. A single line of code that was supposed to be the most efficient request in our entire party service but instead ended up consuming all of the service’s resources under heavy load, slowing the entire service to a crawl.

It took about an hour on Sunday afternoon to rearrange the party data so we could check over 200,000 players’ party memberships without bringing down the service. We deployed the fix, the game came back up, and it’s been online ever since.

Lessons Learned

This blog post is 2,000 words long, and there is still a whole lot more we could say. Internally, we have been cataloging and planning for ways we can improve, and we want to ensure that our processes moving forward include the lessons we have learned from the launch.

First, we learned the hard way that our internal tooling for deploying our services was not robust enough on launch day. Our tools were too brittle (breaking when certain services went down) and too inflexible (too many manual adjustments needed in an emergency). When the system came under strain, we couldn’t deploy our fixes quickly, and we usually had to cause additional downtime to do it. Had our deploy tooling been stronger, we could have gotten to a stable state much more quickly. Our top priority on the team right now is improving our tooling so we can effectively respond to situations like these.

Second, our services themselves could be more flexible. We had to make many changes over the course of the launch that should have been simple configuration changes but instead required a full redeploy, which turned simple fixes into long, risky operations. This weakness was identified ahead of time and has now become a top priority to improve.

Third, we need to do a better job of anticipating how player behavior affects our backend. Our testing was designed to simulate how and when our services would break, but we needed to spend more time considering how the conditions would continue to change once things started failing. Now that we’ve seen what happens during a fraught launch - how players put pressure on services differently than when everything is working - we’ll be able to incorporate that data into future tests.

(As an aside, our testing effort, in general, was a huge success. Despite how it may look from our launch struggles, our testing identified many other critical issues leading all the way up to the week before launch, and without those efforts, we might still be trying to fix the game to this day. Even though we’re post-launch, we plan to continue incorporating load testing into our regular development cadence going forward.)

Thank You

With the launch behind us, we’re all very thankful to you players for showing so much passion for the game, despite the rocky start.

EHG started from a group of gamers hoping to make the ARPG they wanted to play, and now EHG is a group of gamers hoping to make the game YOU want to play. Your passion, enthusiasm and, when deserved, criticism have continued to encourage our teams to deliver that game and push the definition of what an ARPG can be, should be, and will be. Our team could not have made Last Epoch what it is today without you, and we will endeavor to keep making the game you all deserve.

Here’s to a bright future, Travelers.

GaryOak · March 20, 2024, 10:04pm

Thanks very much for this postmortem!

Makkii · March 20, 2024, 10:06pm

Thx for all
We waiting for next content update love the game

soul02 · March 20, 2024, 10:08pm

I look forward to the next season and the new mechanics that may arrive =D

Yunicorn · March 20, 2024, 10:13pm

Fascinating insight. Thank you very much for this write-up.

Burb · March 20, 2024, 10:14pm

Thank you. <3

For something like a game launch, “readiness” is mostly about having a plan. You can have confidence that you’ve tested and prepared, you can have confidence that you know how your services work, but you can’t have confidence that nothing will go wrong. Instead, you plan for what to do if something unanticipated happens.

This is how you do it. Failure is awesome because it shows you where the edges are. The important thing isn’t that you had a failure, it’s how you respond to it. These folks get it.

VieniSu · March 20, 2024, 10:17pm

Thank you <3

Nevieth · March 20, 2024, 10:19pm

I’m so excited for the future of Last Epoch.

Doxington · March 20, 2024, 10:20pm

I love this level of communication, transparency, and just honesty. Thank you EHG

Hilio · March 20, 2024, 10:21pm

I cannot express how much I appreciate this communication.
Thank you all for the hard work and especially the backend team.

CaiusMartius · March 20, 2024, 10:28pm

You guys are rockstars. From your continued unabashed love of the game, your direct, frank and open communication with the community to your downright tenacious and, at times seemingly, unending work ethic.

Just keep on doing your thing. A hell of a lot of us aren’t going anywhere.

This is something so crucial and so overlooked by 98% of the armchair developers. In the film business 95% of our work is done PREPARING, because as the old adage goes, no plan ever survived contact with the “enemy.” Or in this case, ‘launch.’

Pendragon925 · March 20, 2024, 10:28pm

can’t wait for the future!

TizianOwO · March 20, 2024, 10:39pm

Appreciate the transparency in spite of the fact almost all of the people complaining (and saying you’re fools for not psychically predicting every possible thing that can go wrong), will almost certainly never read this.

bing_crosby · March 20, 2024, 10:42pm

Super interesting, thanks for taking the time to write all this up.

PhantomOfVoid · March 20, 2024, 10:42pm

Thanks <3

Gavryn · March 20, 2024, 10:49pm

Here’s to a bright future EHG! Thanks for the continued communication and transparency, it should be a gold standard in the gaming space.

ulruc2525 · March 20, 2024, 11:03pm

What does not kill you only makes you stronger.

I guess now the H in EHG stands for Hercules

psychodrive · March 20, 2024, 11:24pm

Trying to fit a square peg into a kube hole sure does go this way sometimes.

Keep on keeping on folks

beserkjay · March 20, 2024, 11:44pm

Thank you for posting this. I am really enjoying the game. Keep up the great work!

Jerle · March 20, 2024, 11:53pm

I really enjoy reading posts like these. Keep it up, EHG!