Competitive Game Day

Imagine your ideal on-call day. Do you imagine nothing but a few minor events, each with a detailed dashboard or step-by-step manual? In contrast, we believe that most on-call days should be as seamless and routine as any other day. We invest our time in making on-call duty obsolete. However, this presents a challenge: the team may not gain sufficient on-call experience for handling rare incidents.

Forter Payments services are essential to our merchants' checkout flows, demanding exceptionally high availability. To achieve this, we need to prepare for worst-case scenarios. The real challenge is not only preparing as a team but also ensuring that each individual on-call engineer can handle incidents independently.

So, how do we gain on-call experience while minimizing actual on-call duty? We create synthetic on-call experience using Game Days.

Game Day

Game Day is a well-known technique in Chaos Engineering, widely discussed by AWS, Stripe, Shopify, and many others:

“A Game Day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened”
Game Day - AWS well-architected framework

The problem we identified with traditional Game Days is the significant difference in intensity between real production incidents, where there's immense pressure for a quick resolution, and the overly safe, learning-oriented atmosphere of a Game Day. To bridge this gap, we propose a framework for running Game Days that includes competitions between team members. This approach heightens the intensity and amplifies the benefits. Let’s dig in!

Competition Without Losers

We drew inspiration from a completely different field – machine learning, specifically neural networks. One common architecture in this domain is GANs (Generative Adversarial Networks). GANs are known for their ability to generate highly realistic data, such as images, videos, and audio, making them useful for applications like art creation and video game development.

GANs consist of two neural networks:

  • Generator: This network creates fake data that mimics the real data it's trained on.
  • Discriminator: This network evaluates both the real data and the data from the generator, trying to determine which is which.

GANs

The generator aims to fool the discriminator, while the discriminator improves at detecting fakes. At the end of the process, we get two specialized networks: the generator, which creates images that seem real, and the discriminator, which excels at detecting fake data.

We aimed to replicate GANs' core principle: using adversarial dynamics to drive improvement for both sides in the relevant domain. Competition without losers.

High-Stakes Training

We aim to leverage competition to address the lack of intensity in Game Day. This is achieved by pairing team members into disruptor and on-call roles. The goal is to immerse the on-call member in a high-intensity environment during their participation in the game.

As a team, we gather to observe them as they attempt to resolve the simulated incident. While it's undeniably stressful to debug under the watchful eyes of your teammates, it's crucial to experience this pressure without risking production. It's important to note that there are no losers in this exercise, and our intention isn't to frustrate the on-call participant. If the on-call struggles, the team steps in to provide assistance.

Finding the right balance between offering support and letting the on-call manage independently is tricky. We trust the team to navigate this, and with more Game Day experience, they will find that balance.

Safety First

One of the critical challenges we faced was ensuring that these games had no impact on our live production environment while still simulating real-world conditions as closely as possible.

To achieve this, we leveraged our Infrastructure as Code (IaC) capabilities to create a complete clone of our production stack in a separate AWS region. This included everything from DNS configurations to databases, dashboards, and alerts. By doing so, we created an environment that is identical to our production stack but completely isolated, ensuring that any actions taken during the game would not affect our live services. The only differences between the stacks were the data in the databases and the scale of the services.

Additionally, we used our monitoring system to clearly differentiate alerts generated during the game from those in our actual production environment. This allowed us to conduct thorough and realistic game days, sharpening our incident response processes without any risk to our operational stability.
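
One way to differentiate game alerts from production alerts is to tag every alert with its originating environment and route it accordingly. The sketch below illustrates the idea; the channel names and tag schema are assumptions, not Forter's actual monitoring configuration.

```python
# Illustrative sketch: route alerts by an environment tag so Game Day
# alerts reach the players but never page the production on-call.
# Channel names and the tag schema are hypothetical.

def route_alert(alert: dict) -> str:
    """Return the notification channel for an alert based on its env tag."""
    env = alert.get("tags", {}).get("env", "production")
    if env == "gameday":
        return "#gameday-war-room"   # visible to the team, never pages
    return "pagerduty:prod-oncall"   # real incidents page as usual
```

With a scheme like this, the isolated game-day stack emits alerts tagged `env: gameday`, and nothing it does can wake up the real on-call.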

Competitive Game Day Framework

First, we conducted a Knowledge Mapping process where each team member assessed their knowledge level for each of our services.


Next, we scheduled a Game Days Week. Each game focused on one service, pairing a Disruptor and an On-call to simulate incidents. We specifically assigned them services where they identified knowledge gaps:

  • Disruptor: Creatively disrupts the service.
  • On-call: Simulates on-call response during the incident.
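
The pairing step can be sketched as a simple function over the knowledge map: the on-call role goes to whoever has the biggest gap on a service, while the disruptor role goes to whoever knows it best. The names and 1-5 scoring scale below are illustrative assumptions.

```python
# Hypothetical pairing sketch: each member self-assesses knowledge per
# service on a 1-5 scale; the lowest score becomes the on-call (most to
# learn), the highest becomes the disruptor (most creative disruption).

def pair_for_service(service: str, knowledge: dict[str, dict[str, int]]) -> tuple[str, str]:
    """Return (disruptor, on_call) for a service from a knowledge map."""
    ranked = sorted(knowledge, key=lambda member: knowledge[member][service])
    return ranked[-1], ranked[0]

knowledge = {
    "dana": {"payments-api": 5, "sdk": 2},
    "omer": {"payments-api": 1, "sdk": 4},
}
# For "payments-api", dana disrupts and omer is on-call; for "sdk",
# the roles flip, so both close a gap over the week.
```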

As mentioned above, our primary concern was ensuring the game days were as realistic as possible without impacting the production environment. To address this, we added a monitor (Staff Engineer) to oversee the games and ensure they did not inadvertently affect production.


Pregame

Before the game day starts, the disruptor prepares a Disruption Plan, which includes:

  • Setup: Describe the modifications made to code or infrastructure to simulate the incident (e.g., changing security groups to simulate network errors).
  • Disruption: Outline the steps to trigger the alert (e.g., sending predefined requests to service X).
  • Expected Thought Flow: Detail the expected actions and thought process of the on-call during the incident until resolution.
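
A plan like this could be captured as a small structured record that the monitor reviews before approving the game. The field names and example content below are illustrative assumptions, not a prescribed template.

```python
# Illustrative sketch of a Disruption Plan as a reviewable record.
from dataclasses import dataclass

@dataclass
class DisruptionPlan:
    service: str
    setup: list[str]                  # modifications to code or infra
    disruption: list[str]             # steps that trigger the alert
    expected_thought_flow: list[str]  # how the on-call should reason
    approved_by_monitor: bool = False # set True only after review

plan = DisruptionPlan(
    service="payments-api",
    setup=["Tighten a security group to drop traffic to the database"],
    disruption=["Send 100 predefined checkout requests to the service"],
    expected_thought_flow=[
        "See the database-timeout alert",
        "Check the connectivity dashboard",
        "Identify the security-group change and revert it",
    ],
)
```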

The disruptor then verifies the Disruption Plan with the monitor to ensure it's valid, does not impact production, and is reasonable.

Meanwhile, the on-call studies the given service thoroughly to be well-prepared.

Tip-off

At the scheduled time, the entire team gathers to watch the on-call's screen. The disruptor executes the disruption as described in the Disruption Plan. An alert is expected to be triggered by our alerting system immediately.

Game on!

Now, the pressure is on the on-call. They must respond as if handling a real incident while the entire team watches their shared screen (with the meeting recorded for future learning). The only rule for the disruptor is not to assist the on-call. The on-call may seek help from the team, mimicking a real-time production incident. The monitor ensures the on-call doesn't change the production environment during the game day and summarizes relevant action points gathered during the game.


Turning Drills into Skills

One of the key benefits of this framework is the execution of the generated action items. These tasks arise from real-time gaps identified during the simulated incident management. We wouldn't have discovered these tasks through theoretical gap analysis alone. For example, when the disruptor triggers the disruption, they wait for the alert to be triggered. One blind spot we addressed across many games was reducing this waiting time, which effectively reduces the discovery time of a potential incident.

Participants work on these tickets in the days following the game day, ensuring that practical improvements are made based on actual scenarios. This execution benefits us in multiple ways:

  1. Knowledge Spreading: Team members, especially new members, gain a better understanding of the services as they implement the action items.
  2. Proactive Issue Resolution: Addressing actual gaps in our system helps prevent potential future incidents. These gaps might include non-indicative alerts, missing dashboard views, or even the system's vulnerability to specific attacks. By tackling these issues, we enhance our overall stability and reliability.
  3. Incident Management Automation: Automating the manual actions taken during the game ultimately reduces recovery time.

Theory in Action

To give you a sense of how this works, let's dive into an example from one of our Game Days. In this particular game, we focused on our client-side SDK. The disruptor misused our SDK by calling one of its functions with a null value. This misuse caused the system to report these “integration errors” as system errors, triggering an alert for a high error percentage in the client-side SDK.

The on-call's key to recovery was to demonstrate that all these errors came from the same user (client IP) and conclude that there was no actual production impact.

From this game, we derived the following action items:

  • Input Validation: Fix the vulnerability the disruptor exploited, and return a proper error message rather than throwing an exception.
  • Integration/System Errors Differentiation: Set low-severity alerts for integration errors and high-severity alerts for system errors.
  • Error % by Heavy Users View: Add a view for errors by heavy users to the client-side SDK dashboard, useful during integration periods.
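
The first two action items could look something like the sketch below: validate input at the SDK boundary and classify failures as integration errors (the caller's mistake) versus system errors (our bug), so each can alert at its own severity. The function and field names are hypothetical, not Forter's actual SDK API.

```python
# Hedged sketch: reject SDK misuse with a clear message instead of
# raising, and tag each failure as "integration" or "system" so the
# two error classes can alert at different severities.

def track_event(event_name):
    """Hypothetical SDK entry point with input validation."""
    if event_name is None:
        # Integration error: the caller misused the SDK.
        return {"ok": False, "kind": "integration",
                "message": "event_name must not be None"}
    try:
        # ... real event-reporting logic would go here ...
        return {"ok": True, "kind": None, "message": ""}
    except Exception:
        # System error: our own code failed.
        return {"ok": False, "kind": "system", "message": "internal error"}
```

With this split, a burst of `integration` errors from one misbehaving client no longer inflates the system-error rate that pages the on-call.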

Continuous Improvement

Over time, your software services will evolve, and your team will need to refresh their on-call skills. Performing Competitive Game Days regularly helps spread knowledge across the team, enhance on-call experience, and generate actionable insights for continuous improvement of your system.

We recommend scheduling a Game Days Week once a quarter, with a tight schedule of 5-6 games. Both on-calls and disruptors need preparation time, so plan accordingly.

By mastering this practice, you can handle incidents more effectively, leading to higher availability and reliability, which is crucial for any critical service operation.