Automated Chaos Testing on the Front-end
- Summary: Chaos engineering improves the resilience of a software system by simulating failures and measuring their impact. The web and mobile communities, however, have paid little attention to applying it to the front-end. Twitch wanted to know how its front-end behaves, and what end users see, when part of the overall system fails. The team used Chaos Mode, which simulates failures on GraphQL calls, and found they needed two more things: a way to discover which services should be forced to fail, and more useful information from each test run. They combined tracing with chaos tests to calculate a resilience score for each test, and they plan to improve Chaos Mode and service-call tracing to address current limitations.
Simulating System Failures
- Summary: Twitch's data is served by hundreds of interdependent microservices, and the most common failure is a microservice erroring out and failing to serve its portion of the data. Chaos Mode passes an extra header with GraphQL calls to simulate such failures at the GraphQL resolver layer. To discover which services should be forced to fail, the team traced GraphQL calls from the client, recorded which internal services were hit during a given user flow, and used that list to drive Chaos Mode tests. They also extracted more useful information from test runs, such as which service caused the failure and which API call was affected, and used it to calculate a resilience score for each test.
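The header-based mechanism and the traced service list can be illustrated with a minimal TypeScript sketch. It is not Twitch's implementation: the header name `X-Chaos-Fail-Services`, the comma-separated value format, and the helpers `buildChaosRequest` and `planChaosRuns` are assumptions made for illustration, since the summary only says that an extra header is passed with GraphQL calls.

```typescript
// Hypothetical header name; the source only mentions "an extra header".
const CHAOS_HEADER = "X-Chaos-Fail-Services";

interface GraphQLRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build a GraphQL request description that asks the resolver layer
// to simulate a failure for each listed service (assumed format: CSV).
function buildChaosRequest(
  endpoint: string,
  query: string,
  failServices: string[],
): GraphQLRequest {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
  };
  // Without services to fail, the request is an ordinary GraphQL call.
  if (failServices.length > 0) {
    headers[CHAOS_HEADER] = failServices.join(",");
  }
  return {
    url: endpoint,
    method: "POST",
    headers,
    body: JSON.stringify({ query }),
  };
}

// Turn the services recorded while tracing a user flow into a test plan:
// one chaos run per distinct service, failing that service alone.
function planChaosRuns(tracedServices: string[]): string[][] {
  const unique = tracedServices.filter(
    (service, index) => tracedServices.indexOf(service) === index,
  );
  return unique.map((service) => [service]);
}
```

Failing one service per run keeps each result attributable to a single dependency. Whether the real tests fail services one at a time or in combination is not stated in the summary.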
Conclusion
- Summary: Using a dashboard built on the per-test resilience scores from chaos testing, Twitch identified a large gap between the Android and iOS scores. The team plans to improve Chaos Mode and service-call tracing to address current limitations. Chaos engineering is essential for improving the resilience of a software system, and that includes making the front-end as resilient as possible.
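The summary does not define how the resilience score is computed. As a sketch under that caveat, the TypeScript below assumes the score is the percentage of chaos runs in which the UI test still passed with a failure injected, and groups it by platform as a dashboard might. The `ChaosRun` shape and the `resilienceScore` and `scoreByPlatform` helpers are hypothetical.

```typescript
type Platform = "android" | "ios" | "web";

// One chaos run: a single service was forced to fail during a UI test.
interface ChaosRun {
  platform: Platform;
  service: string;        // service that was forced to fail
  apiCall: string;        // GraphQL operation affected by the failure
  uiTestPassed: boolean;  // did the UI test still pass?
}

// Assumed formula: percentage of runs that survived the injected failure.
// Returns null when there are no runs, so "no data" is not read as a score.
function resilienceScore(runs: ChaosRun[]): number | null {
  if (runs.length === 0) return null;
  const passed = runs.filter((run) => run.uiTestPassed).length;
  return Math.round((passed / runs.length) * 100);
}

// Group runs by platform and score each group separately.
function scoreByPlatform(runs: ChaosRun[]): Record<string, number | null> {
  const grouped: Record<string, ChaosRun[]> = {};
  for (const run of runs) {
    const bucket = grouped[run.platform] || [];
    bucket.push(run);
    grouped[run.platform] = bucket;
  }
  const scores: Record<string, number | null> = {};
  for (const platform of Object.keys(grouped)) {
    scores[platform] = resilienceScore(grouped[platform] || []);
  }
  return scores;
}
```

Keeping `service` and `apiCall` on every run means a low score can be traced back to the specific dependency and call that broke the flow, which matches the extra information the team extracted from test runs.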