“Chaos Engineering” is defined by the experts at PrinciplesofChaos.org as:
“The discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
When we talk about building confidence in a system, we’re usually referring to developer-centric testing. Unit tests and integration tests with specific inputs and parameters should lead to expected and repeatable results.
For example, you might have a test routine which calls an authentication function and passes in a token. We already know if this token is valid or not, and we have an expectation for the function to either pass or fail based on this fact. These tests ensure the expected outcome is met. However, how many production outages have you experienced as a result of something you expected? This is where Chaos Engineering comes in.
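To make the contrast concrete, a deterministic test might look like this in Python. `validate_token` and the token values here are hypothetical stand-ins for a real authentication function:

```python
# A deterministic, developer-centric test: the input and the expected result
# are fixed up front. `validate_token` is a hypothetical stand-in for a real
# authentication function.
KNOWN_VALID_TOKENS = {"tok-123"}

def validate_token(token: str) -> bool:
    """Return True only for tokens we have issued."""
    return token in KNOWN_VALID_TOKENS

def test_validate_token():
    assert validate_token("tok-123") is True   # known-good token passes
    assert validate_token("tok-999") is False  # unknown token fails

test_validate_token()
```

The inputs and outputs are known in advance, which is precisely what chaos experiments are not.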
Chaos Engineering is a new, additional resource in the confidence tool belt that works by injecting randomness.
This means that we are working with various unknowns and hypothesising outcomes. For this reason, Chaos Engineering should not be considered another test routine, but rather an experiment—we have a hypothesis, a control, a test, and hopefully a conclusion.
The core principles of chaos are:
- Start by defining ‘steady state’ as a measurable output of a system that indicates normal behaviour.
- Hypothesise that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real-world events, such as running out of memory, disk space saturation, network latency, failovers, and so on.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
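The last two principles boil down to comparing a measurable output between the two groups. A minimal Python sketch, with invented response-time samples and an arbitrarily chosen tolerance:

```python
import statistics

def steady_state_holds(control_ms, experiment_ms, tolerance=0.10):
    """Return True if the experimental group's mean response time stays
    within `tolerance` (as a fraction) of the control group's mean.

    Drift beyond the tolerance is evidence against the hypothesis that
    steady state continues under the injected variable."""
    control_mean = statistics.mean(control_ms)
    experiment_mean = statistics.mean(experiment_ms)
    drift = abs(experiment_mean - control_mean) / control_mean
    return drift <= tolerance

# Control group behaves normally; the experimental group had a fault injected.
control = [120, 118, 125, 122, 119]
degraded = [300, 280, 310, 295, 305]

print(steady_state_holds(control, control))   # steady state holds
print(steady_state_holds(control, degraded))  # hypothesis disproved
```

A real experiment would compare far richer signals than a single mean, but the shape is the same: hypothesis, control, test, conclusion.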
Monitoring & Observability
As with any experiment, we need to start by defining our control as well as tooling to effectively establish what a ‘steady state’ actually is. Monitoring and observability are key to being able to safely inject chaos into your environment—having reliable and trustworthy data is crucial.
AWS provides its CloudWatch monitoring service by default, and many services automatically integrate with it.
Typical CloudWatch views to capture include:
- CPU utilisation and other service metrics
- RDS queries per second
- App response times
- App error rates
Using the above tools means that we can start to define some “normals” for the system. We can identify an average response time, queries per second, CPU utilisation spread, and a bunch of other metrics.
These together define the state during day-to-day operations and, as a result, allow us to conclude what a steady state looks like.
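As a rough sketch of turning raw samples into a “normal”, here is a simple mean plus/minus two standard deviations band in Python. The response times are invented, and a real baseline would use far more data (and likely percentiles rather than a symmetric band):

```python
import statistics

def baseline(samples):
    """Summarise a metric series as a steady-state band: mean plus or minus
    two standard deviations covers the vast majority of normal samples."""
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    return {"mean": mean, "low": mean - 2 * stdev, "high": mean + 2 * stdev}

def is_steady(value, band):
    """True if a new sample falls inside the steady-state band."""
    return band["low"] <= value <= band["high"]

# Hypothetical day-to-day response times (ms) pulled from CloudWatch.
response_times = [110, 120, 115, 118, 112, 125, 117, 119]
band = baseline(response_times)
print(is_steady(118, band))  # within the normal band
print(is_steady(400, band))  # well outside steady state
```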
Once we have effectively established a steady state, we can start to spot correlations between resources. For example, a higher CPU spread may coincide with an increase in database connections and queries.
In a well-architected environment where we have dynamic compute scaling, you should not notice much (if any) drift in response times. AWS enables you to deliver a consistent end user experience through these mechanisms. These can also be used to create a self-healing architecture.
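One simple way to quantify the correlation spotting mentioned above is a Pearson coefficient between two metric series. A minimal sketch with invented CPU and database-connection samples:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series.
    Values near 1.0 mean the metrics rise and fall together."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical samples taken at the same timestamps.
cpu = [20, 25, 30, 45, 60, 75]              # CPU utilisation (%)
db_connections = [5, 6, 8, 12, 18, 22]      # open RDS connections

print(pearson(cpu, db_connections))  # strongly positive: the metrics move together
```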
Hypothesis & Control
I mentioned earlier how chaos engineering is a tool for building confidence in a production system. However, I would not advise diving in and injecting chaos into your live environments straight away.
One of the great benefits of AWS is that everything is an API. All operations you conduct within the AWS console can (and absolutely should) be handled programmatically. This reduces repetition and configuration drift, whilst also allowing you to design workflows and lifecycles for infrastructure, just as you would with software.
Utilising repeatable infrastructure as code and the ad-hoc nature in which AWS services can be created means you will be able to rapidly deploy identical, isolated versions of the same infrastructure. For now, the precious production environment you have spent many hours caring for can be your control, and you can experiment in isolation, deploying and destroying the test infrastructure at will and only paying for the time you are conducting experiments.
With a control and steady state in place, we can now start to hypothesise. This essentially comes down to you and your ops, development and management teams agreeing on a set of expectations for the platform and end user experience. These can and should be aligned with your organisation's SLIs and SLOs (which I will be talking about in another post). For example:
- The platform should be resilient to a single instance or AZ failure
- End user response time should be no greater than 2000ms for 99% of requests
- The Frontend React site should not be impacted by a database outage
In a typical Well-Architected three-tier application stack, these should all be easily achievable expectations and a solid foundation for exploring chaos engineering. You can adjust response time requirements by reviewing the baseline defined as your steady state.
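The response-time expectation can be checked mechanically. For instance, a nearest-rank 99th-percentile calculation in Python (the sample data is invented):

```python
def p99(latencies_ms):
    """99th-percentile latency using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, round(0.99 * len(ordered)) - 1)
    return ordered[rank]

# 99 fast requests and one slow outlier: the outlier sits beyond p99,
# so the 2000 ms target for 99% of requests is still met.
samples = [150] * 99 + [5000]
print(p99(samples) <= 2000)  # True
```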
To recap, at this point:
- we have a testing environment which replicates production
- we have implemented and are familiar with CloudWatch metrics
- we have defined a steady state and set expectations for the platform
Starting the Chaos
We are now ready to start injecting variables into the state in order to test our hypothesis.
The most well-regarded tool in this sector is Netflix's infamous Chaos Monkey. Chaos Monkey integrates with the continuous delivery platform Spinnaker, and it is a fantastic tool for experimentation; if you are currently using that stack, it is well worth checking out.
A lightweight alternative that is low cost, serverless and targets EC2 and ECS workloads is the little-known Chaos Lambda: https://artillery.io/chaos-lambda/.
As the name suggests, this is a Chaos Monkey-like serverless toolset that runs in AWS Lambda. Chaos Lambda can be configured to run at regular intervals and terminate instances in Auto Scaling groups at random. We can then observe our testing environment in CloudWatch and cross-reference it with our defined steady state.
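The selection step of such a tool can be illustrated in a few lines of Python. `pick_victims` is a hypothetical helper, not Chaos Lambda's actual code, and a real tool would follow it with an EC2 TerminateInstances call:

```python
import random

def pick_victims(instance_ids, count=1, rng=random):
    """Randomly choose instances from an Auto Scaling group to terminate.
    This mimics the selection step of a Chaos Monkey-style tool; the real
    termination would be a separate EC2 TerminateInstances call."""
    return rng.sample(instance_ids, min(count, len(instance_ids)))

asg_instances = ["i-0a1", "i-0b2", "i-0c3", "i-0d4"]
victims = pick_victims(asg_instances, count=1)
print(victims[0] in asg_instances)  # True: the victim always comes from the group
```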
With this simple test, we can be confident that our platform is resilient to instance failure and observe the impact on end user response time. The auto-scaling and self-healing properties offered by AWS enable us to achieve these results, unlike more traditional hosting.
We can also use manual interventions to test specific scenarios. For example, we can trigger an RDS or Elasticsearch failover, saturate RDS connections, or simulate a large volume of HTTP requests, all the while referring back to our observability tools and baselines. This provides the confidence that you won't have to wake up at 3am on a Sunday.
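For instance, an RDS Multi-AZ failover can be forced by rebooting the primary with `ForceFailover=True`, a real parameter of the RDS RebootDBInstance API. A minimal sketch (the function name is mine, and the client is injected so the function can be exercised against a stub without touching AWS; in practice you would pass `boto3.client("rds")`):

```python
# Sketch only: force a Multi-AZ RDS failover by rebooting the primary.
# `ForceFailover` is a real parameter of the RDS RebootDBInstance API.
def trigger_rds_failover(rds_client, db_instance_id):
    """Reboot the given DB instance with a forced failover to its standby."""
    return rds_client.reboot_db_instance(
        DBInstanceIdentifier=db_instance_id,
        ForceFailover=True,
    )
```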
Chaos engineering is an excellent and exciting new resource for validating a platform and building confidence. It is also perfectly suited to stressing a platform and pushing it to breaking point, which you should try to do regularly.
With this powerful information you can build runbooks, identify and remediate any reliability issues (before they occur in your live environment) and ultimately reduce the blast radius of any unforeseen service issues; let's be honest, they are very rarely “foreseen”.
Of course, not everyone is ready to implement Chaos straight away and a strong DevOps culture is key. Once you build confidence in a system and start to introduce chaos in a live environment, having effective communication and shared responsibilities and values between teams is crucial.