The Chaos Experiment

By Sean Atkinson, Chief Information Security Officer

CISO blog

As development operations methods mature, teams introduce new elements that are designed to break functionality rather than to confirm it. The chaos principles define a method for introducing the unknown into the process of development operations. One helpful strategy is to define the baseline (the steady state of the operational process or distributed system) so that you can gauge the effect of chaos.

Since introducing chaos to a system is similar to a scientific experiment, security practitioners should define a control group and an experimental group.
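To make the baseline and the two groups concrete, here is a minimal Python sketch. It assumes the steady state can be summarized by a single metric (for example, requests per second) and treats anything within a few standard deviations of the baseline mean as "steady." The function names, the host names, and the tolerance value are illustrative assumptions, not part of any chaos tooling.

```python
import random
import statistics


def steady_state_band(baseline_samples, k=3.0):
    """Return (low, high): the baseline mean +/- k sample standard deviations.

    Assumption: one numeric metric is enough to describe the steady state.
    """
    mean = statistics.mean(baseline_samples)
    spread = statistics.stdev(baseline_samples)
    return mean - k * spread, mean + k * spread


def split_groups(hosts, experiment_fraction=0.5, seed=0):
    """Randomly assign hosts to (control, experimental) groups.

    A fixed seed keeps the assignment reproducible between runs.
    """
    shuffled = list(hosts)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * experiment_fraction)
    return shuffled[cut:], shuffled[:cut]


if __name__ == "__main__":
    # Hypothetical baseline readings and host names, for illustration only.
    band = steady_state_band([100, 102, 98, 101, 99], k=2.0)
    control, experimental = split_groups([f"host-{i}" for i in range(10)], 0.3)
    print("steady-state band:", band)
    print("control:", control)
    print("experimental:", experimental)
```

Only the experimental group receives the disruption; the control group shows what the same period looked like without it.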

Introducing chaos

Start with chaos – the introduction of methods to disrupt a steady state. Can you introduce elements into stable systems that will cause a deviation in service? This will help you identify areas of weakness in:

  • System continuity
  • Service resilience
  • Response actions

Collecting measurements in such an experiment is key. When introducing potential elements of chaos, can you define any change, variation, or percentage of downtime experienced in the experiment versus the control?
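As one possible way to answer that question, the sketch below compares the percentage of downtime in the two groups. It assumes each group is observed through periodic health checks recorded as booleans (True meaning healthy); that recording format and the function names are assumptions made for illustration.

```python
def downtime_percent(health_checks):
    """Percentage of failed checks in a list of booleans (True = healthy)."""
    if not health_checks:
        raise ValueError("no health checks recorded")
    failures = sum(1 for ok in health_checks if not ok)
    return 100.0 * failures / len(health_checks)


def compare_groups(control_checks, experiment_checks):
    """Report downtime for each group and the difference in percentage points."""
    control = downtime_percent(control_checks)
    experiment = downtime_percent(experiment_checks)
    return {
        "control_downtime_pct": control,
        "experiment_downtime_pct": experiment,
        "delta_pct_points": experiment - control,
    }


if __name__ == "__main__":
    # Hypothetical observations: 100 checks per group, 5 failures under chaos.
    report = compare_groups([True] * 100, [True] * 95 + [False] * 5)
    print(report)
```

A delta near zero suggests the disruption did not move the system away from its steady state; a large delta points to a weakness worth investigating.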

Measuring test results

First stage – “A point in time”

By introducing an element of chaos and measuring its effects on the system, you’ll be able to understand and improve service continuity for the end user. A positive result, meaning the steady state holds under disruption, is a good outcome for your first experiment, but it confirms service continuity only at a point in time.

Second stage – “Continuous chaos”

The next stage of the experiment is continuous chaos. This creates a process that can be used to continuously test system continuity. At any stage, the question to be answered is not what allowed the steady state to remain at control levels, but whether you have verification of the system’s operation and effectiveness.

The same chaos experiment run repeatedly will provide a level of confidence in the system. Introducing continuous chaos is not a measure of overall resilience, however. To measure resilience, you must apply new tests and new combinations of tests that illustrate the capability of a system.
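One simple way to generate "new combinations of tests" is to enumerate every combination of the disruptions you already know how to inject. The sketch below does this with Python's standard itertools module. The fault names are hypothetical placeholders; a real catalog would come from your own environment.

```python
from itertools import combinations


def fault_combinations(faults, max_size=2):
    """List every combination of faults containing 1..max_size faults."""
    plans = []
    for size in range(1, max_size + 1):
        plans.extend(combinations(faults, size))
    return plans


if __name__ == "__main__":
    # Hypothetical fault catalog, for illustration only.
    catalog = ["network_latency", "disk_full", "process_kill"]
    for plan in fault_combinations(catalog, max_size=2):
        print(plan)
```

The number of plans grows quickly with the size of the catalog, so max_size is worth keeping small at first.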

Third stage – “Ultimate experiment”

Now, the ultimate experiment: run against a production data set. Test and development environments are configured to mimic production, but the only true test is on production itself. “Production” and “chaos” are rarely used in the same sentence; here, putting them together is a necessity. A chaos experiment on production allows you to measure a system’s traffic, patterns of use, resource utilization, and practical application. This information can help build resilience within the system and improve cyber defenses.

Containment criteria

While adding chaos to stable systems can yield results and identify incident reduction factors, practitioners still have to deal with the reality of appropriate testing. No cloned or mirrored environment can effectively test the full gamut of potential external threats. In these cases, you should focus on containment. Even when production is the target of chaos, those maintaining the systems must make sure that effects are not catastrophic and that the impact on the end user is minimal. Testing and experimentation allow control group and experiment group conditions to be set and issues to be tracked. Then, the system response team can monitor and adjust to reduce chaos factors.
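Containment can be expressed as an abort condition: inject the disruption, watch a health metric, and roll back as soon as the metric crosses a limit. The following is a minimal sketch under that assumption. The inject, rollback, and read_error_rate callables, along with the 5% threshold, are hypothetical stand-ins for whatever your own tooling provides.

```python
import time


def run_with_containment(inject, rollback, read_error_rate,
                         max_error_rate=0.05, checks=10, interval_s=0.0):
    """Inject a fault, poll an error-rate metric, and always roll back.

    Returns "aborted" if the error rate exceeds max_error_rate during the
    run, otherwise "completed". The rollback runs in both cases.
    """
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "aborted"  # containment limit exceeded
            time.sleep(interval_s)
        return "completed"
    finally:
        rollback()  # restore the steady state no matter how the run ended


if __name__ == "__main__":
    # Hypothetical readings standing in for a real monitoring system.
    readings = iter([0.01, 0.02, 0.09])
    outcome = run_with_containment(
        inject=lambda: print("fault injected"),
        rollback=lambda: print("fault rolled back"),
        read_error_rate=lambda: next(readings),
        checks=3,
    )
    print(outcome)
```

Placing the rollback in a finally block means the disruption is withdrawn even if the monitoring call itself fails.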

Is running a chaos experiment right for my team?

Whether or not to run a chaos experiment comes down to how effectively the test can determine resilience with minimal impact on day-to-day operations. The question becomes, “Is a small amount of pain worth a stronger long-term system life?” The answer will be different for every organization. It depends upon factors such as the resources devoted to the experiment and the maturity of the system. In general, chaos experiments require:

  • a team to monitor the conditions,
  • specialists to administer the chaos, and
  • leaders to acknowledge the remedy.

When effective, chaos experiments can act like a “stress test” for systems. From a monitoring perspective, practitioners are like doctors examining a patient. You have the baseline, you know the regular pulse of the system, and through a continuous chaos test you have more focus on the health of the system. In effect, you become the cardiothoracic surgeon, paying particular attention to small variations in order to establish stability and remedy issues.

Questions for the reader

  • This blog post focused on the security and development operations perspective. Aligned to principles of recovery and monitoring, are the results of chaos experiments worth the effort?
  • Is “chaos” already running the moment a system is initialized and made available? That is to say, are the users all the chaos we would ever need?
  • Continuous Integration, Continuous Deployment, Continuous Monitoring, and Continuous Chaos: Are all these processes needed to provide excellent operational services, or are they just another set of instructions to work faster/smarter?