Averting the Software Apocalypse
By Curt Dukes, Executive Vice President, Security Best Practices Automation Group at CIS®
Are all the recent outages at the major cloud providers the tremors before “The Coming Software Apocalypse” warned of in the Atlantic article? The author identified several ongoing efforts at companies and in academia that could help stave off this event. Chris Newcombe and the other Amazon software engineers surprised us by choosing formal methods so we decided to delve into their technical report on the “Use of formal methods at Amazon Web Services.”
The performance increase on our data backbones has lured us into a new form of consolidation—cloud computing. Companies keep trying to shed more overhead and outsource to 3rd parties any service that allows them to reduce payrolls and avoid recurring upgrade costs. Bully for them. But risks tend to shift somewhere else, not go away. Everything as a service implies the risk of single points of failure of a magnitude that we might not fully comprehend until there is a sustained outage with significant loss of services and data.
Formal methods, has the reputation of not scaling to complex problems. The quote from Newcombe cited by the Atlantic article sums up the problem of hyper-consolidation very succinctly, "human intuition is poor at estimating the true probability of supposedly ‘extremely rare’ combinations of events in systems operating at a scale of millions of requests per second.” We have exceeded the grasp of what we can model on a whiteboard. As an unabashed computer scientist, I find it startling that formal methods might be part of the solution.
Newcombe and Amazon sorted through several formal modeling methods until they found the specification language they needed in TLA+. In effect, TLA+ translates design requirements into a type of formal logic that is a “Temporal Logic of Actions.” In other words, requirements expressed in this language can calculate the effects of stepping through an algorithm. Actually, Chris coined a better term for what TLA+ offers when they found they could create more converts by calling it “exhaustively testable pseudo-code.” This pseudo code can be exercised to verify that the system will still be able to do what it is supposed to do as tweaks are made for better performance. It is actually more than a little scary; TLA+ allowed Amazon to “make aggressive optimizations to complex algorithms without sacrificing quality.” By quality they meant safety.
As for security, Amazon states they still do all the usual things such as static code analysis and fault-injection testing. Time will tell whether the tools that allow them to make fearless choices for performance will result in a more robust or more fragile system when assailed by a malicious party. Those who worry ‘more fragile’ can point to the authors’ observations that they cannot model system properties that can cause performance to degrade. What happens if an intruder ever manages to slip in unnoticed?
So, what might TLA+ offer other software development efforts? For those who do not need all the flexibility of TLA+, PlusCal exists, which is a lighter weight version of TLA+ retains most of the virtues of TLA+ but allows a programmer to start creating rigorous pseudo-code more quickly without having to understand everything about how the TLA+ modeling works.
The Amazon team pointed out many advantages of using the TLA+ modeling language including:
- Having programmers better understand the problem they are solving
- The creation of rigorous documentation that should allow easier porting and code revisions over time
- The flexibility to add features with the confidence that the original software functions will still perform correctly
- The means to reduce complexity, which can improve reliability and security
Those are virtues needed by any critical service. Are they being achieved with current best practices?
Amazon provided many examples to make the case that TLA+ has merit and can discover critical logic flaws in code. Who else should be trying these tools? Which other cloud provider is using similar tools or which other critical infrastructure is giving TLA+ or PlusCal a chance? The Atlantic opened their article with a warning from Nancy Leveson, a MIT professor, “The problem is that we are attempting to build systems that are beyond our ability to intellectually manage.” Our question is will we at least be wise enough to try new tools within our reach and attempt to avert the coming software apocalypse?