Order out of Chaos

Nine Princes in Amber, cover, via OpenLibrary.org

My dad didn’t particularly like Roger Zelazny’s work. He thought it was a bit too “weird” for his taste.

This would have been sometime in the 1980s, and I couldn’t have disagreed more. At the time, I was inhaling a lot of traditional “hard” science fiction by authors like Robert A. Heinlein and Isaac Asimov, while also reading a lot of fantasy. Then I encountered Roger Zelazny’s work, including the Chronicles of Amber.

I loved it! I guess it’s fair to say that I loved it partly for the fact that it WAS “weird”, and for the way that Zelazny often blurred the lines between science and fantasy. In the case of Amber, there was always another level of reality beneath the “shadows”, and the real world was really a cloud of possibilities between order and chaos.

The idea of “order out of chaos” is not a new one, but I was a bit surprised that searching for “order out of chaos origin” brought up a bunch of sites dedicated to Freemasonry. The Latin phrase “Ordo ab Chao” is described on Wikipedia as being “one of the oldest mottos of Craft Freemasonry”. I wasn’t able to trace it back any further than that, which I found very interesting, as I had expected the phrase to be Classical.

In any case, this is where the monkeys come in.

“Ford!” he said, “there’s an infinite number of monkeys outside who want to talk to us about this script for Hamlet they’ve worked out.”

Douglas Adams, The Hitchhiker’s Guide to the Galaxy

This, of course, refers to something known as the “infinite monkey theorem”, which states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type any given text, including the complete works of William Shakespeare.
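To get a feel for why “infinite” matters here, the arithmetic can be sketched in a few lines of Python. This is a simplified model, not the theorem itself: it assumes every key is equally likely, and it treats the monkey's output as a series of independent attempts, each exactly as long as the target text. The function name `chance_of_typing` and its parameters are made up for this illustration.

```python
import math


def chance_of_typing(text_length: int, keys: int, attempts: int) -> float:
    """Chance that at least one of `attempts` independent random blocks of
    `text_length` keystrokes matches a given text exactly.

    Assumes a keyboard with `keys` (at least 2) equally likely keys.
    """
    # Chance that a single block of keystrokes matches the text.
    p_single = keys ** -text_length
    # 1 - (1 - p_single) ** attempts, computed in a way that stays
    # accurate even when p_single is extremely small.
    return -math.expm1(attempts * math.log1p(-p_single))


# Six letters ("hamlet") on a 26-key keyboard, for a growing number of tries.
for attempts in (1, 10**6, 10**9, 10**12):
    print(attempts, chance_of_typing(6, 26, attempts))
```

The chance of success on any single attempt is tiny, but the chance of never succeeding shrinks toward zero as the number of attempts grows, which is the heart of the “almost surely” in the theorem. For anything as long as an actual play, the numbers stay indistinguishable from zero for any realistic number of attempts.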

One of the first tech monkeys appears to have been called simply “Monkey”. It was developed by Steve Capps in 1983 to generate random user interface events at high speed on the Apple Macintosh, in order to support debugging the MacWrite and MacPaint programs. While I was not able to find a source to confirm my hypothesis, it seems highly likely (near-certain, in my opinion) that the name was a direct reference to the “Shakespeare” monkeys.

This may be one of the first instances in computer science of what is now called “Chaos Engineering”, which focuses on testing and achieving system resilience by generating input, errors, or failure situations in an attempt to identify and address weaknesses in the system.

As the original Monkey appears to have focused on user interface events, it could probably also be called “fuzzing”, though later tools looked at “failures” in a more generic sense. In particular, Netflix created Chaos Monkey in 2011 to test the resilience of their IT infrastructure. While it certainly seems to be a descendant of Apple’s Monkey, or Shakespeare’s monkeys, it is far more disruptive. In contrast to the others, Chaos Monkey disables computers in a production network to test how the remaining systems respond to the outage.
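The basic idea can be illustrated with a toy simulation. To be clear, this is not Netflix's code and has nothing to do with how the real Chaos Monkey is built: the `Service` class, the `unleash_monkey` function, and the instance names are all invented for this sketch, and nothing here touches a real network.

```python
import random
from typing import Optional


class Service:
    """A toy service backed by several interchangeable instances."""

    def __init__(self, instance_names):
        self.healthy = set(instance_names)

    def handle_request(self) -> bool:
        # A request succeeds as long as at least one instance is still up.
        return len(self.healthy) > 0


def unleash_monkey(service: Service, rng: random.Random) -> Optional[str]:
    """Terminate one randomly chosen healthy instance, if any remain."""
    if not service.healthy:
        return None
    # Sort first so that a seeded random generator gives repeatable picks.
    victim = rng.choice(sorted(service.healthy))
    service.healthy.discard(victim)
    return victim


rng = random.Random(2011)
service = Service(["web-1", "web-2", "web-3"])
victim = unleash_monkey(service, rng)
print(f"terminated {victim}; still serving: {service.handle_request()}")
```

With three instances, the service survives the loss of any one of them. The interesting question, and the one a real chaos experiment is meant to answer, is whether the surviving systems actually behave the way the design says they should.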

“Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.”

Chaos Monkeys, Antonio García Martínez

The failure of a single component of a system can have a catastrophic impact, as Rogers Communications discovered in 2022. Deliberately introducing failures into a system allows a team to understand the system-wide impact of different types of failure, and to develop options for handling or mitigating that impact.

In any case, Chaos Monkey was quite successful for Netflix, as demonstrated by the fact that it was only the first of the “Simian Army” suite of tools. While Chaos Monkey is quite disruptive, it operates at a relatively small scale. In contrast, “Chaos Gorilla” simulates the loss of an entire AWS (Amazon Web Services) Availability Zone (i.e., one or more data centers within a Region), while “Chaos Kong” simulates the loss of an entire “Region” (i.e., one of the 33 geographic areas that form the highest level of AWS hosting globally).

So, we have monkeys designed to be disruptive, with a frequent focus on randomizing the occurrence of specific events, such as lost connectivity or server failure. What’s next?

AI, of course!

AI can be used to analyze systems and test results, but it could also be used to scan systems for potential weaknesses, dynamically adjust testing to get the best results, or even tailor tests for specific environments.

Future AI monkeys could scan a system, identify tests from a standard library to run, develop or customize additional tests, decide when, where, and how to run tests, and use the results to improve the effectiveness of future tests. This has enormous potential for improving not only resilience, but also security and system efficiency, by identifying and addressing issues and vulnerabilities before they become problems.

And all of this could potentially be done autonomously, without human interaction.

What could possibly go wrong?