Man-O-War (Pt. I) - Kubernetes Chaos Tools

Introduction

Kubernetes has maintained a position in the spotlight of cloud infrastructure ever since Google donated it to the CNCF in 2015. In a search to strengthen the robustness of my personal Kubernetes clusters I experiment on, I looked for an open-source suite of chaos engineering tools. Gremlin is a firm founded by some of the engineers at Netflix who originally coined the term ‘chaos engineering’, and their product certainly has the most complete set of features, but in terms of open-source projects there were four major options to consider:
If you’re interested, take a look at my notes I collected when I looked through each of their repositories. The bottom line was that none of them were particularly well-suited for bare-metal Kubernetes implementations with the amount of flexibility I wanted, so I began to work on my own project. Once I finish adding documentation to the codebase I will add the code to GitHub and write up a tutorial on use-cases here.
::::: Kubernetes Chaos Tools Review :::::

					::: OVERVIEW :::

[Chaos Engineering] - intentionally injecting attacks or otherwise breaking components of a distributed system to improve
		      fault-tolerance and redundancy

[Kubernetes Objectives] -
~ Identify threats, bugs, flaws in our Kubernetes infrastructure
	~ Aggregate alerts collected by Prometheus Operator
	~ Read through other corporate outage reports and Kubernetes/cloud post-mortems
	~ Utilize a Chaos Monkey inspired tool to uncover new weaknesses

~ Measure the impact of each system weakness
	~ How significant is failure in clusters?
	~ How available is Kubernetes? How durable on average?
	~ How do internet-facing applications introduce new weaknesses?

~ Neutralize weaknesses in categorical order:

        --------|--------
        |       |       |
Known   |  [1]  |  [2]  |
        |       |       |
        --------|--------
        |       |       |
Unknown |  [3]  |  [4]  |
        |       |       |
        --------|--------

         Known    Unknown

	~ [1] Known Knowns - something we are aware of and understand
		~ {1.a.} High CPU Usage / CPU Throttling
		~ {1.b.} Version Upgrade Issues
		~ {1.c.} OS Kernel Incompatibilities
		~ {1.d} Rebooted master node (alert fired)

	~ [2] Known Unknowns - something we are aware of but don't understand
		~ {2.a.} CPU Usage Variability
		~ {2.b.} 100% of kube-proxy Targets Down
		~ {2.c.} Random Pod Restarts

	~ [3] Unknown Knowns - something we understand but aren't aware of
		~ {3.a.} Total Time to Restart Clusters
		~ {3.b.} Total Time to Rollback Kernel on All Clusters
		~ {3.c.} Performance with Random Pod Termination

	~ [4] Unknown Unknowns - something we don't understand and aren't aware of
		~ {4.a} Behavior with 100% Network Saturation
		~ {4.b} Behavior with Complete Cluster Shutdown
		~ {4.c} Behavior with High CPU Saturation



					::: TOOLS :::

  [Chaos Kube]
~ Periodic and arbitrary termination of pods in a cluster
~ Link: https://github.com/linki/chaoskube

  -------------------------------------------------------------------------------
 [              Advantages              |           Disadvantages               ]
  -------------------------------------------------------------------------------
  | * Easy installation with Helm       | * Pod termination is random by        |
  |                                     |   default which prevents fine-tuned   |
  | * Time-related arguments in chart,  |   without custom helmcharts           |
  |   probably the simplest to setup    |                                       |
  |                                     |                                       |
  | * Filtering via label, annotation,  |                                       |
  |   namespace, minimum age in CLI     |                                       |
  |                                     |                                       |
  -------------------------------------------------------------------------------


  [Kube Monkey]
~ Terminates pods based on an opt-in system per k8s application
~ Link: https://github.com/asobti/kube-monkey

  -------------------------------------------------------------------------------
 [              Advantages              |               Disadvantages            ]
  -------------------------------------------------------------------------------
  | * Opt-in system is good default     | * Environment variables written in    |
  |                                     |   .toml file at /etc/kube-monkey      |
  | * Customizable schedule per         |                                       |
  |   deployment for pod killing        | * Uses glog (Google Logging Library)  |
  |                                     |                                       |
  | * Kill options are fine-grain, e.g. |                                       |
  |   kill all, fixed #, % range        |                                       |
  |                                     |                                       |
  | * Each opted-in attribute specified |                                       |
  |   in .yaml as metadata labels       |                                       |
  |                                     |                                       |
  | * Docker image available            |                                       |
  |                                     |                                       |
  | * Can be run manually as k8s app    |                                       |
  |   or via helm chart                 |                                       |
  |                                     |                                       |
  -------------------------------------------------------------------------------


  [Pod Reaper]
~ Terminates pods that meet conditions based on a configurable set of rules
~ Link: https://github.com/target/pod-reaper

  -------------------------------------------------------------------------------
 [              Advantages              |               Disadvantages            ]
  -------------------------------------------------------------------------------
  | * Developed by Target               | * GitHub repo reflects infrequent     |
  |                                     |   changes (most recent 6mo ago)       |
  | * Filter on namespace, time of day, |                                       |
  |   pod labels                        | * Less functionality than the other   |
  |                                     |   open-source repos                   |
  |                                     |                                       |
  -------------------------------------------------------------------------------


  [Gremlin]
~ Product with a suite of tools to stress-test infrastructure as well as individual applications
~ Link: https://www.gremlin.com/product/

  -------------------------------------------------------------------------------
 [              Advantages              |               Disadvantages            ]
  -------------------------------------------------------------------------------
  | * By far the largest suite of       | * Costs money to implement, I would   |
  |   tools out of the 5                |   require the pro version             |
  |                                     |                                       |
  | * Much more granular blast radius,  | * Cost is unclear from the website    |
  |   ability to by infra-wide or       |                                       |
  |   app-specific                      | * Free version is as feature-sparse   |
  |                                     |   as other repos                      |
  -------------------------------------------------------------------------------

	~ Additional Notes:
		~ {Free Version}:
			~ 3 Users | 5 Hosts
			~ 2 Attack Types: Shutdown | CPU
		~ {Pro Version}:
			~ Unlimited Users | 10+ Hosts
			~ 12 Attack Types: Shutdown | CPU | Black Hole | DNS
					   DNS | Latency | Packet Loss | I/O
					   Disk | Time Travel | Memory
					   Process Killer | App-Level
			~ Training, Guided GameDays, Dedicated Support


  [Powerful Seal]
~ Terminates targeted pods and can take VMs up and down
~ Link: https://github.com/bloomberg/powerfulseal

  -------------------------------------------------------------------------------
 [              Advantages              |               Disadvantages            ]
  -------------------------------------------------------------------------------
  | * Developed by Bloomberg            | * Seal has to SSH into nodes?         |
  |                                     |                                       |
  | * GitHub repo reflects it is being  | * Uses tox to test Python, not        |
  |   actively developed                |   consistent with pytest              |
  |                                     |                                       |
  | * Built-in support for OpenStack,   |                                       |
  |   AWS, and Azure setups             |                                       |
  |                                     |                                       |
  | * 4 different modes to operate in   |                                       |
  |                                     |                                       |
  | * Parameters for iterative/auto     |                                       |
  |   modes are fine-grain              |                                       |
  |                                     |                                       |
  | * Log collector via stdout or       |                                       |
  |   Prometheus                        |                                       |
  |                                     |                                       |
  | * Web UI for logs and configs       |                                       |
  |                                     |                                       |
  -------------------------------------------------------------------------------

	~ Additional Notes:
		~ {Interactive Mode}: manually inject attacks (operates on nodes, pods, deployments, namespaces)
		~ {Autonomous Mode}: reads policy file that contains pod/node scenarios, each scenario lists matches/filters/actions, 
		                     executed on loop in cluster
		~ {Label Mode}: specify which pods to kill with a few options (seal/ label in pods); imperative alternative to autonomous mode
		~ {Demo Mode}: point at a cluster and a Heapster server and let it attempt to figure out what to kill based on resource usage



					::: KEY TAKEAWAYS :::

~ For a Proof of Concept, either PowerfulSeal or KubeMonkey should be used in an experimental environment
~ If project value increases over time, it would be worth considering a license for Gremlin's professional suite
~ It should be fairly straightforward to implement a chaos scheduler in existing Kubernetes clusters
~ PowerfulSeal has modes, which would be very useful for prototyping (start in manual, write autonomous after testing)
~ KubeMonkey has fine-grain options for pod termination, and can be run as a k8s app or helm chart (if I need hyper-specific conditions to terminate)