ChaosEater: Automated Chaos Engineering
- ChaosEater is an automated system for chaos engineering on Kubernetes, leveraging LLM agents to cover the entire CE cycle.
- It decomposes CE into agentic micro-tasks, including fault injection and remediation, enabling efficient, low-cost resilience validation.
- Validated on real-world benchmarks like Nginx and SockShop, ChaosEater demonstrates rapid cycle times and minimal manual intervention.
ChaosEater is an automated system for performing Chaos Engineering (CE) on Kubernetes-based software systems, leveraging agentic orchestration of LLMs to execute the entire CE cycle—hypothesis generation, experiment design and injection, analysis, and remediation—at low cost and with minimal manual intervention. It targets Infrastructure-as-Code paradigms, assigning granular engineering tasks to specialized LLM agents, and is validated on multiple real-world microservice benchmarks including Nginx and SockShop (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).
1. Conceptual Foundations of ChaosEater
Modern distributed systems, typically composed of microservices on Kubernetes, exhibit complex interdependencies that make their resiliency unpredictable under fault conditions. Chaos Engineering is a discipline focused on deliberate fault injection to identify system weaknesses before they cause production failures. The canonical CE cycle consists of four phases:
- Hypothesis (steady-state definition, fault scenario planning)
- Experiment (fault injection, monitoring)
- Analysis (failure detection, root cause analysis)
- Improvement (system reconfiguration, verification)
Industry tools (Netflix Chaos Monkey, AWS FIS, Azure Chaos Studio, Chaos Mesh) automate only experiment execution and metrics collection, leaving hypothesis formulation and remediation highly manual and labor-intensive. ChaosEater was designed to remove those manual engineering bottlenecks by decomposing CE into agentic micro-tasks orchestrated and solved by LLMs, enabling system-agnostic, fully automated CE cycles (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).
2. System Architecture and Agentic Workflow
ChaosEater's architecture is a phased agentic workflow, with each phase subdivided into specialized LLM-driven agents:
| Phase | Agent Tasks | Artifacts / Outputs |
|---|---|---|
| 0. Pre-processing | Cluster deployment, manifest summary, resilience gap analysis | Normalized context, candidate issues |
| 1. Hypothesis | VaC metric selection, inspection script codegen, threshold setting | VaC (Validation as Code) scripts |
| | Failure scenario description, Chaos Mesh manifest generation | Formal hypothesis, structured fault JSON |
| 2. Experiment | Timeline planning, fault injection, VaC orchestration | Chaos Mesh Workflow CRD, experiment logs |
| 3. Analysis | Log aggregation, test outcome assessment, report writing | Failure/countermeasure report |
| 4. Improvement | Manifest (YAML) patch proposal, redundant resource deployment | Updated manifests, improvement diffs |
| Extra-Post | Comprehensive cycle summary, modified IaC archive | Auditable summary, actionable IaC bundle |
During steady-state definition, LLM agents emit Python or k6/JavaScript scripts that query K8s APIs or service endpoints, sample metrics, and apply assertion logic; these scripts serve as automated unit-like "VaC" (Validation as Code) checks. Failure-scenario agents select realistic disruptions (e.g., NetworkPartition, PodKill, StressCPU) from Chaos Mesh and produce parameterized JSON descriptors. The experiment phase orchestrates and applies Chaos Mesh Workflow CRDs, monitoring VaC script execution before, during, and after fault injection. Analysis agents evaluate assertion outcomes and generate structured root-cause/countermeasure reports. Improvement agents propose reconfigurations (e.g., replacing a Pod with a Deployment, incrementing replica count), applying minimal changes that restore the assertions under fault and looping until all VaC checks pass. Post-processing agents summarize the full cycle for audit and reproducibility (Kikuta et al., 11 Nov 2025).
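As a concrete illustration, the sketch below shows what a generated VaC check for a replica-count steady state might look like. The Deployment name, namespace, threshold, and sampling window are assumptions for illustration and are not taken from ChaosEater's published scripts.

```python
# Hypothetical VaC (Validation as Code) check: sample a steady-state metric
# from the Kubernetes API at a fixed interval and assert a threshold.
# Target name, namespace, threshold, and window are illustrative assumptions.
import time
from kubernetes import client, config

NAME, NAMESPACE = "front-end", "sock-shop"   # assumed target workload
THRESHOLD = 2                                # required ready replicas
DURATION_S, INTERVAL_S = 60, 5               # monitoring window / sample period

def ready_replicas(apps: client.AppsV1Api) -> int:
    dep = apps.read_namespaced_deployment(NAME, NAMESPACE)
    return dep.status.ready_replicas or 0

def main() -> None:
    config.load_kube_config()                # load_incluster_config() when run in-cluster
    apps = client.AppsV1Api()
    deadline = time.time() + DURATION_S
    while time.time() < deadline:
        n = ready_replicas(apps)
        # Steady-state assertion: the metric must hold at every sample.
        assert n >= THRESHOLD, f"steady state violated: {n} < {THRESHOLD} ready replicas"
        time.sleep(INTERVAL_S)
    print("VaC check passed: steady state held for the full window")

if __name__ == "__main__":
    main()
```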
3. Formalism and Metrics
ChaosEater applies concise formal definitions for resilience evaluation and cost tracking:
- Steady-State Assertion: for each monitored metric $x_i(t)$ with threshold $\theta_i$ and monitoring interval $T$:

  $$x_i(t) \geq \theta_i \;\; \forall\, t \in T \quad \text{or} \quad \frac{1}{|T|} \int_{T} x_i(t)\, dt \geq \theta_i$$

- Availability Improvement:

  $$\Delta A = A_{\text{post}} - A_{\text{pre}},$$

  where $A$ is the fraction of time VaC scripts pass during fault scenarios.

- Monetary Cost Model:

  $$C_{\text{total}} = \sum_{p \in \text{phases}} c \cdot t_p,$$

  where $c$ is the API per-token (or per-second) cost and $t_p$ is the LLM agent wall time for phase $p$.

- Speedup Over Human Baseline:

  $$S = \frac{T_{\text{human}}}{T_{\text{ChaosEater}}}$$
This approach ensures technical transparency regarding efficiency, costs, and improvement outcomes (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).
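As a worked illustration of these quantities, the snippet below evaluates availability improvement, total API cost, and speedup from hypothetical logged values; all numbers are assumptions, not results reported in the papers.

```python
# Illustrative computation of the metrics above; all values are assumed,
# not results from the ChaosEater papers.

# Availability: fraction of the fault window during which VaC checks pass.
A_before, A_after = 0.40, 0.95
delta_A = A_after - A_before                              # availability improvement

# Monetary cost: per-token API rate times token usage, summed over phases.
c_per_token = 2.5e-6                                      # assumed $/token
tokens = {"hypothesis": 120_000, "experiment": 60_000,
          "analysis": 30_000, "improvement": 90_000}
C_total = sum(c_per_token * n for n in tokens.values())   # total API cost in $

# Speedup relative to a human-operated baseline cycle.
T_human_min, T_chaoseater_min = 600, 25
speedup = T_human_min / T_chaoseater_min

print(f"dA = {delta_A:.2f}, cost = ${C_total:.2f}, speedup = {speedup:.0f}x")
```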
4. Implementation Specifics
ChaosEater relies on Kubernetes-native mechanisms:
- Cluster Deployment/Orchestration: Uses Skaffold for manifest bootstrapping and redeployment after every agent-driven modification.
- Fault Injection: Generates and applies Chaos Mesh Workflow CRDs directly via kubectl; no sidecar or operator extension is required beyond default Chaos Mesh.
- LLM Agent Chain: Underlying agents run on OpenAI GPT-4o (temperature 0), each with specialized system prompts, few-shot schema templating, and contextual memory limited to the active phase. Conversation logs and templates are published for full reproducibility.
- Self-Debugging Capability: Agents repair their own output (e.g., fixing YAML/Python/k6 syntax) using error logs passed back in verification loops (see the sketch after this list).
- Output Delivery: Returns both a full narrative summary and a modified IaC folder (manifests plus Skaffold config) post-cycle (Kikuta et al., 19 Jan 2025).
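The self-debugging behavior follows a common generate-validate-repair pattern. The sketch below captures that loop with placeholder `call_llm` and `validate` callables, which are assumptions rather than ChaosEater's actual interfaces.

```python
# Sketch of the self-debugging loop: generate an artifact, validate it
# (e.g., YAML/Python/k6 syntax check or a kubectl dry-run), and feed the
# error log back to the agent until validation succeeds.
# `call_llm` and `validate` are placeholders, not ChaosEater's real APIs.
from typing import Callable, Tuple

def self_debug(task_prompt: str,
               call_llm: Callable[[str], str],
               validate: Callable[[str], Tuple[bool, str]],
               max_attempts: int = 3) -> str:
    artifact = call_llm(task_prompt)
    for _ in range(max_attempts):
        ok, error_log = validate(artifact)
        if ok:
            return artifact
        # Pass the verification error back so the agent repairs its own output.
        repair_prompt = (f"{task_prompt}\n\nYour previous output failed validation:\n"
                         f"{error_log}\nReturn a corrected version.")
        artifact = call_llm(repair_prompt)
    raise RuntimeError("artifact still invalid after self-debugging attempts")
```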
5. Experimental Results and Cost Analysis
ChaosEater was validated on two benchmarks:
- Nginx (2 manifests, Pod+Service):
- Median CE cycle: ~11 min, \$0.21 API cost per run.
- Replaced Pod with Deployment, increased replicas; stable, fully automated in all runs.
- SockShop (29 manifests, large-scale):
- Median CE cycle: ~25 min, \$0.84 API cost per run.
- Added redundancy to the front-end service (replicas increased from 1 to 2); converged in 4/5 runs and remained stable with no runtime errors.
Phase-wise breakdown (SockShop, median):
| Phase | Time (min) | Cost (\$) |
|---|---|---|
| Pre-process | 4.6 | 0.13 |
| Hypothesis | 4.3 | 0.41 |
| Experiment | 3.3 | 0.16 |
| Analysis | 0.6 | 0.04 |
| Improvement | 4.3 | 0.04 |
| Post-process | 0.4 | 0.05 |
CE cycles typically converge within 1–2 improvement loops and are completed without errors (Kikuta et al., 11 Nov 2025).
6. Qualitative Validation and Human Assessment
ChaosEater’s outputs were evaluated by external human SRE engineers as well as LLM-based raters (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5), each reviewing cycle artifacts including hypotheses, experiment plans, analysis reports, and remediation manifests. Average scores across categories (“Hypothesis,” “Experiment Plan,” “Analysis,” “Improvement,” “Overall”) all exceeded 3.0, corresponding to actionable and reasonable cycles. Human review confirmed best-practice implementation, such as converting stateless Pods to Deployments with redundancy and providing minimal, direct configuration changes (Kikuta et al., 11 Nov 2025). Self-debugging loops demonstrated reliable error correction without human intervention.
7. Scope, Limitations, and Prospects
ChaosEater’s scope is currently confined to Kubernetes manifest (YAML) reconfigurations; it does not modify application code, infrastructure-as-code beyond K8s, or front-end assets. Operational use is restricted to dev/staging clusters since output is not yet audit-controlled for production risk. The CE cycles are single-shot; prolonged or multi-cycle CE for long-horizon vulnerability discovery is not yet implemented.
Key limitations include:
- Requirement for blast-radius control and rollback in production environments.
- Tight coupling of agent prompt templates to the GPT-4o family.
- Single cycles surface only shallow vulnerabilities in well-hardened systems.
Planned future research directions are:
- Multi-cycle CE and historical telemetry learning for deeper risk quantification.
- Model-agnostic prompt tuning and fine-tuning for CE agent corpora.
- Porting the agentic workflow to alternative orchestrators (ECS, Nomad, serverless).
- Policy-engine integration for guardrails and compliance.
- Graph-based selection for scaling CE to very large microservice graphs.
A plausible implication is that embedding ChaosEater into CI/CD pipelines provides continuous, automated resilience validation for microservice releases. Its auditable outputs facilitate compliance and forensic analysis. The system reduces CE cost to less than \$1 and cycle time to under 30 minutes for common applications without requiring CE expertise (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).