ChaosEater: Automated Chaos Engineering
- ChaosEater is an automated system for chaos engineering on Kubernetes, leveraging LLM agents to cover the entire CE cycle.
- It decomposes CE into agentic micro-tasks, including fault injection and remediation, enabling efficient, low-cost resilience validation.
- Validated on real-world benchmarks like Nginx and SockShop, ChaosEater demonstrates rapid cycle times and minimal manual intervention.
ChaosEater is an automated system for performing Chaos Engineering (CE) on Kubernetes-based software systems, leveraging agentic orchestration of LLMs to execute the entire CE cycle—hypothesis generation, experiment design and injection, analysis, and remediation—at low cost and with minimal manual intervention. It targets Infrastructure-as-Code paradigms, assigning granular engineering tasks to specialized LLM agents, and is validated on multiple real-world microservice benchmarks including Nginx and SockShop (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).
1. Conceptual Foundations of ChaosEater
Modern distributed systems, typically composed of microservices on Kubernetes, exhibit complex interdependencies that make their resiliency unpredictable under fault conditions. Chaos Engineering is a discipline focused on deliberate fault injection to identify system weaknesses before they cause production failures. The canonical CE cycle consists of four phases:
- Hypothesis (steady-state definition, fault scenario planning)
- Experiment (fault injection, monitoring)
- Analysis (failure detection, root cause analysis)
- Improvement (system reconfiguration, verification)
Industry tools (Netflix Chaos Monkey, AWS FIS, Azure Chaos Studio, Chaos Mesh) automate only experiment execution and metrics collection, leaving hypothesis formulation and remediation highly manual and labor-intensive. ChaosEater was designed to remove those manual engineering bottlenecks by decomposing CE into agentic micro-tasks orchestrated and solved by LLMs, enabling system-agnostic, fully automated CE cycles (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).
2. System Architecture and Agentic Workflow
ChaosEater's architecture is a phased agentic workflow, with each phase subdivided into specialized LLM-driven agents:
| Phase | Agent Tasks | Artifacts / Outputs |
|---|---|---|
| 0. Pre-processing | Cluster deployment, manifest summary, resilience gap analysis | Normalized context, candidate issues |
| 1. Hypothesis | VaC metric selection, inspection script codegen, threshold setting | VaC (Validation as Code) scripts |
| | Failure scenario description, Chaos Mesh manifest generation | Formal hypothesis, structured fault JSON |
| 2. Experiment | Timeline planning, fault injection, VaC orchestration | Chaos Mesh Workflow CRD, experiment logs |
| 3. Analysis | Log aggregation, test outcome assessment, report writing | Failure/countermeasure report |
| 4. Improvement | Manifest (YAML) patch proposal, redundant resource deployment | Updated manifests, improvement diffs |
| Extra-Post | Comprehensive cycle summary, modified IaC archive | Auditable summary, actionable IaC bundle |
During steady-state definition, LLM agents emit Python or k6/JavaScript scripts that query K8s APIs or service endpoints, sample metrics, and apply assertion logic; these scripts serve as automated unit-like "VaC" (Validation as Code) checks. Failure-scenario agents select realistic disruptions (e.g., NetworkPartition, PodKill, StressCPU) from Chaos Mesh and produce parameterized JSON descriptors. The experiment phase orchestrates and applies Chaos Mesh Workflow CRDs, monitoring VaC script execution before, during, and after fault injection. Analysis agents evaluate assertion outcomes and generate structured root-cause/countermeasure reports. Improvement agents propose reconfigurations (e.g., replacing a Pod with a Deployment, incrementing replica count), applying minimal changes that restore the assertions under fault and looping until all VaC checks pass. Post-processing agents summarize the full cycle for audit and reproducibility (Kikuta et al., 11 Nov 2025).
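As a concrete illustration, the sketch below shows what a generated VaC check for a replica-count steady state might look like. The Deployment name, namespace, threshold, and sampling window are assumptions for illustration and are not taken from ChaosEater's published scripts.

```python
# Hypothetical VaC (Validation as Code) check: sample a steady-state metric
# from the Kubernetes API at a fixed interval and assert a threshold.
# Target name, namespace, threshold, and window are illustrative assumptions.
import time
from kubernetes import client, config

NAME, NAMESPACE = "front-end", "sock-shop"   # assumed target workload
THRESHOLD = 2                                # required ready replicas
DURATION_S, INTERVAL_S = 60, 5               # monitoring window / sample period

def ready_replicas(apps: client.AppsV1Api) -> int:
    dep = apps.read_namespaced_deployment(NAME, NAMESPACE)
    return dep.status.ready_replicas or 0

def main() -> None:
    config.load_kube_config()                # load_incluster_config() when run in-cluster
    apps = client.AppsV1Api()
    deadline = time.time() + DURATION_S
    while time.time() < deadline:
        n = ready_replicas(apps)
        # Steady-state assertion: the metric must hold at every sample.
        assert n >= THRESHOLD, f"steady state violated: {n} < {THRESHOLD} ready replicas"
        time.sleep(INTERVAL_S)
    print("VaC check passed: steady state held for the full window")

if __name__ == "__main__":
    main()
```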
3. Formalism and Metrics
ChaosEater applies concise formal definitions for resilience evaluation and cost tracking:
- Steady-State Assertion: for each monitored metric $x_i(t)$ with threshold $\theta_i$ and monitoring interval $T$:

  $$x_i(t) \geq \theta_i \;\; \forall\, t \in T \quad \text{or} \quad \frac{1}{|T|} \int_{T} x_i(t)\, dt \geq \theta_i$$

- Availability Improvement:

  $$\Delta A = A_{\text{post}} - A_{\text{pre}},$$

  where $A$ is the fraction of time VaC scripts pass during fault scenarios.

- Monetary Cost Model:

  $$C_{\text{total}} = \sum_{p \in \text{phases}} c \cdot t_p,$$

  where $c$ is the API per-token (or per-second) cost and $t_p$ is the LLM agent wall time for phase $p$.

- Speedup Over Human Baseline:

  $$S = \frac{T_{\text{human}}}{T_{\text{ChaosEater}}}$$
This approach ensures technical transparency regarding efficiency, costs, and improvement outcomes (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).
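As a worked illustration of these quantities, the snippet below evaluates availability improvement, total API cost, and speedup from hypothetical logged values; all numbers are assumptions, not results reported in the papers.

```python
# Illustrative computation of the metrics above; all values are assumed,
# not results from the ChaosEater papers.

# Availability: fraction of the fault window during which VaC checks pass.
A_before, A_after = 0.40, 0.95
delta_A = A_after - A_before                              # availability improvement

# Monetary cost: per-token API rate times token usage, summed over phases.
c_per_token = 2.5e-6                                      # assumed $/token
tokens = {"hypothesis": 120_000, "experiment": 60_000,
          "analysis": 30_000, "improvement": 90_000}
C_total = sum(c_per_token * n for n in tokens.values())   # total API cost in $

# Speedup relative to a human-operated baseline cycle.
T_human_min, T_chaoseater_min = 600, 25
speedup = T_human_min / T_chaoseater_min

print(f"dA = {delta_A:.2f}, cost = ${C_total:.2f}, speedup = {speedup:.0f}x")
```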
4. Implementation Specifics
ChaosEater relies on Kubernetes-native mechanisms:
- Cluster Deployment/Orchestration: Uses Skaffold for manifest bootstrapping and redeployment after every agent-driven modification.
- Fault Injection: Generates and applies Chaos Mesh Workflow CRDs directly via kubectl; no sidecar or operator extension is required beyond default Chaos Mesh.
- LLM Agent Chain: Underlying agents run on OpenAI GPT-4o (temperature 0), each with specialized system prompts, few-shot schema templating, and contextual memory limited to the active phase. Conversation logs and templates are published for full reproducibility.
- Self-Debugging Capability: Agents repair their own output (e.g., fixing YAML/Python/k6 syntax) using error logs passed back in verification loops (see the sketch after this list).
- Output Delivery: Returns both a full narrative summary and a modified IaC folder (manifests plus Skaffold config) post-cycle (Kikuta et al., 19 Jan 2025).
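The self-debugging behavior follows a common generate-validate-repair pattern. The sketch below captures that loop with placeholder `call_llm` and `validate` callables, which are assumptions rather than ChaosEater's actual interfaces.

```python
# Sketch of the self-debugging loop: generate an artifact, validate it
# (e.g., YAML/Python/k6 syntax check or a kubectl dry-run), and feed the
# error log back to the agent until validation succeeds.
# `call_llm` and `validate` are placeholders, not ChaosEater's real APIs.
from typing import Callable, Tuple

def self_debug(task_prompt: str,
               call_llm: Callable[[str], str],
               validate: Callable[[str], Tuple[bool, str]],
               max_attempts: int = 3) -> str:
    artifact = call_llm(task_prompt)
    for _ in range(max_attempts):
        ok, error_log = validate(artifact)
        if ok:
            return artifact
        # Pass the verification error back so the agent repairs its own output.
        repair_prompt = (f"{task_prompt}\n\nYour previous output failed validation:\n"
                         f"{error_log}\nReturn a corrected version.")
        artifact = call_llm(repair_prompt)
    raise RuntimeError("artifact still invalid after self-debugging attempts")
```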
5. Experimental Results and Cost Analysis
ChaosEater was validated on two benchmarks:
- Nginx (2 manifests, Pod+Service):
- Median CE cycle: ~11 min, \$0.21 API cost per run.
- Replaced Pod with Deployment, increased replicas; stable, fully automated in all runs.
- SockShop (29 manifests, large-scale):
- Median CE cycle: ~25 min, \$0.84 API cost per run.
- Added redundancy to the front-end service (replicas increased from 1 to 2); converged in 4/5 runs and remained stable with no runtime errors.
Phase-wise breakdown (SockShop, median):
| Phase | Time (min) | Cost (\$) |
|---|---|---|
| Pre-process | 4.6 | 0.13 |
| Hypothesis | 4.3 | 0.41 |
| Experiment | 3.3 | 0.16 |
| Analysis | 0.6 | 0.04 |
| Improvement | 4.3 | 0.04 |
| Post-process | 0.4 | 0.05 |
CE cycles typically converge within 1–2 improvement loops and are completed without errors (Kikuta et al., 11 Nov 2025).
6. Qualitative Validation and Human Assessment
ChaosEater’s outputs were evaluated by external human SRE engineers as well as LLM-based raters (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5), each reviewing cycle artifacts including hypotheses, experiment plans, analysis reports, and remediation manifests. Average scores across categories (“Hypothesis,” “Experiment Plan,” “Analysis,” “Improvement,” “Overall”) all exceeded 3.0, corresponding to actionable and reasonable cycles. Human review confirmed best-practice implementation, such as converting stateless Pods to Deployments with redundancy and providing minimal, direct configuration changes (Kikuta et al., 11 Nov 2025). Self-debugging loops demonstrated reliable error correction without human intervention.
7. Scope, Limitations, and Prospects
ChaosEater’s scope is currently confined to Kubernetes manifest (YAML) reconfigurations; it does not modify application code, infrastructure-as-code beyond K8s, or front-end assets. Operational use is restricted to dev/staging clusters since output is not yet audit-controlled for production risk. The CE cycles are single-shot; prolonged or multi-cycle CE for long-horizon vulnerability discovery is not yet implemented.
Key limitations include:
- Requirement for blast-radius control and rollback in production environments.
- Tight coupling of agent prompt templates to the GPT-4o family.
- Single cycles surface only shallow vulnerabilities in well-hardened systems.
Planned future research directions are:
- Multi-cycle CE and historical telemetry learning for deeper risk quantification.
- Model-agnostic prompt tuning and fine-tuning for CE agent corpora.
- Porting the agentic workflow to alternative orchestrators (ECS, Nomad, serverless).
- Policy-engine integration for guardrails and compliance.
- Graph-based selection for scaling CE to very large microservice graphs.
A plausible implication is that embedding ChaosEater into CI/CD pipelines provides continuous, automated resilience validation for microservice releases. Its auditable outputs facilitate compliance and forensic analysis. The system reduces CE cost to less than \$1 and cycle time to under 30 minutes for common applications without requiring CE expertise (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).