
ChaosEater: Automated Chaos Engineering

Updated 18 November 2025
  • ChaosEater is an automated system for chaos engineering on Kubernetes, leveraging LLM agents to cover the entire CE cycle.
  • It decomposes CE into agentic micro-tasks—including fault injection and remediation—ensuring efficient, low-cost resilience validation.
  • Validated on real-world benchmarks like Nginx and SockShop, ChaosEater demonstrates rapid cycle times and minimal manual intervention.

ChaosEater is an automated system for performing Chaos Engineering (CE) on Kubernetes-based software systems, leveraging agentic orchestration of LLMs to execute the entire CE cycle—hypothesis generation, experiment design and injection, analysis, and remediation—at low cost and with minimal manual intervention. It targets Infrastructure-as-Code paradigms, assigning granular engineering tasks to specialized LLM agents, and is validated on multiple real-world microservice benchmarks including Nginx and SockShop (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).

1. Conceptual Foundations of ChaosEater

Modern distributed systems, typically composed of microservices running on Kubernetes, exhibit complex interdependencies that render their resiliency unpredictable under fault conditions. Chaos Engineering is a discipline focused on deliberate fault injection to identify system weaknesses before production failures. The canonical CE cycle consists of four phases:

  1. Hypothesis (steady-state definition, fault scenario planning)
  2. Experiment (fault injection, monitoring)
  3. Analysis (failure detection, root cause analysis)
  4. Improvement (system reconfiguration, verification)

Industry tools (Netflix Chaos Monkey, AWS FIS, Azure Chaos Studio, Chaos Mesh) automate only experiment execution and metrics collection, leaving hypothesis formulation and remediation highly manual and labor-intensive. ChaosEater was designed to remove those manual engineering bottlenecks by decomposing CE into agentic micro-tasks orchestrated and solved by LLMs, enabling system-agnostic, fully automated CE cycles (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).

2. System Architecture and Agentic Workflow

ChaosEater's architecture is a phased agentic workflow, with each phase subdivided into specialized LLM-driven agents:

| Phase | Agent Tasks | Artifacts / Outputs |
| --- | --- | --- |
| 0. Pre-processing | Cluster deployment, manifest summary, resilience gap analysis | Normalized context, candidate issues |
| 1. Hypothesis | VaC metric selection, inspection script codegen, threshold setting | VaC scripts (Validation as Code) |
| | Failure scenario description, Chaos Mesh manifest generation | Formal hypothesis, structured fault JSON |
| 2. Experiment | Timeline planning, fault injection, VaC orchestration | Chaos Mesh Workflow CRD, experiment logs |
| 3. Analysis | Log aggregation, test outcome assessment, report writing | Failure/countermeasure report |
| 4. Improvement | Manifest (YAML) patch proposal, redundant resource deployment | Updated manifests, improvement diffs |
| Post-processing | Comprehensive cycle summary, modified IaC archive | Auditable summary, actionable IaC bundle |

During steady-state definition, LLM agents emit Python or k6/JavaScript scripts that query K8s APIs or service endpoints, sample metrics, and apply assertion logic; these scripts serve as automated, unit-test-like "VaC" (Validation as Code) checks. Failure scenario agents select realistic disruptions (e.g., NetworkPartition, PodKill, StressCPU) from Chaos Mesh, producing parameterized JSON descriptors. The experiment phase orchestrates and applies Chaos Mesh Workflow CRDs, monitoring VaC script execution before, during, and after fault injection. Analysis agents evaluate assertion outcomes and generate structured root-cause/countermeasure reports. Improvement agents propose reconfigurations (e.g., replacing a Pod with a Deployment, incrementing the replica count), applying minimal changes to restore the assertions under fault and iterating until all VaC checks pass. Post-processing agents summarize the full cycle for audit and reproducibility (Kikuta et al., 11 Nov 2025).
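
The listing below is a minimal sketch of what such an agent-generated VaC check might look like in Python. The endpoint URL, sampling window, polling period, and success-rate threshold are illustrative assumptions, not values or interfaces from the papers; the actual scripts ChaosEater emits may query K8s APIs instead and will differ in detail.

```python
# Hypothetical VaC (Validation as Code) check: sample an HTTP endpoint for a
# fixed interval and assert that the success rate meets a steady-state threshold.
import time
import requests

SERVICE_URL = "http://front-end.sock-shop.svc.cluster.local"  # assumed endpoint
DURATION_S = 60           # sampling interval (the Delta in Section 3)
THRESHOLD_OK_RATE = 0.9   # assumed steady-state threshold T_m for this metric

def sample_success_rate(url: str, duration_s: int, period_s: float = 1.0) -> float:
    """Poll the endpoint once per period and return the fraction of 200 responses."""
    ok, total = 0, 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        total += 1
        try:
            if requests.get(url, timeout=2).status_code == 200:
                ok += 1
        except requests.RequestException:
            pass  # a failed request counts as a failed sample
        time.sleep(period_s)
    return ok / total if total else 0.0

if __name__ == "__main__":
    rate = sample_success_rate(SERVICE_URL, DURATION_S)
    # Assertion logic: the steady-state hypothesis m(t) >= T_m must hold.
    assert rate >= THRESHOLD_OK_RATE, f"steady state violated: success rate {rate:.2f}"
    print(f"steady state holds: success rate {rate:.2f}")
```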

3. Formalism and Metrics

ChaosEater applies concise formal definitions for resilience evaluation and cost tracking:

  • Steady-State Assertion: For each monitored metric $m$, with threshold $T_m$ and interval $t \in [0, \Delta]$:

$$m(t) \le T_m \quad \text{or} \quad m(t) \ge T_m$$

  • Availability Improvement:

$$\Delta A = A_{\rm after} - A_{\rm before}$$

where $A$ is the fraction of time VaC scripts pass during fault scenarios.

  • Monetary Cost Model:

$$C = \sum_{p \in \{\text{phases}\}} c_p \cdot t_p$$

Here $c_p$ is the API per-token (or per-second) cost and $t_p$ is the LLM agent wall time for phase $p$.

  • Speedup Over Human Baseline:

$$\text{Speedup} = \frac{T_{\rm human}}{T_{\rm ChaosEater}}$$

This approach ensures technical transparency regarding efficiency, costs, and improvement outcomes (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).
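
As a concrete illustration of how these definitions combine, the sketch below plugs purely hypothetical numbers into the availability, cost, and speedup formulas. All values (per-phase cost rates, wall times, and the human baseline) are assumptions for the example, not figures reported in the papers, and the cost rates are expressed per minute of agent time only for convenience.

```python
# Worked example of the metrics above, with illustrative (non-paper) numbers.
A_before, A_after = 0.40, 0.95        # fraction of time VaC checks pass under fault
delta_A = A_after - A_before          # availability improvement: 0.55

# assumed per-phase (cost rate in $/min of agent wall time, wall time in min)
phases = {"hypothesis": (0.10, 4.0), "experiment": (0.05, 3.0),
          "analysis": (0.06, 1.0), "improvement": (0.01, 4.0)}
C = sum(c_p * t_p for c_p, t_p in phases.values())            # monetary cost model

T_human = 8 * 60                                              # assumed human baseline (min)
T_chaoseater = sum(t for _, t in phases.values())             # agent wall time (min)
speedup = T_human / T_chaoseater                              # speedup over human baseline

print(f"delta_A = {delta_A:.2f}, C = ${C:.2f}, speedup = {speedup:.0f}x")
```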

4. Implementation Specifics

ChaosEater relies on Kubernetes-native mechanisms:

  • Cluster Deployment/Orchestration: Uses Skaffold for manifest bootstrapping and redeployment after every agent-driven modification.
  • Fault Injection: Generates and applies Chaos Mesh Workflow CRDs directly via kubectl; no sidecar or operator extension required beyond default Chaos Mesh.
  • LLM Agent Chain: Underlying agents run on OpenAI GPT-4o (temperature 0), each with specialized system prompts, few-shot schema templating, and contextual memory limited to active phase. Conversation logs and templates are published for full reproducibility.
  • Self-Debugging Capability: Agents repair their own output (e.g., fix YAML/Python/k6 syntax) using error logs passed in verification loops; see the sketch after this list.
  • Output Delivery: Returns both a full narrative summary and a modified IaC folder (manifests plus Skaffold config) post-cycle (Kikuta et al., 19 Jan 2025).
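
The self-debugging behavior referenced above can be summarized by the following minimal sketch. Here `generate`, `validate`, and `max_attempts` are hypothetical stand-ins for the LLM call and the verification step (e.g., YAML linting, a dry-run apply, or executing the generated script); they are not interfaces defined by ChaosEater.

```python
# Minimal sketch of a self-debugging loop: validate an agent's artifact and, on
# failure, feed the error log back into the next generation attempt.
from typing import Callable, Optional, Tuple

def self_debug(
    generate: Callable[[Optional[str]], str],     # agent call; receives the prior error log
    validate: Callable[[str], Tuple[bool, str]],  # returns (ok, error_log)
    max_attempts: int = 3,
) -> str:
    error_log: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        artifact = generate(error_log)        # regenerate, conditioning on the last error
        ok, error_log = validate(artifact)    # e.g., lint, dry-run apply, or run the script
        if ok:
            return artifact
    raise RuntimeError(f"self-debugging failed after {max_attempts} attempts:\n{error_log}")
```

This mirrors the verification loops described above, in which error logs are passed back into the agent's context until the output validates.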

5. Experimental Results and Cost Analysis

ChaosEater was validated on two benchmarks:

  • Nginx (2 manifests, Pod+Service):
    • Median CE cycle: ~11 min, \$0.21 API cost per run.
    • Replaced Pod with Deployment, increased replicas; stable, fully automated in all runs.
  • SockShop (29 manifests, large-scale):
    • Median CE cycle: ~25 min, \$0.84 API cost per run.
    • Added redundancy to the front-end service where needed (replicas increased from 1 to 2); converged in 4/5 runs, stable with no runtime errors.

Phase-wise breakdown (SockShop, median):

| Phase | Time (min) | Cost (\$) |
| --- | --- | --- |
| Pre-process | 4.6 | 0.13 |
| Hypothesis | 4.3 | 0.41 |
| Experiment | 3.3 | 0.16 |
| Analysis | 0.6 | 0.04 |
| Improvement | 4.3 | 0.04 |
| Post-process | 0.4 | 0.05 |

CE cycles typically converge within 1–2 improvement loops and are completed without errors (Kikuta et al., 11 Nov 2025).

6. Qualitative Validation and Human Assessment

ChaosEater’s outputs were evaluated by external human site reliability engineers (SREs) as well as LLM-based raters (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5), each reviewing cycle artifacts including hypotheses, experiment plans, analysis reports, and remediation manifests. Average scores across the categories (“Hypothesis,” “Experiment Plan,” “Analysis,” “Improvement,” “Overall”) all exceeded 3.0, corresponding to actionable and reasonable cycles. Human review confirmed best-practice implementation, such as converting stateless Pods to Deployments with redundancy and providing minimal, direct configuration changes (Kikuta et al., 11 Nov 2025). Self-debugging loops demonstrated reliable error correction without human intervention.

7. Scope, Limitations, and Prospects

ChaosEater’s scope is currently confined to Kubernetes manifest (YAML) reconfigurations; it does not modify application code, infrastructure-as-code beyond K8s, or front-end assets. Operational use is restricted to dev/staging clusters since output is not yet audit-controlled for production risk. The CE cycles are single-shot; prolonged or multi-cycle CE for long-horizon vulnerability discovery is not yet implemented.

Key limitations include:

  • Requirement for blast-radius control and rollback in production environments.
  • Tight coupling of agent prompt templates to the GPT-4o model family.
  • Detection of only shallow vulnerabilities in well-hardened systems within a single cycle.

Planned future research directions are:

  • Multi-cycle CE and historical telemetry learning for deeper risk quantification.
  • Model-agnostic prompt tuning and fine-tuning for CE agent corpora.
  • Porting the agentic workflow to alternative orchestrators (ECS, Nomad, serverless).
  • Policy-engine integration for guardrails and compliance.
  • Graph-based selection for scaling CE to very large microservice graphs.

A plausible implication is that embedding ChaosEater into CI/CD pipelines provides continuous, automated resilience validation for microservice releases. Its auditable outputs facilitate compliance and forensic analysis. The system reduces CE cost to less than \$1 and cycle time to under 30 minutes for common applications without requiring CE expertise (Kikuta et al., 11 Nov 2025, Kikuta et al., 19 Jan 2025).
