CleanAgent: Safe & Efficient AI Agents
- CleanAgent is a framework that ensures safe, efficient AI operations by combining modular, schema-driven orchestration with declarative data standardization.
- The system integrates a multi-agent LLM stack and a one-shot, hands-free workflow for automated data cleaning, reducing manual coding and errors.
- Formal methods such as capability tracking and role separation are employed to enforce system safety, prevent context pollution, and enhance performance.
CleanAgent denotes a set of methodologies and system architectures aimed at facilitating safer, more robust, and efficient AI agent operation, especially in code-as-action and data standardization settings. The concept encompasses concrete implementations ranging from declarative data standardization with LLM-driven automation in Python to statically enforced safety harnesses in capability-safe languages like Scala 3, as well as meta-architectures such as multi-agent role separation to counteract context pollution. CleanAgent emphasizes modular, schema-driven orchestration, explicit separation of strategic and implementation states, and the minimization of failure-prone or dangerous code patterns.
1. Architectural Origins and Motivation
CleanAgent frameworks emerge in response to two classes of challenges: (a) the practical complexity and fragility of automating data transformations with LLMs and (b) the need for principled safety and reliability guarantees in agents that generate or execute code. For data tasks, legacy approaches demanded manual, error-prone coding or brittle prompt engineering. Simultaneously, in interactive agent environments, risks related to context pollution, information leakage, and untracked side effects became prohibitive, motivating designs grounded in strict type systems and modular agent orchestration (Qi et al., 2024, Fei et al., 21 Jan 2026, Odersky et al., 1 Mar 2026).
2. Declarative Agent-Driven Data Standardization
A primary instantiation of CleanAgent is the CleanAgent data standardization framework, integrating Dataprep.Clean (a Python library offering one-line standardization APIs for diverse column types) with a multi-agent LLM stack. The typical architecture comprises:
- Dataprep.Clean library: Currently exposes 142 type-specialized cleaning functions (date, address, phone, IP address, etc.), each of which runs a Split–Validate–Transform pipeline on every call.
- LLM-based multi-agent system: Implements strict role decomposition: a Chat Manager (global memory & orchestration), Column-type Annotator (schema inference), Python Programmer (one-shot code generation using Dataprep.Clean), and Code Executor (sandboxed execution).
- Web application interface: A front end for data upload and progress visualization and a back end built on Flask or FastAPI, calling into LLM-driven CleanAgent pipelines.
This architecture supports a one-shot, hands-free workflow: upon specifying requirements (e.g., “standardize ‘admission’ to MM/DD/YYYY hh:mm:ss, ‘addr’ to {street},LA,{zipcode}”), the system auto-annotates, generates code, and applies cleaning functions until successful output is produced (Qi et al., 2024).
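The role decomposition and the Split–Validate–Transform step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the CleanAgent or Dataprep.Clean implementation: `clean_date`, `annotate_column`, and `standardize` are hypothetical names, a regex heuristic stands in for the LLM-based annotator, the cleaner assumes year-month-day input order, and the retry-until-success loop is omitted.

```python
import re
from datetime import datetime
from typing import Callable, Dict, List, Optional


def clean_date(raw: str, out_fmt: str = "%m/%d/%Y %H:%M:%S") -> Optional[str]:
    """Hypothetical type-specialized cleaner (not the Dataprep.Clean API)."""
    # Split: pull the numeric tokens out of the raw string.
    tokens = re.findall(r"\d+", raw)
    if len(tokens) < 3:
        return None
    # Assumption: tokens arrive as year, month, day[, hour, minute, second].
    parts = [int(t) for t in tokens[:6]]
    parts += [0] * (6 - len(parts))
    # Validate: datetime() raises ValueError for impossible dates or times.
    try:
        parsed = datetime(*parts)
    except ValueError:
        return None
    # Transform: render the value in the requested target format.
    return parsed.strftime(out_fmt)


# Registry of cleaners keyed by annotated column type.
CLEANERS: Dict[str, Callable[[str], Optional[str]]] = {"date": clean_date}


def annotate_column(values: List[str]) -> str:
    """Column-type Annotator stub: a regex heuristic replaces the LLM call."""
    pattern = r"\d{4}\D\d{1,2}\D\d{1,2}"
    return "date" if all(re.search(pattern, v) for v in values) else "unknown"


def standardize(table: Dict[str, List[str]]) -> Dict[str, List[Optional[str]]]:
    """Annotate each column, plan one cleaner per column, then execute the plan."""
    # "Programmer" step: the plan is a mapping from column name to cleaner name.
    plan = {}
    for col, values in table.items():
        col_type = annotate_column(values)
        if col_type in CLEANERS:
            plan[col] = col_type
    # "Executor" step: apply the planned cleaners; unplanned columns pass through.
    cleaned: Dict[str, List[Optional[str]]] = {c: list(v) for c, v in table.items()}
    for col, col_type in plan.items():
        cleaned[col] = [CLEANERS[col_type](v) for v in table[col]]
    return cleaned


table = {"admission": ["2024-03-05 14:07", "2023-12-31"], "name": ["Ann", "Bob"]}
print(standardize(table)["admission"])
# prints ['03/05/2024 14:07:00', '12/31/2023 00:00:00']
```

Keeping the plan as a small column-to-cleaner mapping mirrors the idea that the code generator only has to choose among declarative, type-specific calls rather than write parsing logic itself.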
3. Formal Methods and Safety for Code-as-Action Agents
For agent safety in environments that involve critical resource manipulation or classified data, CleanAgent denotes a “safety harness” construction realized in a capability-safe language, concretely Scala 3 with capture checking. The system statically regulates effectful actions and information flow by encoding resource access as first-class “capabilities”: tracked program variables subject to a type-and-effect discipline.
Key formal apparatus includes:
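- Types and Capture Sets: Tracked function types $A \to^{C} B$ capture only capabilities in the set $C$, with $A \to B$ as the pure ($C = \{\}$) case.
- Subtyping via Set Inclusion: Capability sets are partially ordered by inclusion; $A \to^{C_1} B <: A \to^{C_2} B$ if $C_1 \subseteq C_2$.
- Typing Judgments: Formally, $\Gamma \vdash e : T \;\hat{}\; C$ denotes $e$’s type $T$ and its set $C$ of effectful resource dependencies.
- Local Purity Enforcement: All subcomputations that process classified data must type-check as capability-free functions $A \to B$; this is the requirement placed on any function applied to a classified value.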
- Safety Theorems: Preservation and progress hold; critically, “no forging” ensures capabilities cannot be unsafely constructed, and noninterference ensures pure functions never leak classified data (Odersky et al., 1 Mar 2026).
Examples of accepted and statically rejected code illustrate how disallowed side effects, such as unintended output or information flow, are prevented, with every capability flow traced statically.
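The capture-set and set-inclusion rules listed above can be modeled in a few lines. Python has no capture checking, so the sketch below is only a runtime model over explicit type descriptions, not the static Scala 3 mechanism; `FnType`, `is_subtype`, and `check_classified_map` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class FnType:
    """Model of a tracked function type: parameter, result, and capture set."""
    param: str
    result: str
    captures: FrozenSet[str] = frozenset()

    @property
    def is_pure(self) -> bool:
        # A pure function type has an empty capture set.
        return not self.captures


def is_subtype(sub: FnType, sup: FnType) -> bool:
    """Subtyping by set inclusion: same signature, captures(sub) within captures(sup)."""
    return (
        sub.param == sup.param
        and sub.result == sup.result
        and sub.captures <= sup.captures
    )


def check_classified_map(fn: FnType) -> None:
    """Local purity: reject any function type that captures a capability."""
    if not fn.is_pure:
        raise TypeError(
            f"function captures {sorted(fn.captures)}; a pure function is required"
        )


redact = FnType("String", "String")                                   # pure
log_and_return = FnType("String", "String", frozenset({"console"}))  # effectful

print(is_subtype(redact, log_and_return))   # prints True
print(is_subtype(log_and_return, redact))   # prints False
check_classified_map(redact)                # accepted
# check_classified_map(log_and_return) would raise TypeError
```

A pure function is usable wherever an effectful one is expected, but not the reverse, which is what lets a checker refuse code that tries to pass a console-writing function into a context that handles classified data.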
4. Multi-Agent Role Separation and State Management
CleanAgent is further generalized by the CodeDelegator framework, which addresses “context pollution” in long-horizon code-as-action agents by dividing responsibilities between two agent types:
- Delegator (strategic planner): Decomposes tasks, writes formal specifications, and monitors progress; never executes code.
- Coder (ephemeral implementer): Receives a clean, minimal subtask specification, generates and executes code in an isolated environment, and returns only high-level results.
This duality is captured by Ephemeral-Persistent State Separation (EPSS), where the persistent orchestration state tracks only validated artifacts and specifications, and all execution traces or runtime errors remain confined to ephemeral Coder state. As a result, planning context remains uncontaminated by low-level implementation failures. The overall effect is quantifiably improved long-horizon success, as demonstrated empirically (Fei et al., 21 Jan 2026).
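A minimal sketch of this state separation is shown below. It is an illustration under stated assumptions, not the CodeDelegator implementation: the class and function names are hypothetical, a plain callable stands in for LLM-driven code generation and execution, and failed subtasks are recorded only as a one-line outcome.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubtaskSpec:
    name: str
    instruction: str


@dataclass
class SubtaskResult:
    name: str
    success: bool
    summary: str  # high-level outcome only, never raw traces


@dataclass
class PersistentState:
    """Delegator-side state: specifications and high-level results only."""
    specs: List[SubtaskSpec] = field(default_factory=list)
    results: List[SubtaskResult] = field(default_factory=list)


def run_coder(spec: SubtaskSpec, implement: Callable[[int], object],
              max_attempts: int = 3) -> SubtaskResult:
    """Ephemeral Coder: retries locally and discards its trace on return."""
    trace: List[str] = []  # ephemeral state, dropped when this call returns
    for attempt in range(1, max_attempts + 1):
        try:
            value = implement(attempt)
        except Exception as exc:  # runtime errors stay inside the Coder
            trace.append(f"attempt {attempt}: {exc!r}")
            continue
        return SubtaskResult(spec.name, True,
                             f"ok after {attempt} attempt(s): {value}")
    return SubtaskResult(spec.name, False,
                         f"failed after {max_attempts} attempts")


def delegate(state: PersistentState, spec: SubtaskSpec,
             implement: Callable[[int], object]) -> SubtaskResult:
    """Delegator: records the spec, hands it off, and stores only the result."""
    state.specs.append(spec)
    result = run_coder(spec, implement)
    state.results.append(result)
    return result


def flaky(attempt: int) -> int:
    # Hypothetical implementation step that fails on its first attempt.
    if attempt == 1:
        raise ValueError("low-level parse error")
    return 42


state = PersistentState()
result = delegate(state, SubtaskSpec("parse", "Parse the input file"), flaky)
print(result.summary)  # prints ok after 2 attempt(s): 42
```

The error text from the first attempt exists only in the Coder's local `trace`, so nothing in `state` carries it forward into later planning.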
5. Evaluation and Empirical Performance
Data Standardization Setting
Internal measurements and user studies of CleanAgent in data cleaning demonstrate:
| Method | Avg. lines of code per column | Implementation time (min) |
|---|---|---|
| Manual pandas+regex | 80 | 25 |
| ChatGPT‐prompted code | 40 | 15 |
| CleanAgent | 2 | 5 |
User studies (n=10) report >75% time savings (8/10), high adequacy for one-shot requirement entry (9/10), and <5% code error rate (mostly trivial) (Qi et al., 2024).
Agent Safety and Context Separation
In adversarial code generation settings:
- CleanAgent capability-safe agents statically prevented 100% of information leak attempts, whereas unclassified, string-based approaches allowed model-dependent leakage (protection rates: Sonnet 98.5%, MiniMax 91.6%).
- Utility rates for task performance remained comparable or improved, e.g., τ²-bench airline (CleanAgent: 45.2% vs. 43.8%), retail (57.0% vs. 53.3%) (Odersky et al., 1 Mar 2026).
In role-separated, multi-agent task decomposition (CodeDelegator on τ²-bench and MCPMark):
| Method / Domain | pass¹ (%) | pass² (%) | pass³ (%) | pass⁴ (%) |
|---|---|---|---|---|
| ReAct (Retail) | 79.6 | 69.9 | 63.4 | 58.8 |
| CodeAct (Retail, single) | 70.2 | 59.0 | 50.0 | 47.0 |
| CodeDelegator (Retail) | 82.0 | 71.2 | 63.4 | 57.0 |
Similar improvements are observed on diverse agent-environment benchmarks (Fei et al., 21 Jan 2026).
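Assuming the pass¹ to pass⁴ columns follow the pass^k convention of τ-bench (the probability that all k independent trials of a task succeed), the metric can be estimated per task from n trials with c successes as C(c, k) / C(n, k) and then averaged over tasks. The sketch below shows that estimator; the function names and the sample counts are illustrative and are not taken from the cited results.

```python
from math import comb
from typing import List, Tuple


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Per-task estimate of the chance that k sampled trials all succeed."""
    if not 1 <= k <= trials:
        raise ValueError("k must satisfy 1 <= k <= trials")
    # comb(successes, k) is 0 when fewer than k trials succeeded.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task: List[Tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in per_task) / len(per_task)


# Illustrative data: one task solved in 4 of 4 trials, one in 2 of 4.
tasks = [(4, 4), (2, 4)]
for k in (1, 2, 3):
    print(k, round(benchmark_pass_hat_k(tasks, k), 4))
# prints
# 1 0.75
# 2 0.5833
# 3 0.5
```

Because every additional trial must also succeed, the estimate cannot increase with k, which matches the left-to-right decline in each row of the table.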
6. Implementation Guidance and Limitations
- Declarative, type-specific APIs drastically minimize the LLM learning surface and reduce failure rates by ≈50% (Qi et al., 2024).
- Role-based decomposition, strict separation of concerns, and schema-driven state handoff are critical for reliability and scalability.
- Out-of-the-box CleanAgent approaches do not address heavily nested/multi-modal fields or extremely wide data tables; batching and custom handler implementation are required in such cases.
- Sequential subtasking is enforced in current multi-agent designs; extensions for parallel/DAG task plans are plausible future directions (Fei et al., 21 Jan 2026).
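For very wide tables, one possible batching scheme is to split the column list into fixed-size groups and run the pipeline once per group. The sketch below assumes the limiting factor is how many columns fit into a single annotation or code-generation request; `batch_columns` and the batch size are hypothetical choices, not part of any cited system.

```python
from typing import Iterator, List


def batch_columns(columns: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size column names."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(columns), batch_size):
        yield columns[start:start + batch_size]


columns = ["admission", "addr", "phone", "ip", "email"]
for group in batch_columns(columns, 2):
    print(group)
# prints
# ['admission', 'addr']
# ['phone', 'ip']
# ['email']
```

Each group can then be processed as an independent sequential subtask, which keeps individual requests small at the cost of more round trips.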
7. Broader Impact and Generalization
CleanAgent principles—explicit schema-driven orchestration, capability tracking for all effectful operations, and the elimination of context pollution via state separation—extend naturally to domains beyond tabular data and code: complex data science ETL pipelines, robotic control systems requiring safe resource access, and scenarios where strategic reasoning must remain orthogonal to detailed execution logs. Any system adhering to strict role separation, structured state management, and tracked “capabilities” can leverage CleanAgent methodologies for improved safety, performance, and scalability (Fei et al., 21 Jan 2026, Odersky et al., 1 Mar 2026).