Code as Agent Harness

Published 18 May 2026 in cs.CL and cs.AI | (2605.18747v1)

Abstract: Recent LLMs have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make the harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

Authors (42)

Summary

The paper's main contribution is the formalization of code as a central substrate for agent reasoning, action, and environment modeling.
It details the interfaces, mechanisms, and multi-agent scaling strategies that support verifiable, stateful, and adaptive agent behavior.
The work introduces methods like program delegation, Plan–Execute–Verify loops, and telemetry-driven harness optimization to enhance agent reliability and safety.

Code as Agent Harness: A Systems-Centric Paradigm for Executable, Verifiable, and Stateful AI Agents

Introduction: The Shift to Code-Centrism in Agentic Architectures

LLMs have advanced beyond treating code generation as an endpoint and increasingly use code as the central substrate for reasoning, action, memory, and environment modeling within agentic frameworks. This survey formalizes the “code as agent harness” paradigm, positioning code not merely as output but as the backbone of agent interfaces, operational mechanisms, and multi-agent coordination.

The analysis distinguishes three layers: the harness interface (how code connects reasoning, acting, and world modeling), harness mechanisms (planning, memory, tool use, feedback control), and scaling the harness (orchestrating multi-agent systems over shared code artifacts). This taxonomy structures the rapidly expanding literature and engineering practice converging on code-centric agent infrastructure.

Figure 1: Taxonomy of code as agent harness, organizing interfaces, mechanisms, and scaling as orthogonal axes.

Harness Interface: Code as Reasoning Substrate, Action Boundary, and World Model

The harness interface layer describes how code mediates between the agent and its environment. The executability, inspectability, and statefulness of code distinguish it from latent language outputs, supporting persistent and verifiable agent trajectories.

1. Code for Reasoning:

The survey presents the transition from latent chain-of-thought to executable, external program-of-thoughts interfaces as a key driver of robust reasoning performance. Methods based on program delegation, hybrid neural-symbolic integration (e.g., Lean/Coq proof-assistant-based agents), and iterative code-grounded reasoning loops use code execution feedback (from interpreters, traces, and formal verification) as supervisory signals for agent refinement. Reinforcement learning from execution feedback and process rewards, in turn, grounds agent policy optimization in verifiable program outcomes.
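
As a rough illustration of program delegation, the sketch below runs a model-written snippet and returns either its output or its error text, which an agent could read on its next turn. The function name `run_program` and the two candidate programs are assumptions made for this example and do not come from the survey. In-process `exec` is used only to keep the sketch short; it is not a sandbox.

```python
import contextlib
import io
import traceback


def run_program(source: str) -> dict:
    """Execute candidate code and capture stdout or the error (no isolation)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {"__name__": "__candidate__"})
    except Exception:
        return {
            "ok": False,
            "stdout": buffer.getvalue().strip(),
            "error": traceback.format_exc(limit=1).strip(),
        }
    return {"ok": True, "stdout": buffer.getvalue().strip(), "error": ""}


# Hypothetical model output for "What is the sum of the squares of 1..10?"
candidate = "print(sum(i * i for i in range(1, 11)))"
result = run_program(candidate)

# A failing candidate produces error text that can serve as feedback.
broken = run_program("print(1 / 0)")
```

Here the arithmetic is delegated to the interpreter rather than reasoned out in natural language, and the `error` field is the kind of execution signal the paragraph above describes.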

2. Code for Acting:

In embodied, GUI, and tool-mediated domains, code manifests as the action interface: policies, controllers, behavior trees, and tool call scripts bridge semantic planning and action execution. The harness synthesizes code artifacts to interface with APIs, symbolic planners, robot controllers, or GUI action spaces, embedding affordance estimation and runtime safety checks into the execution substrate. Persistent code-based skill libraries further support lifelong agentic adaptation and compositional action reuse.
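
A persistent skill library with runtime safety checks might look roughly like the following. The `Skill` and `SkillLibrary` classes, the `move_to` skill, and the workspace bounds are all hypothetical names and values chosen for illustration; the survey does not prescribe this design.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Skill:
    name: str
    run: Callable[..., str]
    precondition: Callable[..., bool]  # runtime safety check, evaluated before run


class SkillLibrary:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def invoke(self, name: str, **kwargs) -> str:
        skill = self._skills[name]
        # Refuse to act when the safety check rejects the arguments.
        if not skill.precondition(**kwargs):
            raise PermissionError(f"safety check failed for {name}: {kwargs}")
        return skill.run(**kwargs)


# Hypothetical robot skill: move a gripper, but only inside a safe workspace.
library = SkillLibrary()
library.register(Skill(
    name="move_to",
    run=lambda x, y: f"moved to ({x}, {y})",
    precondition=lambda x, y: 0 <= x <= 100 and 0 <= y <= 100,
))

ok = library.invoke("move_to", x=10, y=20)
```

Because the check lives in the library rather than in the model's output, every reuse of the skill goes through the same guard.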

3. Code for Environment Modeling:

Agent-environment coupling is deepened through explicit code-based representations of world state. This includes structured simulators, repositories with state-tracking, test-driven evaluation harnesses, execution traces, and programmatic transitions. Such computational artifacts expose explicit, machine-readable state and observable trajectories, supporting auditability, rollback, and multi-agent synchronization.
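
To make explicit state, observable trajectories, and rollback concrete, here is a minimal sketch of a file-based world model. `ProgramWorld` and its methods are illustrative assumptions, not an interface from the survey.

```python
import copy


class ProgramWorld:
    """Toy environment whose state is a dict of file paths to contents."""

    def __init__(self, files):
        self.files = dict(files)
        self.trace = []        # observable trajectory of actions
        self._snapshots = []   # saved states available for rollback

    def snapshot(self) -> int:
        self._snapshots.append(copy.deepcopy(self.files))
        return len(self._snapshots) - 1

    def write(self, path, content):
        self.files[path] = content
        self.trace.append(("write", path))

    def rollback(self, snapshot_id):
        self.files = copy.deepcopy(self._snapshots[snapshot_id])
        self.trace.append(("rollback", snapshot_id))


world = ProgramWorld({"app.py": "VERSION = 1"})
checkpoint = world.snapshot()
world.write("app.py", "VERSION = 2")   # an edit the verifier later rejects
world.rollback(checkpoint)             # restore the last known-good state
```

The trace records the rollback itself, so an auditor can see both the rejected edit and the recovery.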

Figure 2: Overview of code as the harness interface, connecting agents to reasoning, action, and environment modeling.

Figure 3: Chronological roadmap of harness interface methods by role (reasoning, acting, environment modeling).

Harness Mechanisms: Planning, Memory, Tool Use, Feedback, and Engineering

Harness mechanisms formalize the operational protocols that ensure agent persistence, adaptability, and reliability during long-horizon, iterative execution.

Planning:

Agents externalize long-horizon intent through explicit planning structures (stepwise decompositions, repository-grounded plans, dependency graphs). Harness-level planning mechanisms implement trajectory search (Monte Carlo Tree Search, tree-of-code expansions), control commitment strategies, and orchestrated workflow management with stage-wise role allocation.

Figure 4: Planning mechanisms for agent harnesses: decomposition, structure grounding, search, and orchestration.
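
A dependency-graph plan of the kind mentioned above can be sketched with the standard-library `graphlib` module. The step names are hypothetical; the point is only that an explicit graph lets the harness compute which steps are ready and which could run in parallel.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical plan).
plan = {
    "write_tests": {"read_issue"},
    "edit_code": {"read_issue"},
    "run_tests": {"write_tests", "edit_code"},
    "open_pr": {"run_tests"},
}

sorter = TopologicalSorter(plan)
sorter.prepare()  # raises graphlib.CycleError if the plan is circular

stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
    stages.append(ready)
    sorter.done(*ready)

# stages == [["read_issue"], ["edit_code", "write_tests"], ["run_tests"], ["open_pr"]]
```

Search-based planners such as MCTS would explore alternatives over such structures; this sketch covers only decomposition and ordering.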

Memory/Context Engineering:

Scalable agent operation over complex, stateful environments necessitates memory management beyond prompt expansion. The harness encompasses working memory (trajectory and state tracking), semantic memory (repository retrieval, evidence structuring), experiential memory (reflection, trajectory replay), and multi-agent memory (blackboards, state synchronization), complemented by context compression and state offloading protocols.

Figure 5: Memory and context engineering mechanisms—working, semantic, experiential, and long-term memory for agent harnesses.
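
The interplay of working memory and context compression can be sketched as a bounded buffer that summarizes entries as they age out. `WorkingMemory` is a hypothetical class, and the truncation used as "compression" is a stand-in assumption; a real harness would likely use a learned or rule-based summarizer.

```python
from collections import deque


class WorkingMemory:
    def __init__(self, max_recent: int = 3):
        self.max_recent = max_recent
        self.recent = deque()   # full-detail recent steps
        self.summary = []       # compressed older steps

    def add(self, step: str, observation: str) -> None:
        self.recent.append((step, observation))
        while len(self.recent) > self.max_recent:
            old_step, old_obs = self.recent.popleft()
            # Placeholder compression: keep the step name and a short prefix.
            self.summary.append(f"{old_step}: {old_obs[:20]}")

    def render(self) -> str:
        """Build the context text that would be placed in the prompt."""
        lines = ["[summary] " + s for s in self.summary]
        lines += [f"[recent] {s}: {o}" for s, o in self.recent]
        return "\n".join(lines)


memory = WorkingMemory(max_recent=2)
memory.add("read_issue", "Bug: crash on empty input in parser.parse()")
memory.add("run_tests", "1 failed, 12 passed")
memory.add("edit_code", "patched parser.py")
memory.add("run_tests", "13 passed")
context = memory.render()
```

The prompt stays bounded while older steps remain visible in compressed form; full history could be offloaded to disk instead of discarded.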

Tool Use:

Tool-centric agent design exposes function-oriented APIs, environment-manipulation primitives, verification services (static/dynamic analysis), and workflow orchestrators. Harness infrastructure governs tool invocation, permission boundaries, pre/post-execution hooks, and result sanitization, supporting robust audit and observability.

Figure 6: Tool-using mechanisms in agent harnesses, spanning function retrieval, environment interaction, verification, and orchestration.
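
One way to picture harness-governed tool invocation is a gateway that checks permissions, logs every call, and sanitizes results. `ToolGateway`, the permission strings, and the two toy tools are assumptions for this sketch, not an API from the survey or from any real framework.

```python
from typing import Callable, Dict, List, Set, Tuple


class ToolGateway:
    def __init__(self, granted: Set[str], max_chars: int = 200):
        self.granted = granted
        self.max_chars = max_chars
        self.tools: Dict[str, Tuple[Callable[..., object], str]] = {}
        self.audit_log: List[dict] = []

    def register(self, name: str, fn: Callable[..., object], permission: str) -> None:
        self.tools[name] = (fn, permission)

    def call(self, name: str, *args) -> str:
        fn, permission = self.tools[name]
        allowed = permission in self.granted
        # Pre-execution hook: record the attempt before anything runs.
        self.audit_log.append({"tool": name, "args": args, "allowed": allowed})
        if not allowed:
            raise PermissionError(f"{name} requires permission {permission!r}")
        raw = str(fn(*args))
        # Post-execution hook: drop non-printable characters and truncate.
        clean = "".join(ch for ch in raw if ch.isprintable())[: self.max_chars]
        self.audit_log[-1]["result_chars"] = len(clean)
        return clean


gateway = ToolGateway(granted={"fs.read"})
gateway.register("read_file", lambda path: f"contents of {path}\x00", "fs.read")
gateway.register("delete_file", lambda path: "deleted", "fs.write")

output = gateway.call("read_file", "notes.txt")
try:
    gateway.call("delete_file", "notes.txt")
except PermissionError as err:
    denied = str(err)
```

Denied calls still appear in the audit log, which is what makes the permission boundary observable after the fact.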

Plan–Execute–Verify (PEV) Control Loop:

Code agent reliability emerges from a PEV loop: plans externalize contracts; execution occurs in sandboxed, permissioned environments; and verification leverages deterministic sensors (tests, static analysis, runtime monitors). This loop defines repair, escalation, termination, and rollback, aligning with engineering best practices in industrial harnesses.

Figure 7: PEV loop control within the code agent harness—integrating planning, execution, and verification.

Agentic Harness Engineering:

The survey treats harness engineering as a first-class optimization discipline. Deep telemetry (decision traces, cost and logging data, error attribution) feeds evolution agents that adapt, evaluate, and mutate harness components (prompts, memory policies, tool interfaces, verification instruments) while keeping operation regression-free and auditable.

Figure 8: Harness engineering for adaptive, telemetry-driven harness optimization and self-revision.
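
A regression gate for harness mutations could be sketched as follows: a candidate configuration is accepted only if it fixes at least one case and breaks none that the baseline passed. The configuration keys, thresholds, and case names are invented for this example.

```python
def evaluate(config: dict, suite: dict) -> dict:
    """Run every regression case against a harness configuration."""
    return {name: check(config) for name, check in suite.items()}


def accept_mutation(baseline: dict, candidate: dict, suite: dict) -> dict:
    before = evaluate(baseline, suite)
    after = evaluate(candidate, suite)
    regressions = [n for n in suite if before[n] and not after[n]]
    gained = [n for n in suite if after[n] and not before[n]]
    return {
        "accept": not regressions and bool(gained),
        "regressions": regressions,
        "gained": gained,
    }


# Hypothetical regression suite over two harness settings.
suite = {
    "long_repo_task": lambda c: c["context_tokens"] >= 8000,
    "flaky_tool": lambda c: c["retries"] >= 2,
    "cheap_run": lambda c: c["context_tokens"] <= 16000,
}
baseline = {"context_tokens": 4000, "retries": 2}

good = accept_mutation(baseline, {"context_tokens": 8000, "retries": 2}, suite)
bad = accept_mutation(baseline, {"context_tokens": 32000, "retries": 2}, suite)
```

The second candidate also fixes the long-repository case but is rejected because it breaks the cost case, which is the behavior a regression-free improvement rule is meant to enforce.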

Scaling the Harness: Multi-Agent Orchestration and Shared Substrates

As agentic tasks surpass single-agent capabilities, harness scaling via multi-agent orchestration becomes essential. Role-specialized agents (planning, coding, verification, execution, review) communicate and coordinate over persistent, shared code substrates—including repositories, blackboards, execution logs, and formalized shared states.

Figure 9: Scaling the agent harness through multi-agent orchestration—modular substrate, specialization, and adaptive topologies.

Topological patterns span fixed chains, hierarchies, DAGs, and adaptively evolving collaboration structures. Convergence criteria range from test-gated correctness to consensus, performance, and implicit criteria, with execution feedback, agent synchronization, and transactional state management as persistent open challenges.

Figure 10: Roadmap for multi-agent orchestration—collaboration workflows, repository state, verification, and adaptive coordination.

Shared program state representation is highlighted as a critical bottleneck: despite advances in file/repository-based, execution-based, and blackboard architectures, implicit and partial state representation remains a source of system brittleness, limiting coordination, state consistency, and scalable agent deployment.
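
One common way to make shared state explicit is optimistic concurrency control, where each commit names the version it was based on. The sketch below uses that technique as an illustration; it is not a mechanism taken from the survey, and `SharedState` and `VersionConflict` are hypothetical names.

```python
import threading


class VersionConflict(Exception):
    """Raised when a commit is based on a stale read."""


class SharedState:
    def __init__(self, value: dict):
        self._lock = threading.Lock()
        self._value = value
        self._version = 0

    def read(self):
        with self._lock:
            return self._version, self._value

    def commit(self, base_version: int, new_value: dict) -> int:
        with self._lock:
            if base_version != self._version:
                raise VersionConflict(
                    f"based on v{base_version}, current is v{self._version}")
            self._value = new_value
            self._version += 1
            return self._version


state = SharedState({"api.py": "v1"})
version_a, files_a = state.read()   # agent A reads
version_b, files_b = state.read()   # agent B reads the same version

state.commit(version_a, {**files_a, "api.py": "v2 by agent A"})
try:
    state.commit(version_b, {**files_b, "api.py": "v2 by agent B"})
    conflict = False
except VersionConflict:
    conflict = True
    # Agent B re-reads, reconciles, and retries on the fresh version.
    version_b, files_b = state.read()
    state.commit(version_b, {**files_b, "tests.py": "added by agent B"})
```

The stale write is rejected instead of silently overwriting agent A's change. Detecting semantic conflicts, where two edits touch different files but break each other, would need more than a version number.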

Application Domains and Emerging Directions

The code-as-harness paradigm enables concrete advances across diverse domains:

Coding Assistants: Repository-focused agents integrating memory, execution feedback, and workflow reasoning. Production systems increasingly operationalize harness loops, context management, and verifiable execution as core differentiators.
GUI/OS Agents: Program-world abstraction for GUI/Web/OS environments, treating rendered and action spaces as structured, executable artifacts, thus supporting scalable benchmarking, closed feedback loops, and production deployment.
Embodied Agents: Hierarchical harnessing of LLMs for semantic planning with code-based skill invocation, integrated simulation, and verifiable physical control, emphasizing lifelong skill reuse and governance under physical constraints.
Scientific Discovery Agents: Program-as-workflow for formalizing the scientific method, hypothesis development, experimental orchestration, and memory as reproducible program state in closed generate–execute–feedback loops.
Personalization and Recommendation: Preference state modeled as inspectable code artifacts, robust feedback loops for policy adaptation, and explicit governance over long-term user modeling and privacy concerns.

Figure 11: Code as agent harness applications across five domains: coding assistants, GUI/OS agents, scientific discovery, personalization, and embodied agents.

Open Challenges and Research Directions

Despite the formal advances of the code-as-harness paradigm, several open problems constrain agent reliability, adaptability, and real-world deployability:

Harness-level Evaluation and Oracle Adequacy: Reliance on end-task accuracy conflates model, harness, and environment. There is a need for metrics probing efficiency, auditability, verifier strength, recovery, state consistency, and safety compliance.
Semantic Verification Beyond Executable Feedback: Execution feedback is only as reliable as the attached oracle. Multi-level verification stacks (tests, static analysis, human review) with explicit contract scope and uncertainty must be standardized.
Self-Evolving Harnesses with Regression Control: Autonomous harness mutation requires rigorous regression-testing, change contracts, rollback, and telemetry standards to minimize risk and instability.
Transactional and Consistent Shared Program State: Coordination in multi-agent systems necessitates transactional semantics over program state, semantic conflict detection, and reconciliation beyond file-level synchronization.
Human-in-the-Loop Safety and Governance: Permissioned execution, accountability, auditable approval gates, and safety invariants must be first-class harness entities, especially in high-impact settings.
Multimodal Harness Engineering: Compression, action grounding, verification, and memory for visual and physical environments require harness abstractions beyond text-centric state representations.
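
To illustrate what harness-level metrics beyond end-task accuracy might look like, the sketch below aggregates a few of them from run traces. The trace fields and the four metric definitions are assumptions chosen for this example; the survey calls for such metrics but does not define them this way.

```python
def harness_metrics(traces: list) -> dict:
    """Aggregate process-level metrics from per-run trace summaries."""
    runs = len(traces)
    failed = sum(t["failed_steps"] for t in traces)
    recovered = sum(t["recovered_steps"] for t in traces)
    return {
        "success_rate": sum(t["success"] for t in traces) / runs,
        "mean_steps": sum(t["steps"] for t in traces) / runs,
        # Share of failed steps the harness repaired without escalation.
        "recovery_rate": recovered / failed if failed else 1.0,
        # Share of runs with at least one risky action lacking approval.
        "safety_violation_rate":
            sum(t["unapproved_risky_actions"] > 0 for t in traces) / runs,
    }


# Hypothetical traces from four runs of the same harness.
traces = [
    {"success": True, "steps": 4, "failed_steps": 0, "recovered_steps": 0,
     "unapproved_risky_actions": 0},
    {"success": True, "steps": 6, "failed_steps": 2, "recovered_steps": 2,
     "unapproved_risky_actions": 0},
    {"success": False, "steps": 10, "failed_steps": 3, "recovered_steps": 1,
     "unapproved_risky_actions": 1},
    {"success": True, "steps": 4, "failed_steps": 0, "recovered_steps": 0,
     "unapproved_risky_actions": 0},
]
metrics = harness_metrics(traces)
```

Two harnesses with the same success rate can differ sharply on recovery and safety, which is the distinction end-task accuracy hides.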

Conclusion

This survey formalizes and systematizes the code as agent harness architecture, positioning code as the unifying substrate for agentic AI across reasoning, action, state, verification, and coordination. By dissecting interfaces, mechanisms, and scaling patterns, the analysis exposes both the power and limitations of code-centric design. Code as harness is presented not only as a research lens but as an emerging engineering discipline for building reliable, auditable, and stateful AI agents that operate in complex, long-horizon, real-world environments. Persistent challenges remain in verification, safety, harness evaluation, and multi-agent consistency, and these define the agenda for future work in harness engineering and AI autonomy (2605.18747).

Explain it Like I'm 14

What is this paper about?

This paper is a big-picture guide to a new way of building AI “agents” (computer programs that can plan and act). The authors say that code shouldn’t just be something an AI writes at the end of a task. Instead, code should be the main “harness” that connects the AI’s thoughts to real actions, keeps track of what’s going on, and lets us check if the AI is doing the right thing. In short: use code as the AI’s seatbelt, control panel, and notebook—all at once.

What questions does it ask?

The paper looks at simple, practical questions:

How can we make AI agents not just “talk” about plans, but actually do things safely and correctly?
What’s the best way to let an AI think step-by-step, act in the world, and remember progress?
How do we organize all this so one agent or even teams of agents can work together without getting confused?

How do the authors study it?

This is a survey paper. That means the authors read lots of recent research and organize it into a clear map. They explain a three-layer “harness” that puts code at the center:

Layer 1: Interface — code as the way an agent thinks, acts, and understands its environment.
Layer 2: Mechanisms — planning, memory, tools, and feedback loops that keep the agent reliable over time.
Layer 3: Scaling — how multiple agents share code, review each other, and coordinate.

They also connect this map to real applications (like coding helpers, robots, and operating-system automation) and list open problems that still need solving.

What are the main ideas and why do they matter?

Layer 1: Code as the interface for thinking, acting, and seeing the world

Thinking (Reasoning): Instead of only writing out long explanations in plain language, the agent writes small programs to do the tricky parts (like math). Because the programs run, we can check the results. “Executable” means you can press “run.” “Verifiable” means we can test if it worked.
Acting: The agent turns its plans into code that calls tools, clicks buttons, controls robots, or uses apps. This makes actions precise and checkable. If a step fails, the harness can notice and ask the agent to fix it.
Seeing the world (Environment modeling): The agent treats the world like a set of code objects—files, tests, logs, simulations—so it can store what happened, replay it, and compare outcomes. “Stateful” means the agent remembers its history instead of starting from zero each time.

Why it matters: Code makes the agent’s thoughts and actions concrete and testable. That means fewer mistakes from guesswork and better safety.

Layer 2: The mechanisms that keep agents dependable

To keep going on long tasks, an agent needs structure. The paper highlights four helpful pieces:

Planning: Breaks a big goal into smaller steps and decides the order.
Memory: Keeps notes, past attempts, useful snippets, and results so the agent doesn’t repeat mistakes.
Tool use: Connects the agent to calculators, databases, APIs, apps, robots, and test suites through code.
Feedback and optimization: Runs the code, reads errors or test failures, and uses that feedback to fix the plan or rewrite code.

Why it matters: This turns an AI from a one-shot answer machine into a steady worker that improves through trial, error, and repair.

Layer 3: Scaling up to teams of agents

The paper explains how multiple agents can share the same code “workspace”:

Different roles (like manager, coder, reviewer, tester) can work together.
Shared artifacts (like tests, logs, and pull requests) help everyone stay on the same page.
Coordinated workflows prevent chaos and let agents check each other’s work.

Why it matters: Complex jobs often need teamwork. Shared code and tests give a common language for collaboration and quality control.

Where can this be used?

Here are a few places where this approach is already helpful:

Coding assistants: Agents write fixes and tests in real software projects.
GUI/OS automation: Agents control apps and operating systems by generating and running commands safely.
Robots and embodied agents: Agents write control code and skill libraries to move, pick, build, and adapt.
Science and data work: Agents build and run analysis pipelines, simulations, and experiments.
Enterprise workflows: Agents coordinate tasks like DevOps, data pipelines, and recommendations with verifiable steps.

What are the main takeaways?

Big idea: Treat code as the core harness for AI agents. Code is how they think (compute), act (execute), and remember (state).
Organized roadmap: A three-layer framework (interface, mechanisms, scaling) that ties together lots of recent methods.
Practical focus: Many examples across real-world tasks.
Open challenges: The authors list important problems that still need work:
- Better evaluation than just “final score” (e.g., check each step, not only the end).
- Verification even when feedback is incomplete or delayed.
- Improving the harness without breaking what already works (“no regressions”).
- Keeping shared state consistent when many agents edit the same workspace.
- Human oversight for risky actions.
- Handling multimodal environments (text, images, code, sensors) together.

Why does this matter for the future?

Putting code at the center makes AI agents more trustworthy. Because their plans and actions are executable and testable, we can catch errors early, track progress, and enforce safety rules. It also makes teamwork—between agents and with humans—much easier, since everyone can look at the same code, tests, and logs.

If these ideas grow, we could see agents that:

Are safer and more reliable in the real world.
Learn from experience without forgetting.
Work together smoothly on complex projects.
Are easier to audit and control.

In simple terms: this paper lays out a plan for building AI helpers that don’t just talk about what they’ll do—they write it down in code, run it, check it, remember it, and improve next time.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated, actionable list of what remains missing, uncertain, or unexplored in the paper’s proposed “code as agent harness” agenda.

Formalization of the harness concept: precise semantics, minimal primitives, and interface contracts for “code as harness” (e.g., types, invariants, pre/postconditions) that yield safety/liveness guarantees across generate–execute–revise loops.
Process-level evaluation beyond final success: standardized, domain-agnostic metrics for intermediate-state correctness, trace quality, recovery from failure, retry budgets, sample efficiency, time-to-fix, and stability over long-running sessions.
Verification under incomplete, noisy, or delayed feedback: principled ways to combine static analysis, unit/property tests, runtime monitors, and probabilistic conformance checks when oracles are partial or absent.
Regression-free harness optimization: methods for safe online improvement (e.g., RL, bandits, BO) with canarying, shadow execution, counterexample generation, and proofs/CI guarantees that critical behaviors do not regress.
Reproducible, stateful evaluation at scale: benchmarks and harnesses that track persistent state across hours/days of operation, measure drift, and reproduce long-horizon trajectories under tool and environment non-determinism.
Shared-state consistency in multi-agent settings: concurrency control, merge policies, CRDT/transactional semantics for code artifacts, provenance tracking, and conflict resolution strategies that preserve correctness under parallel edits.
Trust, roles, and isolation among agents: formal models and systems for capability scoping, sandboxing, data isolation, and cross-agent trust/review when many agents share code, tools, and memory.
Human oversight for safety-critical actions: actionable designs for permissioning, escalation, plan explainability, pre-execution validation, rollback, and HCI workflows that keep humans in control without prohibitive latency.
Multimodal extensions: unified interfaces for incorporating continuous perception streams (vision/audio/sensors) into code-centric harnesses, with real-time constraints and deterministic replay for debugging.
Security and adversarial robustness of code-centric agents: defenses against prompt/tool injection, API misuse, sandbox escapes, dependency supply-chain attacks, data exfiltration, and compromised tool backends; artifact signing and reproducible builds.
Standardized tool and DSL ecosystems: typed schemas, ABI/SDKs, and versioned protocols for tools, validators, tests, and workflows to ensure portability of harness modules across platforms.
Dataset gaps for harness research: open, privacy-preserving corpora of execution traces, tool I/O, failure/recovery episodes, and multi-agent coordination logs with consistent schemas for training and evaluation.
Resource- and cost-aware harness control: methods that trade off latency, compute cost, and correctness (e.g., caching, incremental execution, speculative planning, budgeted search) with formal performance guarantees.
Theoretical foundations of closed-loop stability: models of error propagation, convergence conditions for iterative repair, bounded regret/safety under partial observability, and off-policy evaluation for harness updates.
Process-reward design for code-grounded RL: general recipes for defining, learning, and validating process rewards that avoid reward hacking and preserve long-horizon credit assignment.
Generalization and transfer of executable skills: mechanisms to package, parameterize, and port skills/tests/workflows across tasks, environments, and embodiments, with criteria for when reuse helps or harms.
Bridging formal methods with practical environments: scalable pathways to integrate proof assistants and formal verification into OS/GUI/robotics workflows where specifications are partial and dynamics are stochastic.
Handling API churn and “skill rot”: automated detection of breaking changes, migration of generated skills/workflows, and test-suite evolution to keep long-lived harnesses healthy.
Debuggability and observability of agent harnesses: standardized telemetry, structured logs, causal tracing, and postmortem tooling for root-cause analysis across planning, memory, tools, and execution layers.
Robustness to non-determinism: techniques for determinization, seeding, replay, and statistical comparison when tools, simulators, or environments exhibit variability.
Multi-agent benchmarking: tasks and metrics that quantify coordination quality, shared-state consistency, reviewer/tester effectiveness, adversarial resilience, and productivity vs. overhead under different orchestration topologies.
Governance, compliance, and audit: policy-as-code for permissions, redaction, data retention, secret management, accountability trails, and regulatory reporting integrated into the harness.
Human factors and developer experience: studies and tooling for oversight burden, cognitive load, explainability, and effective UX for authoring/reviewing harness policies, tests, and agent-generated artifacts.
Cross-lingual and multilingual harnesses: consistency of tool schemas, error messages, and code artifacts across languages/locales, and the impact on reasoning, acting, and evaluation.
Open-source reference harnesses: end-to-end, well-instrumented baselines with pluggable components, ablations, and recipes that enable controlled comparisons and rapid reproducibility across domains.
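
The first item in the list mentions interface contracts with pre/postconditions. As a small illustration of what such a contract could look like at the tool boundary, here is a decorator-based sketch. The `contract` decorator, the `write_file` tool, and the `workspace/` path rule are hypothetical, and runtime checks like these are far weaker than the formal guarantees that item asks for.

```python
import functools


def contract(pre, post):
    """Wrap a tool so its precondition and postcondition are checked at runtime."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise ValueError(f"precondition failed for {fn.__name__}")
            result = fn(*args, **kwargs)
            if not post(result):
                raise AssertionError(f"postcondition failed for {fn.__name__}")
            return result
        return wrapper
    return decorator


@contract(pre=lambda path, text: path.startswith("workspace/"),
          post=lambda written: written >= 0)
def write_file(path: str, text: str) -> int:
    # Hypothetical tool: pretends to write and reports characters written.
    return len(text)


written = write_file("workspace/notes.txt", "hi")
try:
    write_file("/etc/passwd", "x")
    rejected = False
except ValueError:
    rejected = True
```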

Practical Applications

Below are practical, real-world applications that follow from the paper’s “code as agent harness” perspective. Each item summarizes what the application does, the most relevant sectors, likely tools/products/workflows that could emerge, and key assumptions or dependencies that affect feasibility.

Immediate Applications

Coding copilots that write, run, and verify code changes (Software/DevOps)
- What: Repository-level agents that propose patches, generate tests, execute them in sandboxes/CI, and open PRs with artifacts (patch + tests + logs).
- Tools/workflows: IDE plugins with execution sandboxes; “harness-aware” CI runners; PR bots that attach execution traces and regression tests; policy-as-code guardrails for permissions.
- Assumptions/dependencies: Good unit/integration test coverage; safe sandboxing; permission boundaries; robust telemetry; model competency for target stack.
Autonomous bug triage and issue resolution (Software/DevOps, Enterprise IT)
- What: Agents link issues to code, reproduce failures in a sandbox, author fixes and tests, and iterate via execution feedback until tests pass.
- Tools/workflows: Issue-to-PR pipelines; defect reproducer harness; failure-minimization playbooks; multi-agent roles (planner/coder/tester/reviewer).
- Assumptions/dependencies: Deterministic repro harnesses; reliable error logs; test oracles; human approval checkpoints for risky repos.
GUI and OS automation with verification loops (RPA, Back-office Ops)
- What: Desktop/web agents generate executable UI/OS commands grounded in DOM/A11y trees; actions are verified by render/state checks before committing.
- Tools/workflows: Browser automation with DOM diffs; “preflight” validators; permission brokers; audit logs of actions plus screenshots/state deltas.
- Assumptions/dependencies: Stable selectors/DOM; app-specific APIs; least-privilege permission model; safe rollback capability.
Data/ETL pipeline authoring and maintenance with tests (Data Engineering, Finance, Healthcare IT)
- What: Agents synthesize and maintain ETL code, schema validations, and unit tests; run pipelines in staging, verify metrics, and promote changes.
- Tools/workflows: Declarative pipeline DSLs; data quality test suites; lineage-aware execution logs; promotion gates tied to test thresholds.
- Assumptions/dependencies: Clear data contracts; staging environments; monitoring for drift; access control and PII governance.
Reproducible scientific workflows and notebooks (Academia, R&D, Pharma)
- What: Agents compose hypothesis-testing pipelines (simulation, analysis, visualization) as code; execute, capture traces, and package reproducible notebooks.
- Tools/workflows: Workflow engines (e.g., Nextflow/Snakemake-style harnesses); containerized environments; auto-generated unit tests for analyses.
- Assumptions/dependencies: Accessible datasets/simulators; compute budgets; clear evaluation oracles; human oversight for scientific validity.
Education: executable tutoring and assessment (Education, Workforce Upskilling)
- What: Tutors generate code exercises and tests, provide execution-grounded feedback, and adapt curricula using results from sandbox runs.
- Tools/workflows: Auto-graders; trace-based hints; versioned student skill libraries; “explain-then-run” pedagogy.
- Assumptions/dependencies: Safe sandboxing; calibrated hints; curriculum alignment; anti-plagiarism/academic integrity policies.
Runbook and SRE automation with guardrails (IT Ops, Cloud)
- What: Agents encode runbooks as executable workflows, simulate changes, and execute with approval gates; include rollback scripts and postmortem logs.
- Tools/workflows: IaC/PaC (“policy-as-code”) validators; change windows; blast-radius estimators; progressive rollout orchestrators.
- Assumptions/dependencies: Strong observability; staging/canary infra; explicit approval thresholds; incident response integration.
Regulatory and compliance checks as code (Finance, Healthcare, Public Sector)
- What: Agents translate policies into machine-checkable rules/tests; run them on data/processes and produce auditable evidence.
- Tools/workflows: Policy-as-code repositories; evidence bundling (rules + logs + data snapshots); change control for rule updates.
- Assumptions/dependencies: Unambiguous policy translation; structured access to systems; regulator-accepted audit trails; privacy safeguards.
Personal RPA: end-user task automation with safe previews (Daily Life, SMBs)
- What: Agents automate email triage, spreadsheet ops, billing, and web forms; actions previewed, verified, then executed.
- Tools/workflows: “Dry-run” mode; per-app permissions; explain-why prompts; one-click revert.
- Assumptions/dependencies: Stable app APIs/DOM; user consent and clear UI affordances; local sandboxing for sensitive data.
Multi-agent code review and governance (Software, Enterprise Platforms)
- What: Manager/planner/coder/tester/reviewer agents coordinate over shared repositories and tests to raise code quality and reduce regressions.
- Tools/workflows: Role-based workflows; debate/red teaming for risky code; automated gatekeepers for test and security thresholds.
- Assumptions/dependencies: Orchestrator reliability; consistent shared state; clear escalation to humans for tie-breakers.
Domain tool-use via standardized harness interfaces (Cross-industry)
- What: Agents call structured APIs (databases, CRMs, ERPs) defined by schemas and validators; execution grounded in code-defined tools.
- Tools/workflows: Tool registries with contracts; schema-evolving adapters; telemetry on tool outcomes; fallback/repair loops.
- Assumptions/dependencies: Accurate API schemas; backward-compatible tool changes; robust error handling; access controls.
Test-first feature prototyping (Product Engineering)
- What: Agents generate tests that capture acceptance criteria before writing code; iterate until tests pass and produce PR bundles.
- Tools/workflows: Spec-to-test generation; ephemeral preview envs; trace-based review summaries; auto-link to product tickets.
- Assumptions/dependencies: Clear specs/user stories; reliable test oracles; human product owner sign-off.
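
Several of the applications above rely on policy-as-code guardrails and approval gates. A minimal sketch of that idea is an ordered rule list evaluated before any action runs, with a deny-by-default fallback. The `Action` type, the rules, and the `prod/` prefix are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str     # e.g. "read", "write", "delete"
    target: str   # resource the action touches


# Ordered rules: the first match decides (hypothetical policy).
POLICY = [
    (lambda a: a.kind == "delete", "needs_approval"),
    (lambda a: a.target.startswith("prod/"), "needs_approval"),
    (lambda a: a.kind == "read", "allow"),
]


def decide(action: Action, default: str = "deny") -> str:
    """Return the verdict of the first matching rule, or the default."""
    for matches, verdict in POLICY:
        if matches(action):
            return verdict
    return default


staging_read = decide(Action("read", "staging/orders"))
prod_read = decide(Action("read", "prod/orders"))
staging_write = decide(Action("write", "staging/orders"))
```

Keeping the rules in a versioned repository means a policy change goes through the same review and audit trail as a code change.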

Long-Term Applications

Autonomously evolving software systems (Software/DevOps)
- What: Agents that plan, implement, test, and refactor large repositories continuously with regression-free improvement guarantees.
- Tools/workflows: Global dependency analyzers; impact-aware change planners; large-scale, synthesis-driven test generation; formal contracts for modules.
- Assumptions/dependencies: High test coverage or formal specs; scalable codebase understanding; robust rollback and canary deployments.
General-purpose home/industrial robots programmed via code harness (Robotics, Manufacturing, Smart Homes)
- What: Agents generate and verify robot policies as code (behavior trees/constraints), adapt via execution feedback, and share reusable skills across devices.
- Tools/workflows: Safety-certified motion planners; affordance/model-checking layers; standardized skill libraries; fleet-level telemetry and replay.
- Assumptions/dependencies: Reliable perception; strong safety and certification frameworks; standardized hardware abstractions; low-latency control.
Multi-robot swarms coordinated by shared code artifacts (Robotics, Logistics, Agriculture)
- What: Swarm policies synthesized from declarative goals; coordination through shared state and verifiable protocols.
- Tools/workflows: Distributed execution harnesses; consensus/coordination DSLs; simulation-to-real transfer validators.
- Assumptions/dependencies: Robust wireless/edge infra; formal safety constraints; resilient shared-state consistency.
Autonomous laboratory systems for closed-loop discovery (Pharma, Materials, Bio)
- What: Agents design experiments, control instruments, analyze results, and refine hypotheses through executable lab protocols.
- Tools/workflows: Robot lab orchestration; protocol/version control; causal/active learning loops; provenance tracking for every step.
- Assumptions/dependencies: Standardized instrument APIs; safety/ethics approvals; strong physical-world verification and redundancy.
Clinically integrated, verifiable decision support (Healthcare)
- What: Agents generate reproducible analysis pipelines and care-pathway simulations with formal checks and human-in-the-loop gating.
- Tools/workflows: Verified clinical DSLs; evidence bundles for auditors; EHR-integrated preview/approval; post-deployment monitoring.
- Assumptions/dependencies: Regulatory clearance; bias and safety evaluation; strict privacy; alignment with clinical guidelines.
Enterprise digital twins and process optimization via executable world models (Manufacturing, Energy, Supply Chain)
- What: Agents construct and refine code-based simulators of plants/grids, test interventions, and deploy control policies with guardrails.
- Tools/workflows: Model calibration pipelines; scenario test batteries; control-theory constraints as code; audit trails for changes.
- Assumptions/dependencies: High-fidelity models; sensor data integration; fail-safes; operator oversight.
Formal verification of agent workflows at scale (Cross-industry, Safety-critical)
- What: Use proof assistants and symbolic checkers to certify agent plans, constraints, and end-to-end workflows before execution.
- Tools/workflows: Domain-specific formal DSLs; auto-proof search; human-readable proof summaries; certification pipelines.
- Assumptions/dependencies: Mature formal methods tooling for domains; cost-performance tradeoffs; skilled oversight.
Policy-as-code for AI agents with continuous compliance (Public Sector, Finance)
- What: Machine-checkable rules that constrain agent actions (data access, export, safety), monitored and enforced in real time.
- Tools/workflows: Runtime policy engines; effect audits with execution traces; regulator portals for live conformance dashboards.
- Assumptions/dependencies: Clear, unambiguous regulations; standardized attestations; verifiable logs accepted by regulators.
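A runtime policy engine in this spirit could be sketched as below: rules are plain predicates over a proposed action, and every decision is appended to an audit log. The `ActionRequest` fields and the two rules are hypothetical examples, not rules taken from any regulation or from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ActionRequest:
    """Hypothetical description of an action an agent wants to take."""
    agent: str
    kind: str          # e.g. "read", "export", "delete"
    resource: str
    contains_pii: bool = False


# Each rule is (rule id, predicate that is True when the rule is violated).
Rule = Tuple[str, Callable[[ActionRequest], bool]]

RULES: List[Rule] = [
    ("no-pii-export", lambda a: a.kind == "export" and a.contains_pii),
    ("no-prod-delete", lambda a: a.kind == "delete" and a.resource.startswith("prod/")),
]


class PolicyEngine:
    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        self.audit_log: List[Dict[str, object]] = []

    def check(self, action: ActionRequest) -> bool:
        """Allow the action only if no rule is violated; log every decision."""
        violated = [rule_id for rule_id, is_violation in self.rules
                    if is_violation(action)]
        allowed = not violated
        self.audit_log.append({
            "agent": action.agent,
            "kind": action.kind,
            "resource": action.resource,
            "allowed": allowed,
            "violated": violated,
        })
        return allowed
```

Logging allowed actions as well as denied ones is a deliberate choice in this sketch, since an audit trail that only records denials cannot show what the agent actually did.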
Secure automation for critical infrastructure (Energy, SCADA, Transportation)
- What: Agents propose and simulate control changes in digital twins, pass formal safety checks, and execute through gated controllers.
- Tools/workflows: DSLs with safety invariants; intrusion-resistant harnesses; staged deployment sequences; kill-switch governance.
- Assumptions/dependencies: Cybersecurity hardening; operator-in-the-loop; exhaustive scenario testing; liability frameworks.
Multimodal agent harnesses operating across language, vision, and action (Robotics, UI, AR/VR)
- What: Agents that fuse code-grounded reasoning with visual/perceptual state and act across heterogeneous interfaces with consistent shared state.
- Tools/workflows: Multimodal trace stores (text, code, images, sensor logs); cross-modal validators; unified memory for agents.
- Assumptions/dependencies: Consistent state synchronization; robust perception; scalable storage/indexing for multimodal traces.
Market-facing, audited algorithmic services (Finance, Insurance, Marketplaces)
- What: Agents that auto-build pricing/risk models, run backtests, generate compliance reports, and publish “executable notebooks” as audit evidence.
- Tools/workflows: Data lineage and versioning; risk-policy rule engines; reproducible market simulators; oversight-committee workflows.
- Assumptions/dependencies: Data quality; model risk management; real-time monitoring; legal accountability structures.
Organization-scale orchestration of heterogeneous agents (Enterprise Platforms)
- What: Companies run ecosystems of specialized agents (planner/coder/analyst/compliance) coordinating through shared code artifacts and state.
- Tools/workflows: Agent OS with role-based orchestration; shared-state registries; governance policies and SLAs; cross-agent regression tests.
- Assumptions/dependencies: Interop standards; consistent shared state; robust conflict resolution; clear escalation paths.

Notes on feasibility across applications:

Common dependencies: high-quality sandboxes; test oracles; permission and identity management; execution telemetry; cost-aware compute; privacy/security controls; human oversight for safety-critical actions.
Key risks: incomplete feedback leading to silent failures; regression from agent changes; distributed state consistency across multi-agent systems; evaluation beyond final-task success (process metrics, safety, data use).
Enablers: standard tool schemas; policy-as-code; formal verification where critical; trace stores for process rewards and audits; role-based multi-agent workflows.

Glossary

Accessibility APIs: Operating-system interfaces exposing UI elements and actions for automation and assistive access. "DOM trees, accessibility APIs, and executable evaluators"
Affordance: The set of feasible actions an environment allows; models estimate which actions are possible for an agent. "affordance or feasibility models estimate which actions are possible"
Agent harness: The software layer that connects an LLM to tools, execution, state, and feedback, turning it into a functional agent. "An agent harness refers to the software layer that surrounds an LLM with tools, APIs, sandboxes, memory, validators, permission boundaries, execution loops, and feedback channels"
Agent-initiated code artifacts: Executable code objects created and evolved by agents during tasks to reason, act, verify, and store state. "agent-initiated code artifacts, which remain relatively underexplored, are interactive code objects that agents create, execute, observe, revise, persist, and share within the task execution loop."
Agentic AI: An AI paradigm framing systems as autonomous agents that act, learn, and adapt within harnesses. "By centering code as the harness of agentic AI"
Agentic systems: Systems where models operate as agents interacting with environments via reasoning and actions. "In emerging agentic systems, code is no longer only a target output."
API schema: A machine-checkable specification of API endpoints, parameters, and data types used to validate and invoke tools. "API schemas, tool definitions, tests, repositories"
Behavior tree: A hierarchical control structure representing agent behaviors with composable nodes and clear execution semantics. "serve as policies, tool calls, behavior trees, or reusable skills"
Chain-of-thought (CoT): A prompting and reasoning technique that elicits step-by-step intermediate reasoning in text. "pure chain-of-thought (CoT)"
Closed-loop: A feedback-driven control setup where execution outcomes inform subsequent decisions and revisions. "closed-loop agentic behavior."
Code as agent harness: The central view that code is the executable, inspectable, and stateful substrate organizing agent reasoning, action, and verification. "we refer to this view as code as agent harness"
Control-flow structures: Program constructs (e.g., conditionals, loops) that determine the order of execution and branching logic. "execution traces, variable states, control-flow structures, and function-level tests"
Domain-specific language (DSL): A specialized programming language tailored to a particular domain or task, enabling concise, structured control. "DSL programs, executable workflows, reusable skills"
DOM tree: The Document Object Model hierarchical representation of a webpage or UI used to ground GUI actions. "DOM trees, accessibility APIs, and executable evaluators"
Embodied agents: Agents that act in physical or simulated environments through sensors and actuators. "robotic and embodied agents use generated programs as executable policies"
Environment modeling: Representing state, dynamics, and feedback of the task environment via code, simulators, tests, and logs. "Environment Modeling"
Execution-based verification: Checking correctness by actually running code and observing outcomes rather than only inspecting text. "execution-based verification"
Execution loop: The repeated cycle of planning, executing, observing feedback, and revising within an agent harness. "execution loops"
Execution trace: A structured log of program states, outputs, and control flow produced during execution, used for diagnosis and learning. "execution traces"
Formal specification: A mathematically precise, machine-checkable description of desired system behavior. "formal specifications, proof scripts, API schemas"
Formal verification: Use of mathematical proofs and automated checkers to ensure correctness relative to formal specifications. "Formal Verification and Symbolic Reasoning Interfaces"
GUI: Graphical user interface; visual interactive surfaces that agents can operate programmatically. "GUI/OS automation"
Harness engineering: The discipline of designing and building the infrastructure that grounds models in tools, state, and safe execution. "forms the main focus of harness engineering"
Harness interface: The code-centric boundary where an agent’s outputs become executable interactions with the environment. "harness interface"
Harness mechanisms: The planning, memory, tool-use, control, and optimization modules that sustain long-horizon agent operation. "Harness Mechanisms: Planning, Memory, Tool Use, Control, and Optimization"
Permission boundaries: Enforced limits on what actions or resources an agent can access to ensure safety and governance. "permission boundaries"
Process rewards: Intermediate signals that score steps of a reasoning or execution trajectory, not just final outcomes. "execution traces, or process rewards"
Program-aided reasoning: Delegating parts of reasoning to executable programs that are run and checked by interpreters or solvers. "Program-aided reasoning methods externalize intermediate computation into executable code"
Proof assistant: A system supporting machine-checked proof development (e.g., Lean, Coq, Isabelle) used for formal reasoning. "Proof assistants such as Lean, Isabelle, and Coq"
Repository-level: Pertaining to whole software repositories, including multi-file context and project-wide dependencies. "repository-level software engineering"
Sandbox: An isolated execution environment that limits side effects and permissions during code runs. "sandboxes"
SAT/SMT solving: Satisfiability and satisfiability modulo theories solving used as machine-checkable reasoning backends. "SAT/SMT solving"
Static analysis: Program analysis without execution to detect issues, enforce constraints, or guide repair. "use static analysis, runtime errors, tests, and human feedback to revise code"
Symbolic solver: A solver that manipulates symbolic expressions and constraints (e.g., theorem provers, SMT solvers). "symbolic solvers"
Telemetry: Instrumentation data and logs collected during agent execution for monitoring and control. "telemetry"
Tool use: The agent’s invocation of external tools and APIs as part of its action space. "planning, memory, tool use, execution, and repair"
World-modeling: Building executable models of environment dynamics and state transitions to guide planning and control. "World-modeling"
Workflow orchestration: Structuring and scheduling multi-step processes, dependencies, and tool calls for long-horizon tasks. "trajectory search, or workflow orchestration"
