Harness Engineering as Categorical Architecture

Published 12 May 2026 in cs.PL, cs.AI, and math.CT | (2605.12239v1)

Abstract: The agent harness, the system layer comprising prompts, tools, memory, and orchestration logic that surrounds the model, has emerged as the central engineering abstraction for LLM-based agents. Yet harness design remains ad hoc, with no formal theory governing composition, preservation of properties under compilation, or systematic comparison across frameworks. We show that the categorical Architecture triple (G, Know, Phi) from the ArchAgents framework provides exactly this formalization. The four pillars of agent externalization (Memory, Skills, Protocols, Harness Engineering) map onto the triple's components: Memory as coalgebraic state, Skills as operad-composed objects, Protocols as syntactic wiring G, and the full Harness as the Architecture itself. Structural guarantees (integrity gates, quality-based escalation, supported convergence checks) are Know-level certificates whose preservation is structural replay: our compiler checks identity and verifier replay, not output-layer correctness or model behavior. We validate this correspondence with a reference implementation featuring compiler functors targeting Swarms, DeerFlow, Ralph, Scion, and LangGraph: the four configuration compilers preserve three named certificate types by identity or replay, and LangGraph preserves the same certificates through its shared per-stage execution path. The LangGraph compiler creates one node per stage using the same per-stage method as the native runtime, providing LangGraph-native observability without reimplementing harness logic. An end-to-end escalation experiment with real LLM agents confirms that the quality-based escalation control path is model-parametric in this two-model, one-task experiment. The result positions categorical architecture as the formal theory behind harness engineering.

Authors (1)

Bogdan Banu

Summary

The paper presents a categorical formalization of agent harness engineering by mapping the harness abstraction to an Architecture triple that preserves certificate-driven structural guarantees.
It introduces compiler functors that verify property preservation across diverse orchestration systems through automated certificate replay and mechanical evidence.
Empirical evaluations demonstrate model-parametric escalation control and highlight limitations, such as the format-discipline ceiling in 8B-class models.

Categorical Architecture as the Foundation for Harness Engineering

Introduction

The paper "Harness Engineering as Categorical Architecture" (2605.12239) presents a rigorous categorical formalization of agent harness engineering, articulating a structural and property-preserving paradigm that replaces prevailing ad hoc methodologies in the design of LLM-based agent systems. By connecting the LangChain-driven "harness" abstraction—comprising prompts, tools, memory, and orchestration logic—to the formal Architecture triple $(G, \mathrm{Know}, \Phi)$ from the ArchAgents category, the work offers a compositional, certificate-driven approach to specifying and compiling agent infrastructures. This essay examines the foundational mapping between agent externalization and categorical architecture, substantiates the structural guarantee preservation claims, discusses the operational validation, and considers the theoretical and practical implications for the design and portability of agent harnesses.

Theoretical Framework: Mapping Externalization to Categorical Architecture

Externalization Pillars and Their Categorical Roles

The paper anchors its thesis in a correspondence between four major "pillars" of agent externalization—Memory, Skills, Protocols, and Harness Engineering—and the components of the Architecture triple:

Memory is modeled as coalgebraic state, enabling reconstruction of agent knowledge at any temporal decision point through bi-temporal memory structures.
Skills correspond to operad-composed objects, supporting serial, parallel, and traced (feedback) composition with property preservation guarantees—formally realized via typed operads.
Protocols map to the syntactic wiring graph $G$, specifying typechecked modular wiring, port-level integrity, and formal wiring optics (lenses, prisms, traversals) for information flow.
Harness Engineering is the categorical architecture itself: the entire triple $(G, \mathrm{Know}, \Phi)$, where $\mathrm{Know}$ acts as the structural knowledge and certificate store.
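
To make the correspondence concrete, the triple can be pictured as a small data structure. The sketch below is illustrative only and is not the paper's reference implementation: the class names `WiringGraph` and `Architecture`, the field names, and the reduction of port-level integrity to "edges only join declared stages" are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WiringGraph:
    """G: stages plus directed edges between them (hypothetical encoding)."""
    stages: tuple[str, ...]
    edges: tuple[tuple[str, str], ...]

    def __post_init__(self) -> None:
        # Integrity check, simplified: an edge may only join declared stages.
        known = set(self.stages)
        for src, dst in self.edges:
            if src not in known or dst not in known:
                raise ValueError(f"edge {src!r} -> {dst!r} uses an unknown stage")


@dataclass(frozen=True)
class Architecture:
    """The triple (G, Know, Phi), with Know and Phi as plain mappings."""
    graph: WiringGraph                                   # G
    know: dict[str, str] = field(default_factory=dict)   # certificate name -> theorem
    phi: dict[str, str] = field(default_factory=dict)    # stage -> model identifier


arch = Architecture(
    graph=WiringGraph(
        stages=("localize", "edit", "test"),
        edges=(("localize", "edit"), ("edit", "test")),
    ),
    know={"integrity_gate": "every edge joins declared stages"},
    phi={"localize": "small-model", "edit": "small-model", "test": "small-model"},
)
```

Keeping `know` and `phi` as separate fields mirrors the separation the paper draws between structural certificates and model assignment.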

Certificates and Structural Guarantees

Structural guarantees (integrity gates, escalation controls, and convergence checks) are formalized as certificates attached to $\mathrm{Know}$, independent of any underlying model. Each certificate comprises a theorem, a parameter binding, and mechanical evidence, all of which must survive unchanged when the harness is compiled into a different orchestration framework.
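
A certificate with these three parts, together with the identity-plus-replay check, might look like the following. This is a minimal sketch under stated assumptions: the `Certificate` class, the `replay` and `preserved` helpers, and the example theorem `escalation_threshold_valid` are hypothetical names chosen for illustration, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Certificate:
    theorem: str                                      # name of the guaranteed property
    parameters: Mapping[str, float]                   # parameter binding
    evidence: Callable[[Mapping[str, float]], bool]   # re-runnable mechanical check


def replay(cert: Certificate) -> bool:
    """Re-run the mechanical evidence against the bound parameters."""
    return bool(cert.evidence(cert.parameters))


def preserved(before: Certificate, after: Certificate) -> bool:
    """Structural preservation: same theorem, same parameters, evidence still replays."""
    return (
        before.theorem == after.theorem
        and dict(before.parameters) == dict(after.parameters)
        and replay(after)
    )


def threshold_in_unit_interval(params: Mapping[str, float]) -> bool:
    # Hypothetical evidence: the escalation threshold must be a valid score bound.
    return 0.0 <= params["threshold"] <= 1.0


original = Certificate("escalation_threshold_valid", {"threshold": 0.7},
                       threshold_in_unit_interval)
recompiled = Certificate("escalation_threshold_valid", {"threshold": 0.7},
                         threshold_in_unit_interval)
tampered = Certificate("escalation_threshold_valid", {"threshold": 1.5},
                       threshold_in_unit_interval)
```

Here `preserved(original, recompiled)` holds, while `preserved(original, tampered)` fails on both the parameter comparison and the replay. Note that nothing in the check inspects model outputs, which matches the paper's framing of preservation as structural replay.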

Deployment Map Abstraction

The deployment map $\Phi$ decouples the architecture from the specific model assignment, which is what makes the harness model-parametric. Because $\mathrm{Know}$-level certificates do not depend on the choices made in $\Phi$, the harness's structural properties carry over when models or tools are swapped.
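
The effect of swapping $\Phi$ can be sketched as a function that replaces only the deployment map. The `Harness` class, the `redeploy` helper, and the model identifiers below are assumptions for illustration; the point is simply that the certificate store is never touched by the swap.

```python
from dataclasses import dataclass, replace
from typing import Mapping


@dataclass(frozen=True)
class Harness:
    stages: tuple[str, ...]
    know: tuple[str, ...]     # certificate names (structural, model-independent)
    phi: Mapping[str, str]    # deployment map: stage -> model identifier


def redeploy(harness: Harness, new_phi: Mapping[str, str]) -> Harness:
    """Swap the deployment map; stages and certificates are left untouched."""
    missing = set(harness.stages) - set(new_phi)
    if missing:
        raise ValueError(f"no model assigned to stages: {sorted(missing)}")
    return replace(harness, phi=dict(new_phi))


base = Harness(
    stages=("draft", "verify"),
    know=("integrity_gate", "quality_escalation"),
    phi={"draft": "model-a", "verify": "model-a"},
)
swapped = redeploy(base, {"draft": "model-b", "verify": "model-a"})
```

After the swap, `swapped.know` equals `base.know` while `swapped.phi` differs, which is the model-parametric property in miniature.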

Operationalization: Compilers, Certificate Preservation, and Skill Composition

Compiler Functors and Preservation Invariants

The implementation features compiler functors mapping categorical architectures to various orchestration systems (Swarms, DeerFlow, Ralph, Scion, LangGraph). Each compiler is subjected to:

Stage and edge preservation.
Explicit replay and mechanical verification of certificates (checking theorem identity, parameter preservation, and evidence replay).
Deployment map compatibility.

Fully automated certificate preservation is observed across all tested compilers for the supported certificate types—ensuring that certificates do not degrade or become invalid under framework translation.
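
The three checks listed above can be sketched as a single preservation test run against a compiler's output. Everything here is hypothetical: the dictionary layout, the target config keys (`nodes`, `links`, `certificates`, `models`), and the `VERIFIERS` registry are assumptions for the example and do not reproduce any of the named frameworks' formats.

```python
from typing import Callable, Mapping

# Hypothetical verifier registry: theorem name -> re-runnable check on parameters.
VERIFIERS: dict[str, Callable[[Mapping[str, float]], bool]] = {
    "escalation_threshold_valid": lambda p: 0.0 <= p["threshold"] <= 1.0,
}

arch = {
    "stages": ["draft", "verify"],
    "edges": [("draft", "verify")],
    "know": [{"theorem": "escalation_threshold_valid", "params": {"threshold": 0.7}}],
    "phi": {"draft": "model-a", "verify": "model-b"},
}


def compile_to_config(a: dict) -> dict:
    """Toy compiler: re-express the architecture in a made-up target format."""
    return {
        "nodes": list(a["stages"]),
        "links": [list(edge) for edge in a["edges"]],
        "certificates": [
            {"theorem": c["theorem"], "params": dict(c["params"])} for c in a["know"]
        ],
        "models": dict(a["phi"]),
    }


def check_preservation(a: dict, compiled: dict) -> list[str]:
    """Return the names of failed checks; an empty list means all were preserved."""
    failures = []
    if set(compiled["nodes"]) != set(a["stages"]):
        failures.append("stages")
    if {tuple(link) for link in compiled["links"]} != set(a["edges"]):
        failures.append("edges")
    certs_ok = len(compiled["certificates"]) == len(a["know"]) and all(
        after["theorem"] == before["theorem"]              # theorem identity
        and after["params"] == before["params"]            # parameter preservation
        and VERIFIERS[after["theorem"]](after["params"])   # evidence replay
        for before, after in zip(a["know"], compiled["certificates"])
    )
    if not certs_ok:
        failures.append("certificates")
    if not set(compiled["nodes"]) <= set(compiled["models"]):
        failures.append("deployment_map")
    return failures
```

Calling `check_preservation(arch, compile_to_config(arch))` returns an empty list for this toy compiler; altering a threshold or dropping a model assignment in the compiled output makes the corresponding check fail.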

LangGraph Execution and Observability

A notable contribution is that the LangGraph compiler creates one node per stage, each calling the same per-stage method (run_single_stage) as the native runtime. Because both execution paths share that method, certificates are preserved through the shared path rather than through a reimplementation. LangGraph-native observability is thus achieved without duplicating harness logic, which serves both developer ergonomics and formal correctness.
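
The shared-method idea can be illustrated without the LangGraph library itself. In the sketch below, only the name `run_single_stage` comes from the paper; its signature, the state shape, and the helpers `run_native`, `build_nodes`, and `run_graph` are assumptions, and `build_nodes` merely stands in for a compiler that emits one callable per stage.

```python
from typing import Callable

State = dict[str, list[str]]


def run_single_stage(stage: str, state: State) -> State:
    """Shared per-stage step (signature assumed): record the stage in a trace."""
    return {**state, "trace": state.get("trace", []) + [stage]}


def run_native(stages: list[str], state: State) -> State:
    """Native runtime: call the shared method once per stage, in order."""
    for stage in stages:
        state = run_single_stage(stage, state)
    return state


def build_nodes(stages: list[str]) -> dict[str, Callable[[State], State]]:
    """Stand-in for the graph compiler: one node per stage, each a thin wrapper."""
    # The default argument pins each stage name, avoiding late-binding closures.
    return {s: (lambda st, s=s: run_single_stage(s, st)) for s in stages}


def run_graph(stages: list[str], nodes: dict[str, Callable[[State], State]],
              state: State) -> State:
    """Execute the node callables along a linear path through the graph."""
    for stage in stages:
        state = nodes[stage](state)
    return state


stages = ["localize", "edit", "test"]
native_result = run_native(stages, {"trace": []})
graph_result = run_graph(stages, build_nodes(stages), {"trace": []})
```

Since each node is only a wrapper around the shared function, the two results coincide by construction for this linear graph, and a per-node hook (logging, tracing) could be added in `build_nodes` without touching the stage logic.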

Atomic Skills as Operad Objects

Drawing on empirical findings from joint skill training [Ma et al., 2026], atomic skills (e.g., localization, editing, testing, reproduction, review) serve as operad objects. Their composition via serial, parallel, and traced operations attests to the compatibility of categorical composition with task-specific non-interference requirements, reinforcing the closure properties needed by the operad framework.
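
Serial and parallel composition with a type check at the junction can be sketched as follows. This is a simplified illustration, not the typed-operad construction from the paper: the `Skill` class, the string port labels, and the toy `localize` and `edit` functions are assumptions, and traced (feedback) composition is omitted.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Skill:
    name: str
    inp: str                    # input port type, as a plain label
    out: str                    # output port type, as a plain label
    run: Callable[[Any], Any]


def serial(f: Skill, g: Skill) -> Skill:
    """f then g; only allowed when f's output type matches g's input type."""
    if f.out != g.inp:
        raise TypeError(f"cannot wire {f.name} ({f.out}) into {g.name} ({g.inp})")
    return Skill(f"{f.name};{g.name}", f.inp, g.out, lambda x: g.run(f.run(x)))


def parallel(f: Skill, g: Skill) -> Skill:
    """f and g side by side on a pair of inputs, with no shared state."""
    return Skill(
        f"{f.name}|{g.name}",
        f"({f.inp},{g.inp})",
        f"({f.out},{g.out})",
        lambda pair: (f.run(pair[0]), g.run(pair[1])),
    )


# Toy stand-ins for atomic skills; real skills would call a model or a tool.
localize = Skill("localize", "issue", "location", lambda issue: f"file for {issue}")
edit = Skill("edit", "location", "patch", lambda loc: f"patch at {loc}")

pipeline = serial(localize, edit)
both = parallel(localize, localize)
```

The composite is again a `Skill`, which is the closure property in miniature: `pipeline` has type `issue -> patch` and can itself be composed further, while `serial(edit, localize)` is rejected at wiring time.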

Empirical Evaluation

Escalation Experiment

Parameterizing the harness over two LLMs (Phi-3 Mini, Gemma 4), the authors demonstrate model-parametric escalation control: a verifier component scores outputs, and a watcher escalates execution to a more capable model when quality falls below a harness-level threshold. Within this two-model, one-task experiment, the result supports the claim that control logic and quality-based gating are encoded and preserved at the $\mathrm{Know}$ level, independent of concrete model internals.
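
The escalation control path described here reduces to a short loop. The sketch below uses stub functions in place of real LLM calls; `run_with_escalation`, the stub models, the stub verifier, and the threshold value are all assumptions for illustration and do not reproduce the paper's experiment.

```python
from typing import Callable, Sequence


def run_with_escalation(
    task: str,
    models: Sequence[Callable[[str], str]],   # ordered from least to most capable
    verifier: Callable[[str], float],         # scores an output in [0, 1]
    threshold: float,                         # harness-level quality threshold
) -> tuple[str, int, float]:
    """Return (output, index of the model used, score), escalating on low scores."""
    if not models:
        raise ValueError("at least one model is required")
    for index, model in enumerate(models):
        output = model(task)
        score = verifier(output)
        if score >= threshold:
            break   # quality gate passed; no further escalation
    return output, index, score


# Stubs standing in for a small and a larger model (not real LLM calls).
def small_model(task: str) -> str:
    return ""


def large_model(task: str) -> str:
    return f"answer to {task}"


def nonempty_verifier(output: str) -> float:
    return 1.0 if output else 0.0


result = run_with_escalation("task-1", [small_model, large_model],
                             nonempty_verifier, threshold=0.5)
```

The loop never inspects which model it is calling, only the verifier's score against the threshold, so the control path is the same whichever models are bound to the two slots.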

SWE-bench-lite: Format-Discipline Ceiling

A stringent evaluation on SWE-bench-lite (10 Python bug-fix instances) shows that, despite rigorous harness structuring and input grounding, the principal bottleneck is the inability of 8B-class models to emit format-correct unified diffs. Grounding with repository context eliminates path hallucinations, but the overwhelming failure mode remains invalid diff formatting, regardless of harness complexity or division into serial stages. Retry strategies that prompt with sanitizer reason codes fail to recover any submissions, indicating a hard format-discipline ceiling at the 8B scale, regardless of model training regime or harness decomposition.
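
To show what "format-correct" demands, here is a minimal single-file unified-diff check that returns a reason code on failure. It is not the paper's sanitizer: the function name, the reason codes, and the decision to reject blank context lines are assumptions made for this sketch.

```python
import re
from typing import Optional

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def check_unified_diff(patch: str) -> Optional[str]:
    """Return None if the patch looks well-formed, else a (hypothetical) reason code."""
    lines = patch.splitlines()
    if len(lines) < 3 or not lines[0].startswith("--- ") or not lines[1].startswith("+++ "):
        return "missing_file_header"
    i = 2
    while i < len(lines):
        if lines[i].startswith("\\"):   # e.g. "\ No newline at end of file"
            i += 1
            continue
        match = HUNK_RE.match(lines[i])
        if match is None:
            return "bad_hunk_header"
        # An omitted line count in a hunk header means 1.
        old_left = int(match.group(2) or 1)
        new_left = int(match.group(4) or 1)
        i += 1
        while i < len(lines) and (old_left > 0 or new_left > 0):
            tag = lines[i][:1]
            if tag == " ":      # context line counts on both sides
                old_left -= 1
                new_left -= 1
            elif tag == "-":    # removed line counts on the old side
                old_left -= 1
            elif tag == "+":    # added line counts on the new side
                new_left -= 1
            elif tag != "\\":
                return "bad_hunk_line"
            i += 1
        if old_left != 0 or new_left != 0:
            return "hunk_count_mismatch"
    return None


good_patch = "--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,2 @@\n x = 1\n-y = 2\n+y = 3\n"
bad_counts = "--- a/f.py\n+++ b/f.py\n@@ -1,3 +1,2 @@\n x = 1\n-y = 2\n+y = 3\n"
```

Even this small check requires the header counts to agree exactly with the hunk body, which gives a sense of why a model that drifts on line counts produces patches that cannot be applied, however the surrounding harness is structured.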

Certificate Preservation Across Compilers

Empirical testing confirms 100% identity and replayability of three distinct certificate types (priority gating, quorum sensing, mTOR convergence) across all compilers, substantiating the core claim of property-preserving transformation under categorical compilation.

Implications and Limitations

Implications for AI Agent Engineering

Formal Design Language: The mapping enables engineering harnesses with precise, formally specified and verifiable properties.
Portability: Harness behavior and guarantees are invariantly transportable across orchestration paradigms by construction.
Type-Safe Composition: Operad-guided skill composition enables safe construction of complex workflows using reusable atomic primitives.
Parameterization and Model-Agnosticism: System-level guarantees centralize at the harness, not the model tier, supporting agile deployment and evolution as model capabilities change.

Limitations and Open Questions

Static Snapshot: The formalism addresses architecture at a fixed point in time; adaptation and learning (dynamic harness evolution) remain unmodeled.
Limited Certificate Scope: Behavioral certificates (e.g., hallucination control) are not yet formalized.
Single Implementation Dependency: Generalization beyond the presented reference implementation (Operon) awaits further independent integrations.
Task Limitations at Model Scale: No evidence for improvements in complex task resolution (SWE-bench-lite) is offered at the current model size; harness improvements are only meaningful when core model capabilities suffice.

Conclusion

The categorical architecture framework fundamentally recasts harness engineering from an implementation-driven art to a mathematically grounded science. By identifying harnesses with Architecture triples and targeting structural guarantees as replayable, certificate-driven invariants, the approach opens a new regime of property-driven, composable, and portable agent engineering. The practical imperative is clear: harnesses should be designed with structural certificates as first-class objects, transformations verified for property preservation, and compositions structured by operad-theoretic compatibility. While the model remains the limiting agent capability at small scale, the categorical framework equips practitioners to reason about, compare, and deploy harnesses with predictable properties as foundation models continue to scale.

Reference: "Harness Engineering as Categorical Architecture" (2605.12239)