Placement of Evaluation Scaffolding in the Harness

Determine whether generator–evaluator separation, sprint contracts, and post-hoc checks (building on Self-Refine) should be implemented inside the Claude Code harness (for example, as additional hook events) or outside it as a separate evaluation layer.

Background

The paper reports that silent failures are a dominant risk in agent systems and argues that closing the observability–evaluation gap likely requires additional scaffolding beyond model improvements. It then raises two architectural questions: the first concerns where evaluation scaffolding should live (inside the harness vs. an external layer), and the second concerns whether the existing hook pipeline can host such scaffolding within the current context budget.

This problem focuses on the first question—clarifying the locus of evaluation mechanisms relative to the agent harness and its hook system.

References

Against the permission pipeline and tool-orchestration layers analysed in \Cref{sec:auth,sec:turn}, two architectural questions remain open. First, the cited sources do not settle whether the scaffolding the paper cites from \citet{anthropic2026harness} (generator--evaluator separation, sprint contracts, and post-hoc checks, building on \citet{madaan2023selfrefine}'s Self-Refine pattern) belongs inside the harness (e.g., as additional hook events alongside the 27 documented in \Cref{sec:ext}) or outside it as a separate evaluation layer.
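The alternative placement the excerpt contrasts, a separate evaluation layer, can be sketched as an evaluator that only sees the finished transcript and leaves the harness unmodified. Everything below is hypothetical (the transcript shape, the generator stub, the specific checks); it illustrates the generator–evaluator separation pattern, not any cited implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Minimal stand-in for the record an agent run leaves behind."""
    steps: list[str] = field(default_factory=list)


def generator(task: str) -> Transcript:
    # Stand-in for a full agent run inside the harness; the evaluator never
    # inspects this function, only the transcript it returns.
    return Transcript(steps=[f"plan: {task}", "edit: foo.py", "test: 0 passed"])


def evaluator(t: Transcript) -> list[str]:
    """Post-hoc checks over the finished transcript, independent of the generator."""
    findings = []
    if not any(s.startswith("test:") for s in t.steps):
        findings.append("no tests were run")
    if any("0 passed" in s for s in t.steps):
        findings.append("tests ran but nothing passed")
    return findings
```

Because the evaluator runs after the fact on the transcript alone, it adds no tokens to the agent's context window; the cost is that its findings arrive too late to redirect the run that produced them.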

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems (2604.14228, Liu et al., 14 Apr 2026), Section 12.1 (Silent Failure and the Observability–Evaluation Gap)