Auto-Eval Judge Framework

Updated 24 February 2026
  • Auto-Eval Judge is a modular, domain-agnostic evaluation framework that decomposes complex tasks into verifiable sub-goals with clear, traceable evidence.
  • It employs a four-stage pipeline to generate criteria, parse execution logs, classify task components, and aggregate verdicts into a final, interpretable decision.
  • Experimental validation reveals improved accuracy and precision over black-box LLM methods, supporting reproducible, low-labor evaluations across domains.

Auto-Eval Judge is a modular, domain-agnostic evaluation framework designed for systematic, human-aligned assessment of complex agentic systems capable of multi-step reasoning and tool interaction. Unlike black-box LLM-as-a-Judge protocols, which evaluate only final outputs, Auto-Eval Judge explicitly decomposes tasks into verifiable sub-goals, retrieves and validates supporting evidence for each step, and aggregates stepwise correctness assessments into a final, interpretable verdict. The architecture aligns computational evaluation with the standard working practice of expert human graders and enables transparent, low-labor evaluation across diverse domains—both for classic NLP tasks and more general agentic pipelines (Bhonsle et al., 7 Aug 2025).

1. Modular Architecture and Workflow

Auto-Eval Judge operates as a four-stage pipeline, each stage corresponding to a core aspect of expert evaluation:

  1. Criteria Generator
    • Consumes the original task description T and produces an evaluation checklist Q = {q_1, ..., q_n} of binary (yes/no) queries that together cover all explicit task requirements. Each query addresses a non-overlapping sub-aspect, enforced by an LLM-driven redundancy filter.
  2. Artifact Content Parser
    • Indexes the Actor's full execution log L into fixed-size, summarized chunks. For every checklist query q_i, the Artifact Content Parser retrieves the most relevant chunk summaries using an embedding-based cross-encoder, and an LLM extracts a minimal, text-based "proof" snippet p_i, yielding P = {p_1, ..., p_n}.
  3. Criteria Check Composer (C3)
    • Each checklist pair (q_i, p_i) is classified as factual, logical, or coding. Factual and coding queries are routed through a code-execution handler (e.g., the Magentic-One agent pipeline); logical queries require LLM inference. The system constructs a decision tree ("decision_path_i") and issues a local correctness judgment c_i ∈ {0, 1} with an accompanying rationale r_i.
  4. Verdict Generator
    • Aggregates the local judgments {c_1, ..., c_n} and their rationales, and applies an LLM "reasoning pass" to produce a final binary verdict V ∈ {Yes, No} on overall task success or failure.

This architecture is modular: each stage’s design can be adapted for alternative document types, reasoning modalities, or verification toolchains as needed (Bhonsle et al., 7 Aug 2025).
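The four stages above can be sketched as a minimal pipeline. This is an illustrative stand-in, not the paper's implementation: the function names (`generate_criteria`, `parse_artifacts`, `check_criteria`, `final_verdict`) and the naive keyword-matching retrieval are assumptions standing in for the LLM-driven components.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    query: str
    proof: str
    correct: bool
    rationale: str

def generate_criteria(task: str) -> list[str]:
    # Stage 1 (sketch): turn the task description into binary yes/no queries.
    # A real system would call an LLM with a redundancy filter here.
    return [f"Does the log satisfy requirement: {req.strip()}?" for req in task.split(";")]

def parse_artifacts(log: str, queries: list[str]) -> list[str]:
    # Stage 2 (sketch): retrieve a minimal proof snippet per query.
    # Stand-in: naive keyword match over log lines, instead of
    # chunk summarization plus embedding-based retrieval.
    lines = log.splitlines()
    return [next((ln for ln in lines if q.split(":")[-1].strip(" ?") in ln), "")
            for q in queries]

def check_criteria(queries: list[str], proofs: list[str]) -> list[Judgment]:
    # Stage 3 (sketch): judge each (query, proof) pair.
    # Here, a non-empty proof counts as a pass; the paper routes pairs
    # to type-specific verifiers instead.
    return [Judgment(q, p, bool(p), "proof found" if p else "no supporting evidence")
            for q, p in zip(queries, proofs)]

def final_verdict(judgments: list[Judgment]) -> str:
    # Stage 4 (sketch): strict conjunction over local judgments.
    return "Yes" if all(j.correct for j in judgments) else "No"

def auto_eval_judge(task: str, log: str) -> str:
    queries = generate_criteria(task)
    proofs = parse_artifacts(log, queries)
    return final_verdict(check_criteria(queries, proofs))
```

Each stage is a plain function, so any one of them can be swapped out (e.g., a different retriever or verifier) without touching the others, mirroring the modularity claim.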

2. Formal Definitions and Correctness Computation

The core evaluation procedure is underpinned by formal mappings:

  • Checklist generation: Q = CriteriaGenerator(T).
  • Proof retrieval: for each q_i, p_i = Retriever(Index(L), q_i).
  • Step correctness: c_i = C3(q_i, p_i) ∈ {0, 1}; equivalently, a verification function V_i : (q_i, p_i) → {0, 1}.
  • Interpretability: each local judgment c_i is paired with a natural-language rationale r_i and an explicit decision path.

The modular formalism enables transparent task decomposition and supports both strict and thresholded success criteria.

3. Aggregation and Final Verdict Logic

The final verdict synthesis employs either conjunction or thresholded aggregation over step-level correctness:

  • Strict conjunction: V_final = 1 if c_i = 1 for all i; 0 otherwise.
  • Thresholded sum with tolerance: V_final = 1 if Σ_{i=1}^{n} c_i ≥ n − ε, for a small tolerance ε.
  • Fractional threshold: V_final = 1 if (1/n) Σ_i c_i ≥ τ, for a user-set τ, typically 0.95.

Such aggregation approaches support both high-precision regimes requiring all sub-tasks to be correct, and more practical settings tolerant to marginal or optional errors.
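The three aggregation rules above are direct to implement. A minimal sketch, taking the step-level judgments c_i as a list of 0/1 values (function names are illustrative):

```python
def strict_conjunction(c: list[int]) -> int:
    # V_final = 1 iff every sub-task is correct.
    return int(all(c))

def tolerant_sum(c: list[int], eps: int = 1) -> int:
    # V_final = 1 iff at most eps sub-tasks failed: sum(c_i) >= n - eps.
    return int(sum(c) >= len(c) - eps)

def fractional_threshold(c: list[int], tau: float = 0.95) -> int:
    # V_final = 1 iff the pass rate (1/n) * sum(c_i) meets threshold tau.
    return int(sum(c) / len(c) >= tau)
```

For example, with judgments [1, 1, 1, 0], strict conjunction fails, the tolerant sum with ε = 1 passes, and the fractional threshold at τ = 0.95 fails (pass rate 0.75).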

4. Human-Like Criteria and Interpretability

Auto-Eval Judge emulates the methodology of expert human evaluation:

  • Explicit task decomposition: Converts loosely-specified agentic objectives into a closed checklist of atomic requirements.
  • Trace-based evidence collection: Locates explicit proof of task completion in agent execution logs, not relying on declared final outputs alone.
  • Semantic type classification: Differentiates between requirements necessitating factual lookup, code/tool execution, or logical inference.
  • Dynamic verification: Deploys specialized multi-agent or LLM pipelines depending on the sub-task type.
  • Transparent justifications: Generates human-readable rationales alongside decision paths for each local assessment, facilitating error analysis and external audit (Bhonsle et al., 7 Aug 2025).
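The semantic-type classification and dynamic verification steps amount to a dispatch over query types. A hedged sketch, assuming a toy keyword classifier and placeholder handlers (the paper uses an LLM classifier, a code-execution pipeline for factual/coding queries, and LLM inference for logical ones):

```python
def classify(query: str) -> str:
    # Toy keyword classifier standing in for the paper's LLM-based one.
    if "code" in query or "function" in query:
        return "coding"
    if "therefore" in query or "implies" in query:
        return "logical"
    return "factual"

# Placeholder verifiers, one per semantic type; real handlers would run
# code or query an LLM rather than inspect the proof string.
HANDLERS = {
    "factual": lambda q, p: p != "",                               # evidence lookup
    "coding":  lambda q, p: p != "" and "error" not in p.lower(),  # execution check
    "logical": lambda q, p: len(p) > 0,                            # would be LLM inference
}

def verify(query: str, proof: str) -> bool:
    # Route each (query, proof) pair to the handler for its type.
    return HANDLERS[classify(query)](query, proof)
```

Keeping the handlers in a dictionary makes the routing itself auditable: the chosen type, handler, and outcome can all be logged as part of the decision path.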

5. Experimental Validation and Comparative Results

Auto-Eval Judge was validated on two benchmarks—GAIA (text-based agentic tasks) and BigCodeBench (program reasoning tasks):

  Dataset                    Baseline (GPT-4o)   Auto-Eval Judge   Gain
  GAIA (accuracy)            57.14%              61.90%            +4.76 pp
  BigCodeBench (accuracy)    63.16%              73.68%            +10.52 pp

On BigCodeBench, precision improves to 92.31% (Auto-Eval Judge) vs. 76.47% (baseline), while recall is slightly lower (92.30% vs. 100%), indicating a more conservative but more precise judge. Specificity also rises by 4.51 percentage points on GAIA. These results indicate that systematic, step-wise trace verification aligns more closely and reliably with expert annotation than output-only, black-box LLM judging (Bhonsle et al., 7 Aug 2025).

6. Limitations and Prospective Extensions

Current limitations include:

  • Restriction to text-only tasks and single log files, with no native support for images, audio, or multi-stream artifact evaluation.
  • Criteria Generator is limited to textual checklists and cannot parse structured artifacts or environmental byproducts.
  • Artifact Content Parser does not support real-time or environment-interactive agents.

Proposed extensions involve:

  • Environment Explorer modules to enable direct inspection of non-textual agent outputs.
  • Multi-modal Criteria Generator for images, video, or 3D scene validation.
  • Distributed, multi-log artifact indexing for more complex agent settings.

The framework is designed for extensibility, making it applicable to increasingly sophisticated agentic systems over time.


Auto-Eval Judge represents a significant advance for evaluation science in multi-step agentic systems by making the evidence chain explicit, decomposing complex queries into verifiable subcomponents, and aggregating those into a principled, human-aligned verdict. As such, it provides a foundation for unbiased, reproducible, and auditable evaluation practices in contemporary AI research and deployment (Bhonsle et al., 7 Aug 2025).
