AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

Published 24 Apr 2026 in cs.CL, cs.LG, and cs.PL | (2604.22937v1)

Abstract: Verification is becoming central to both reinforcement-learning-based training and inference-time control of LLMs. Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable verifiers and selects a compact verifier set whose joint satisfaction best approximates the target objective. Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Additional analyses show that the most useful verification targets vary by benchmark and model, and that the DAG-based search shifts the learned verifier sets toward more structural and semantically grounded checks. We further show that exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points. We release our code


Authors (2)

Summary

The paper introduces AutoPyVerifier, an automatic synthesis framework that generates compact, executable verifiers for LLM outputs using a directed acyclic graph search.
It employs a two-stage LLM critique–refine cycle to iteratively improve verifier sets, raising F1 by up to 55.0 points over the initial LLM-generated verifier sets.
The induced verifier sets transfer across generator models and, when exposed to an LLM as an external tool, improve downstream accuracy by up to 17.0 points while remaining transparent and deterministic.

AutoPyVerifier: Inducing Compact, Executable Verifiers for LLM Outputs

Motivation and Problem Definition

Verification is increasingly critical for LLM-centric systems, both at training time (e.g., RL fine-tuning, filter-based reward models) and at inference time (reranking, search, iterative self-correction). The two conventional verification paradigms trade off against each other: LLM-based judges are flexible but opaque and inconsistent, whereas deterministic executable verifiers (e.g., Python functions) are robust, transparent, and interpretable, yet usually hand-engineered, narrow, and brittle. The paper poses the central question: given a development set of (task input, LLM output, objective label) triples, can a compact set of Python verifiers be induced automatically whose joint satisfaction closely approximates a target verification objective such as correctness, admissibility, or task completion?

Method: The AutoPyVerifier Framework

AutoPyVerifier casts verifier-set synthesis as a search over a directed acyclic graph (DAG) whose nodes are candidate Python verifier sets. The framework proceeds in three phases: initialization, iterative search with critique–refine expansion, and final selection.

First, it prompts an LLM with the task description and labeled data to synthesize diverse, plausible Python verifier sets, conforming to a strict API contract (with explicit context extraction fields and deterministic logic). Each set is inserted as a DAG node.
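The summary does not spell out the API contract, so the following is only a minimal sketch of what a deterministic verifier set might look like for an AIME-style task. The context fields `input` and `output`, the two verifier functions, and the `jointly_satisfied` helper are all hypothetical names chosen for illustration; the one fact relied on is that AIME answers are integers from 0 to 999.

```python
import re
from typing import Callable, Dict, List

# Hypothetical contract (an assumption, not the paper's API): every verifier
# receives a context dict and returns a bool, with no randomness or I/O.
Context = Dict[str, str]
Verifier = Callable[[Context], bool]


def has_boxed_answer(ctx: Context) -> bool:
    """Pass if the output contains a \\boxed{...} final answer."""
    return "\\boxed{" in ctx["output"]


def answer_is_aime_integer(ctx: Context) -> bool:
    """Pass if the boxed answer is a 1-3 digit integer (0..999)."""
    match = re.search(r"\\boxed\{([^{}]*)\}", ctx["output"])
    if match is None:
        return False
    return re.fullmatch(r"[0-9]{1,3}", match.group(1).strip()) is not None


def jointly_satisfied(verifiers: List[Verifier], ctx: Context) -> bool:
    """A verifier set accepts an output only if every member passes."""
    return all(verifier(ctx) for verifier in verifiers)


# One candidate verifier set, i.e. one node of the search graph.
VERIFIER_SET: List[Verifier] = [has_boxed_answer, answer_is_aime_integer]
```

Calling `jointly_satisfied(VERIFIER_SET, ctx)` on each development example yields the binary prediction that is later compared with the objective label.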

DAG search is then performed: at each iteration, the node (verifier set) with the highest acquisition value is selected for expansion. The acquisition function integrates predictive quality (e.g., F1 with the objective label), exploration bonuses (UCB-style), compactness penalties, and feasibility heuristics.
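The exact form of the acquisition function is not given in this summary. As a rough sketch, one plausible combination of the four named ingredients is an additive score: F1 plus a UCB-style bonus minus a size penalty, with infeasible nodes excluded outright. The weights `c_explore` and `lam_size`, the visit-count bookkeeping, and the hard feasibility gate are assumptions made for illustration.

```python
import math
from typing import Sequence


def f1_score(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Binary F1 of verifier-set predictions against objective labels."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def acquisition(
    f1: float,
    visits: int,
    total_visits: int,
    n_verifiers: int,
    feasible: bool,
    c_explore: float = 0.5,   # assumed exploration weight
    lam_size: float = 0.02,   # assumed per-function compactness penalty
) -> float:
    """Score a node for expansion; higher means expand sooner."""
    if not feasible:  # e.g. a verifier crashed on the development set
        return float("-inf")
    # UCB-style bonus: large for rarely expanded nodes, shrinking with visits.
    bonus = c_explore * math.sqrt(math.log(total_visits + 1) / (visits + 1))
    return f1 + bonus - lam_size * n_verifiers
```

Under this form, two nodes with equal F1 are ordered by how rarely they have been expanded and by how few functions they contain.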

Expansion proceeds via a two-stage LLM critique–refine cycle: a critic module diagnoses current verifier weaknesses using false-positive and false-negative examples, proposing structured modifications. A modifier then synthesizes revised candidate verifier sets (children nodes), via addition, removal, or update of functions or aggregation logic. The resulting children are inserted into the DAG; the search continues until a stopping condition is met.
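A single expansion step can be sketched as follows. In the framework both the critic and the modifier are LLM calls; here they are replaced by deterministic stubs so the example runs on its own. The `Node` class, `collect_errors`, `expand`, and both stub functions are hypothetical names, and the toy verifiers operate directly on the output string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Verifier = Callable[[str], bool]
Example = Tuple[str, bool]  # (LLM output, objective label)


@dataclass(eq=False)
class Node:
    """One DAG node: a named verifier set plus its graph links."""
    verifiers: Dict[str, Verifier]
    parents: List["Node"] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def predict(node: Node, output: str) -> bool:
    return all(check(output) for check in node.verifiers.values())


def collect_errors(node: Node, dev: List[Example]) -> Tuple[List[str], List[str]]:
    """Split the node's mistakes into false positives and false negatives."""
    fps = [out for out, label in dev if predict(node, out) and not label]
    fns = [out for out, label in dev if not predict(node, out) and label]
    return fps, fns


def expand(node: Node, dev: List[Example], critic, modifier) -> List[Node]:
    """Critique the node on its errors, then attach the proposed children."""
    fps, fns = collect_errors(node, dev)
    critique = critic(node, fps, fns)
    new_children = []
    for verifiers in modifier(node, critique):
        child = Node(verifiers=verifiers, parents=[node])
        node.children.append(child)
        new_children.append(child)
    return new_children


def stub_critic(node: Node, fps: List[str], fns: List[str]) -> Dict[str, bool]:
    # Stand-in for the LLM critic: only reports which error type occurs.
    return {"too_lenient": bool(fps), "too_strict": bool(fns)}


def stub_modifier(node: Node, critique: Dict[str, bool]) -> List[Dict[str, Verifier]]:
    # Stand-in for the LLM modifier: adds one stricter check when too lenient.
    if not critique["too_lenient"]:
        return []
    revised = dict(node.verifiers)
    revised["ends_with_period"] = lambda out: out.strip().endswith(".")
    return [revised]
```

With a root node holding only a check for the substring `Answer:`, a development set containing an unterminated but labeled-incorrect answer produces one false positive, and the stub modifier responds by proposing a child with the extra check.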

The final verifier set is selected by ranking the explored candidates jointly on predictive utility and size.
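How utility and size are traded off at selection time is not specified here. One simple reading, shown purely as an assumption, is to keep every candidate whose F1 lies within a small tolerance of the best and then prefer the smallest set. The tolerance `eps`, the dictionary layout, and the function name `select_final` are illustrative.

```python
from typing import Dict, List, Union

Candidate = Dict[str, Union[str, float, int]]


def select_final(candidates: List[Candidate], eps: float = 0.01) -> Candidate:
    """Pick the smallest verifier set whose F1 is within eps of the best."""
    best_f1 = max(c["f1"] for c in candidates)
    near_best = [c for c in candidates if c["f1"] >= best_f1 - eps]
    # Smaller sets first; among equal sizes, higher F1 first.
    return min(near_best, key=lambda c: (c["size"], -c["f1"]))


# Made-up scores for illustration only.
explored = [
    {"name": "a", "f1": 0.900, "size": 5},
    {"name": "b", "f1": 0.895, "size": 2},
    {"name": "c", "f1": 0.700, "size": 1},
]
chosen = select_final(explored)
```

With these made-up scores the two-function set `b` is chosen: it gives up half an F1 point relative to `a` but needs three fewer functions, while `c` is too inaccurate to qualify.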

Figure 1: AutoPyVerifier system overview—DAG-based search, LLM-synthesized and refined Python verifiers, and objective-aligned selection pipeline.

Figure 2: DAG search mechanics—a node is expanded using LLM-guided critique and modification, supporting iterative, data-driven verifier refinement.

Experimental Results

AutoPyVerifier is evaluated on a suite of benchmarks targeting mathematical reasoning (AIME), code generation (LiveCodeBench), multi-step constrained function calling (ComplexFuncBench), and generalized instruction-following (IFBench), using several major LLMs (GPT-5.4, Gemini-3.1, Claude Haiku 4.5, etc.).

The induced verifier sets are consistently compact (1 to 6 functions) and substantially improve agreement with the target objective: up to +55.0 F1 points over the initial LLM-generated verifier sets (Table 1 in the original paper), with the largest gains on structurally complex benchmarks. Ablation analyses indicate that the importance of each utility term (exploration, compactness penalization, and feasibility) depends strongly on the task.

Figure 3: Linear regression ablation—task-specific coefficients for each utility term, indicating nontrivial dependencies between search hyperparameters and verification efficacy.

Cross-model transfer experiments suggest that AutoPyVerifier largely avoids overfitting to the idiosyncrasies of the generator: induced verifier sets generalize to outputs from other LLMs, often maintaining high predictive F1 and in some cases outperforming naive LLM-only scoring baselines. Some sets nonetheless show moderate out-of-distribution (OOD) degradation, which leaves open how invariant the discovered logic is.

Figure 4: Verification category taxonomy shift—search reduces superficial presence-based checks (e.g., content/entity) in favor of structural, internal consistency, and semantic proxies.

Qualitative analysis of initial versus final verifier sets, across all benchmarks, reveals a systematic shift: post-search sets contain fewer surface or presence checks and more logic that captures deeper structure (e.g., answer–explanation agreement, contract satisfaction, execution-based feedback).

Impact on Downstream LLM Inference

Beyond verification, AutoPyVerifier-derived verifier sets are exposed to LLMs as tools at inference time. These externalized, modular correctness checks provide actionable signals that can drive self-improvement or rejection pipelines. Empirically, plugging induced verifiers into system prompts for GPT-4.1 yields accuracy gains of up to 17.0 points (on IFBench), suggesting that the verifiers are interpretable and actionable for LLMs equipped for tool use.
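The inference-time use described above can be illustrated with a small retry loop in which the names of failed checks are fed back to the generator. This is a sketch under stated assumptions rather than the paper's pipeline: `generate_with_verification`, the two instruction-following checks, and `stub_model` (a stand-in for a real LLM call) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Verifier = Callable[[str], bool]
Generator = Callable[[str, List[str]], str]  # (prompt, feedback) -> output


def failed_checks(verifiers: Dict[str, Verifier], output: str) -> List[str]:
    """Names of the verifiers that reject this output."""
    return [name for name, check in verifiers.items() if not check(output)]


def generate_with_verification(
    generate: Generator,
    verifiers: Dict[str, Verifier],
    prompt: str,
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Regenerate until all verifiers pass or the attempt budget runs out."""
    feedback: List[str] = []
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt, feedback)
        feedback = failed_checks(verifiers, output)
        if not feedback:
            return output, attempt
    return output, max_attempts


# Two toy instruction-following checks (assumed, not taken from IFBench).
VERIFIERS: Dict[str, Verifier] = {
    "all_uppercase": lambda out: out == out.upper(),
    "at_most_three_words": lambda out: len(out.split()) <= 3,
}


def stub_model(prompt: str, feedback: List[str]) -> str:
    # Stand-in for an LLM: fixes casing only after being told it failed.
    draft = "hello world"
    return draft.upper() if "all_uppercase" in feedback else draft
```

Because the checks are deterministic and named, the feedback is stable across retries and can be read directly by the model or by a developer debugging the pipeline.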

Theoretical and Practical Implications

This work fundamentally reframes verifier design as an induction and search problem, not a static engineering task. AutoPyVerifier demonstrates the efficacy of leveraging LLMs for latent logic synthesis, but crucially filters and refines such logic through a DAG search that is task-objective centric and model-agnostic.

The compactness and executable nature of the discovered verifier sets facilitate interpretability, transparent debugging, and composition with other system components, and they open the possibility of robust OOD generalization. The framework is broadly applicable: reward modeling in RLHF, inference-time reranking and self-correction, LLM agentic tool use, and similar settings. Natural extensions include multi-agent and multi-objective contexts, and integration with RL for joint optimization of generators and verifiers.

Conclusion

AutoPyVerifier represents an effective and modular pipeline for automatic induction of deterministic, objective-aligned Python verifiers for LLM outputs. The approach integrates LLM-based logic proposal with DAG-based search and principled utility optimization, resulting in compact, interpretable, and transferable verifier sets. Empirical results validate strong performance gains, cross-model generalization, and practical downstream utility. This work establishes data-driven, compositional verifier synthesis as a key axis for reliable, scalable LLM system construction.
