- The paper introduces AutoPyVerifier, an automatic synthesis framework that generates compact, executable verifiers for LLM outputs using a directed acyclic graph search.
- It employs a two-stage LLM critique–refine cycle to iteratively optimize verifier sets, achieving up to +55.0 F1 improvement on benchmark tasks.
- The framework generalizes across models and applications, enhancing downstream inference accuracy by up to +17.0% while ensuring transparency and robustness.
AutoPyVerifier: Inducing Compact, Executable Verifiers for LLM Outputs
Motivation and Problem Definition
Verification is increasingly critical for LLM-centric systems, both at training time (e.g., RL fine-tuning, filter-based reward models) and at inference time (reranking, search, iterative self-correction). Two verification paradigms dominate, each with its own trade-offs: LLM-based judges are flexible but opaque and inconsistent, whereas deterministic executable verifiers (e.g., Python functions) are robust, transparent, and interpretable, yet usually hand-engineered, narrow, and brittle. The paper formulates the central question: given a development set of (task input, LLM output, objective label) triples, can compact sets of Python verifiers be induced automatically so that they robustly approximate arbitrary verification objectives such as correctness, admissibility, or task completion?
Method: The AutoPyVerifier Framework
AutoPyVerifier operationalizes verifier-set synthesis as a directed acyclic graph (DAG) search over the combinatorial space of candidate Python verifier sets. The framework consists of three principal algorithmic phases.
First, it prompts an LLM with the task description and labeled data to synthesize diverse, plausible Python verifier sets, conforming to a strict API contract (with explicit context extraction fields and deterministic logic). Each set is inserted as a DAG node.
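To make the idea of an API contract concrete, the following is a minimal sketch of what one synthesized verifier set could look like. The contract shown here (a context dictionary, boolean verifiers, conjunction as the aggregation rule) and all names in it are assumptions for illustration; the paper's actual contract may differ.

```python
from typing import Callable, Dict, List

# Hypothetical contract: each verifier receives a context dict with
# explicitly extracted fields and returns a deterministic boolean.
Context = Dict[str, str]
Verifier = Callable[[Context], bool]


def extract_context(task_input: str, llm_output: str) -> Context:
    """Pull out the fields the verifiers are allowed to inspect."""
    lines = llm_output.strip().splitlines()
    final_line = lines[-1] if lines else ""
    return {"task_input": task_input, "output": llm_output, "final_line": final_line}


def has_final_answer(ctx: Context) -> bool:
    """Check that the last line declares an answer."""
    return ctx["final_line"].lower().startswith("answer:")


def answer_is_integer(ctx: Context) -> bool:
    """Check that the declared answer parses as an integer."""
    _, _, value = ctx["final_line"].partition(":")
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def run_verifier_set(verifiers: List[Verifier], task_input: str, llm_output: str) -> bool:
    """Aggregate by conjunction: accept only if every verifier passes."""
    ctx = extract_context(task_input, llm_output)
    return all(v(ctx) for v in verifiers)


# One candidate verifier set; in the framework this would be one DAG node.
VERIFIER_SET: List[Verifier] = [has_final_answer, answer_is_integer]
```

Because every function is pure and reads only the extracted context, the same output always receives the same verdict, which is the property that distinguishes this style of verifier from an LLM judge.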
DAG search is then performed: at each iteration, the node (verifier set) with the highest acquisition value is selected for expansion. The acquisition function integrates predictive quality (e.g., F1 with the objective label), exploration bonuses (UCB-style), compactness penalties, and feasibility heuristics.
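The acquisition function can be sketched as a weighted sum of the four ingredients named above. The exact functional form, the weights, and the field names below are assumptions for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List


def f1_score(predictions: List[bool], labels: List[bool]) -> float:
    """F1 of the verifier set's accept/reject decisions against objective labels."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


@dataclass
class NodeStats:
    f1: float            # predictive quality on the development set
    num_verifiers: int   # size of the verifier set
    visits: int          # how often this node has been expanded
    feasible: bool       # e.g., the code runs without raising on the dev set


def acquisition(
    node: NodeStats,
    total_visits: int,
    c_explore: float = 0.5,          # assumed weight of the exploration bonus
    size_penalty: float = 0.02,      # assumed cost per verifier function
    infeasible_penalty: float = 1.0, # assumed cost for infeasible nodes
) -> float:
    """Quality + UCB-style bonus - compactness penalty - infeasibility penalty."""
    bonus = c_explore * math.sqrt(math.log(total_visits + 1) / (node.visits + 1))
    penalty = size_penalty * node.num_verifiers
    feasibility = 0.0 if node.feasible else -infeasible_penalty
    return node.f1 + bonus - penalty + feasibility
```

Under this form, two nodes with equal F1 and size are separated by visit count, so rarely expanded nodes are preferred, and a larger verifier set must earn its extra functions through higher F1.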
Expansion proceeds via a two-stage LLM critique–refine cycle: a critic module diagnoses current verifier weaknesses using false-positive and false-negative examples, proposing structured modifications. A modifier then synthesizes revised candidate verifier sets (children nodes), via addition, removal, or update of functions or aggregation logic. The resulting children are inserted into the DAG; the search continues until a stopping condition is met.
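The select, critique, refine loop described above can be sketched as follows. This is a simplified illustration under several assumptions: the critic and modifier are plain callables standing in for LLM calls, node quality is plain accuracy rather than the full acquisition function, each node is expanded at most once, and an edge is skipped whenever it would close a cycle. None of these details are confirmed by the source.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Example = Tuple[str, bool]                  # (llm_output, objective_label)
VerifierSet = FrozenSet[str]                # names of the verifiers in a set
Registry = Dict[str, Callable[[str], bool]]
Critic = Callable[[VerifierSet, List[str], List[str]], str]
Modifier = Callable[[VerifierSet, str], List[VerifierSet]]


def predict(vset: VerifierSet, registry: Registry, output: str) -> bool:
    """Accept an output only if every verifier in the set passes."""
    return all(registry[name](output) for name in vset)


def accuracy(vset: VerifierSet, registry: Registry, dev: List[Example]) -> float:
    """Toy quality score; stands in for the richer acquisition function."""
    return sum(predict(vset, registry, o) == y for o, y in dev) / len(dev)


def is_ancestor(candidate: VerifierSet, node: VerifierSet,
                parents: Dict[VerifierSet, Set[VerifierSet]]) -> bool:
    """True if `candidate` is `node` itself or one of its ancestors."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        if current == candidate:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def search(initial: List[VerifierSet], registry: Registry, dev: List[Example],
           critic: Critic, modifier: Modifier,
           max_iters: int = 10) -> Dict[VerifierSet, Set[VerifierSet]]:
    """Return the DAG as a map from each node to its set of parents."""
    parents: Dict[VerifierSet, Set[VerifierSet]] = {v: set() for v in initial}
    expanded: Set[VerifierSet] = set()
    for _ in range(max_iters):
        frontier = [v for v in parents if v not in expanded]
        if not frontier:
            break  # stopping condition: nothing left to expand
        node = max(frontier, key=lambda v: accuracy(v, registry, dev))
        expanded.add(node)
        # Stage 1: the critic sees the node's concrete mistakes.
        false_pos = [o for o, y in dev if predict(node, registry, o) and not y]
        false_neg = [o for o, y in dev if not predict(node, registry, o) and y]
        critique = critic(node, false_pos, false_neg)
        # Stage 2: the modifier turns the critique into child verifier sets.
        for child in modifier(node, critique):
            if child not in parents:
                parents[child] = {node}
            elif not is_ancestor(child, node, parents):
                parents[child].add(node)  # merge duplicates, keep the graph acyclic
    return parents


# Toy stand-ins for the LLM critic and modifier (purely illustrative).
REGISTRY: Registry = {
    "nonempty": lambda o: bool(o.strip()),
    "has_answer": lambda o: "answer:" in o.lower(),
}


def toy_critic(vset: VerifierSet, false_pos: List[str], false_neg: List[str]) -> str:
    if false_pos:
        return "too_permissive"
    if false_neg:
        return "too_strict"
    return "ok"


def toy_modifier(vset: VerifierSet, critique: str) -> List[VerifierSet]:
    if critique == "too_permissive":
        return [vset | {name} for name in REGISTRY if name not in vset]
    if critique == "too_strict":
        return [vset - {name} for name in vset]
    return []


DEV: List[Example] = [("Answer: 4", True), ("I am not sure", False), ("", False)]
graph = search([frozenset({"nonempty"})], REGISTRY, DEV, toy_critic, toy_modifier)
best = max(graph, key=lambda v: accuracy(v, REGISTRY, DEV))
print(sorted(best))
```

In a real system the two toy functions would be replaced by prompted LLM calls that return new verifier code; the surrounding loop is unaffected by that substitution.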
The optimal verifier set is selected by joint ranking on predictive utility and size.
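One plausible reading of "joint ranking on predictive utility and size" is to keep every candidate whose F1 is within a small tolerance of the best and then prefer the smallest of those. The rule, the tolerance value, and the names below are assumptions for illustration; the source does not spell out the exact ranking.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    f1: float
    num_verifiers: int


def select_best(candidates: List[Candidate], tolerance: float = 0.01) -> Candidate:
    """Among candidates within `tolerance` of the top F1, pick the most compact."""
    best_f1 = max(c.f1 for c in candidates)
    near_best = [c for c in candidates if best_f1 - c.f1 <= tolerance]
    # Smaller sets win; ties on size are broken by higher F1.
    return min(near_best, key=lambda c: (c.num_verifiers, -c.f1))


# Hypothetical scores, not results from the paper.
pool = [
    Candidate("large", f1=0.900, num_verifiers=6),
    Candidate("compact", f1=0.895, num_verifiers=2),
    Candidate("tiny", f1=0.700, num_verifiers=1),
]
chosen = select_best(pool)
```

Here the two-function set is chosen over the six-function set because the F1 gap between them is below the tolerance, while the one-function set is excluded for being too far from the best score.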
Figure 1: AutoPyVerifier system overview—DAG-based search, LLM-synthesized and refined Python verifiers, and objective-aligned selection pipeline.
Figure 2: DAG search mechanics—a node is expanded using LLM-guided critique and modification, supporting iterative, data-driven verifier refinement.
Experimental Results
AutoPyVerifier is evaluated on a suite of benchmarks targeting mathematical reasoning (AIME), code generation (LiveCodeBench), multi-step constrained function calling (ComplexFuncBench), and generalized instruction-following (IFBench), using several major LLMs (GPT-5.4, Gemini-3.1, Claude Haiku 4.5, etc.).
The induced verifier sets are consistently compact (1 to 6 functions) and improve verification quality by up to +55.0 F1 points over initialization (Table 1 in the original paper), with the largest gains on structurally complex benchmarks. Ablation analyses show that the importance of each utility term (exploration, compactness penalization, and feasibility) is sharply task-dependent.
Figure 3: Linear regression ablation—task-specific coefficients for each utility term, indicating nontrivial dependencies between search hyperparameters and verification efficacy.
Cross-model transfer experiments indicate that AutoPyVerifier does not overfit to the idiosyncrasies of a single generator: induced verifier sets generalize to outputs from other LLMs, often maintaining high predictive F1 or even outperforming naive LLM-only scoring baselines. Some sets nevertheless show moderate out-of-distribution (OOD) degradation, which leaves open how invariant the discovered logic really is.
Figure 4: Verification category taxonomy shift—search reduces superficial presence-based checks (e.g., content/entity) in favor of structural, internal consistency, and semantic proxies.
Qualitative analysis of initial versus final verifier sets, across all benchmarks, reveals a systematic shift: post-search sets contain fewer surface or presence checks and more logic that captures deep structure (e.g., answer–explanation agreement, contract satisfaction, execution-based feedback).
Impact on Downstream LLM Inference
Beyond verification, AutoPyVerifier-derived verifier sets are exposed as tools to LLMs at inference time. These externalized, modular correctness checks provide actionable signals that can drive self-improvement or rejection pipelines. Empirically, plugging induced verifiers into system prompts for GPT-4.1 yields accuracy gains of up to +17.0% (on IFBench), suggesting that the verifiers are interpretable and actionable for LLMs equipped for tool use.
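One way such signals could be consumed is a retry loop in which failed checks are fed back to the generator. The sketch below is an assumption about how a self-correction pipeline might be wired, not the paper's integration; `generate` is a hypothetical stand-in for an LLM call, and the single verifier is a toy example.

```python
from typing import Callable, List, Tuple

# Hypothetical tool signature: each verifier returns (passed, feedback message).
FeedbackVerifier = Callable[[str], Tuple[bool, str]]
Generator = Callable[[str, List[str]], str]  # (prompt, feedback) -> output


def verify_all(output: str, verifiers: List[FeedbackVerifier]) -> List[str]:
    """Return the feedback messages of every verifier that failed."""
    feedback = []
    for verifier in verifiers:
        passed, message = verifier(output)
        if not passed:
            feedback.append(message)
    return feedback


def generate_with_verifiers(
    generate: Generator,
    prompt: str,
    verifiers: List[FeedbackVerifier],
    max_attempts: int = 3,
) -> Tuple[str, bool]:
    """Regenerate until all verifiers pass or the attempt budget is spent."""
    feedback: List[str] = []
    output = ""
    for _ in range(max_attempts):
        output = generate(prompt, feedback)
        feedback = verify_all(output, verifiers)
        if not feedback:
            return output, True   # accepted
    return output, False          # rejected after max_attempts


def ends_with_period(output: str) -> Tuple[bool, str]:
    """Toy verifier for an instruction-following constraint."""
    return output.endswith("."), "The response must end with a period."


def fake_generate(prompt: str, feedback: List[str]) -> str:
    """Stand-in for an LLM: fixes its answer once it receives feedback."""
    return "Paris." if feedback else "Paris"


answer, accepted = generate_with_verifiers(
    fake_generate, "Capital of France?", [ends_with_period]
)
```

Returning a message alongside the boolean is a design choice made here so that a rejection carries a reason the generator can act on; a pure rejection pipeline would only need the boolean.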
Theoretical and Practical Implications
This work fundamentally reframes verifier design as an induction and search problem, not a static engineering task. AutoPyVerifier demonstrates the efficacy of leveraging LLMs for latent logic synthesis, but crucially filters and refines such logic through a DAG search that is task-objective centric and model-agnostic.
The compactness and executable nature of the discovered verifier sets facilitate interpretability, transparent debugging, system stitchability, and the possibility of robust OOD generalization. The framework is broadly applicable: reward modeling in RLHF, inference-time reranking and self-correction, LLM agentic tool use, etc. Extensions to multi-agent and multi-objective contexts, and integration with RL for joint optimization of generators and verifiers, are natural.
Conclusion
AutoPyVerifier represents an effective and modular pipeline for automatic induction of deterministic, objective-aligned Python verifiers for LLM outputs. The approach integrates LLM-based logic proposal with DAG-based search and principled utility optimization, resulting in compact, interpretable, and transferable verifier sets. Empirical results validate strong performance gains, cross-model generalization, and practical downstream utility. This work establishes data-driven, compositional verifier synthesis as a key axis for reliable, scalable LLM system construction.