Shrinking the Generation-Verification Gap with Weak Verifiers (2506.18203v1)

Published 22 Jun 2025 in cs.CR and cs.CL

Abstract: Verifiers can improve LLM capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver's combined output scores.

Summary

Shrinking the Generation-Verification Gap with Weak Verifiers

The paper, "Shrinking the Generation-Verification Gap with Weak Verifiers," introduces an innovative framework known as Weaver, which aims to enhance the verification process in LLMs (LMs) by aggregating multiple weak verifiers. In the current landscape, verifiers play a crucial role in scoring and ranking responses, helping to improve the capabilities of LMs. However, there exists a substantial gap between the performance of general-purpose verifiers and oracle verifiers—those with perfect accuracy. Weaver seeks to mitigate this gap by effectively combining numerous weak verifiers through weighted ensembles, utilizing weak supervision to estimate each verifier's accuracy without heavy reliance on labeled data.

The crux of the Weaver framework lies in weighting the scores from multiple verifiers to produce a more reliable verification signal. The paper demonstrates that weighted combinations of verifier scores significantly outperform simple averaging, since averaging implicitly assumes uniform verifier quality, an assumption that fails given the varied accuracy among verifiers. To handle inconsistent verifier output formats and low-quality verifiers, Weaver normalizes scores using dataset statistics and filters out unreliable verifiers before aggregation. This prepares the diverse outputs for combination, allowing the ensemble to exploit complementary strengths while suppressing noise and reducing false positives; a minimal sketch of this pipeline follows.
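This sketch assumes per-verifier accuracy estimates are already available (e.g., from the triplet sketch above); the filtering threshold, min-max normalization, and log-odds weighting are illustrative choices, not Weaver's exact implementation.

```python
import numpy as np

def combine_scores(raw_scores: np.ndarray, accuracies: np.ndarray,
                   min_accuracy: float = 0.55) -> np.ndarray:
    """Normalize, filter, and weight verifier scores into one quality score.

    raw_scores: (n_responses, n_verifiers) heterogeneous outputs
    (probabilities, reward scalars, 0/1 judgments, ...).
    """
    # Dataset-statistic normalization: rescale each verifier's column to
    # [0, 1] so probabilities and raw reward scalars become comparable.
    lo = raw_scores.min(axis=0, keepdims=True)
    hi = raw_scores.max(axis=0, keepdims=True)
    normed = (raw_scores - lo) / np.maximum(hi - lo, 1e-9)

    # Filtering: drop verifiers whose estimated accuracy is near chance,
    # since they add noise rather than signal.
    keep = accuracies > min_accuracy
    normed, acc = normed[:, keep], accuracies[keep]

    # Weighting: log-odds weights give more reliable verifiers more say,
    # unlike an unweighted average that treats every verifier equally.
    acc = np.clip(acc, 0.501, 0.999)                # keep log-odds finite
    w = np.log(acc / (1.0 - acc))
    return normed @ w / w.sum()                     # one score per response
```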

Empirical evaluations show that Weaver considerably narrows the generation-verification gap relative to oracle verification across reasoning and math tasks, achieving an average selection accuracy of 87.7% with Llama 3.3 70B Instruct as the generator. This performance is comparable to frontier models such as OpenAI's o3-mini, yet is attained without extensive finetuning or post-training. The paper also examines how Weaver scales with the number of samples, model size, and computational budget, finding that naive verification strategies quickly plateau, while Weaver continues to yield gains by optimizing verifier weights through weak supervision.
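The repeated-sampling loop itself is simple; the sketch below shows best-of-n selection, where `generate` and `combined_score` are hypothetical stand-ins for the generator LM and a Weaver-style combined verifier score.

```python
def best_of_n(generate, combined_score, prompt: str, n: int = 100) -> str:
    """Test-time repeated sampling: draw n candidate responses and return
    the one the combined verifier scores highest. With a strong combined
    verifier, accuracy keeps improving with n, whereas selection with a
    single weak verifier tends to plateau."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=combined_score)
```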

A notable element of Weaver is its ability to retain performance while substantially reducing computational overhead. Through distillation, Weaver trains a compact 400M cross-encoder model to capture the ensemble's verification strategy, retaining 98.7% of the full ensemble's accuracy while cutting verification compute by up to 99.97%. This efficiency offers a pathway to scalable verification that maintains accuracy without extensive computational resources.
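A sketch of the distillation step, framed as score regression: a small cross-encoder reads each (prompt, response) pair jointly and is trained to predict the ensemble's combined score, so a single forward pass can replace the whole verifier ensemble at inference. The checkpoint name, optimizer settings, and plain MSE objective are assumptions for illustration, not the paper's training recipe.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in encoder; the paper's actual 400M cross-encoder is not reproduced.
NAME = "microsoft/deberta-v3-base"
tok = AutoTokenizer.from_pretrained(NAME)
student = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=1)
opt = torch.optim.AdamW(student.parameters(), lr=2e-5)

def distill_step(prompts, responses, ensemble_scores):
    """One regression step toward the ensemble's combined scores."""
    batch = tok(prompts, responses, padding=True, truncation=True,
                return_tensors="pt")                 # joint pair encoding
    pred = student(**batch).logits.squeeze(-1)       # scalar per pair
    target = torch.as_tensor(ensemble_scores, dtype=pred.dtype)
    loss = F.mse_loss(pred, target)
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()
```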

The implications of this research are manifold. Theoretically, Weaver highlights the potential of label-free methods like weak supervision to enhance verification reliability, and it opens avenues for further exploration of ensemble methods in settings where labeled training data is scarce. Practically, the framework paves the way for improved data filtering, model alignment, and inference-time decision-making, making it a valuable tool for deploying LMs at scale.

Future developments in AI could see Weaver's principles adapted to broader multimodal tasks, involving verification across additional data types like images or audio. Furthermore, the refinement of ensemble techniques could extend to the development of specialized verifier architectures tailored to specific domains, such as mathematical reasoning or code execution. As the landscape of open-source LMs and reward models continues to expand, Weaver positions itself as a cornerstone framework for navigating the complexities of verifier aggregation and enabling more robust LM performance without prohibitive computational expenditures.