Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier (2505.11966v1)

Published 17 May 2025 in cs.AI

Abstract: LLM reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying solution completion points to trigger targeted verification and provide focused solver feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench. Furthermore, on challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full approach outperforms baselines like self-consistency in reasoning accuracy and inference efficiency. Our system offers a scalable and effective solution to enhance LLM reasoning at test time.

Summary

  • The paper presents a novel Solve-Detect-Verify pipeline with FlexiVe that dynamically switches between fast and slow thinking to verify LLM reasoning traces.
  • The approach integrates a Detector module to monitor hesitation cues and a reinforcement-trained verifier to pinpoint errors without excessive computation.
  • Experimental results on math benchmarks demonstrate improved error detection F1 scores and reduced token cost compared to traditional multi-sample verification methods.

The paper presents a novel methodology to improve the trade-off between reasoning accuracy and computational efficiency for LLMs on complex tasks. It introduces two main contributions:

  • A flexible, dynamic generative verifier—FlexiVe—that adaptively balances “fast thinking” and “slow thinking” modes to check the correctness of a solution’s reasoning trace without incurring the full high computational cost every time.
  • The Solve-Detect-Verify pipeline, an inference-time scaling framework that strategically integrates FlexiVe to monitor, verify, and refine candidate solutions generated by the LLM.

The approach starts with a standard LLM (“solver”) that generates a candidate solution along with a step-by-step reasoning trace. Instead of naively relying on lengthy verification procedures, the system first employs a Detector module. This module continuously monitors the ongoing solution generation by watching for hesitation keywords or cues that signal solution completeness. When such indicators are identified, the pipeline temporarily pauses to assess whether the reasoning is complete.
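The detection step described above can be sketched as follows. The cue phrases, window size, and helper name are illustrative assumptions for exposition; the paper's actual Detector uses its own cue set and a log-probability check.

```python
# Sketch of a Detector that scans streamed solver output for cues
# suggesting the solution is complete or the model is hesitating.
# The cue lists and window size are illustrative assumptions.
HESITATION_CUES = ("wait,", "hmm", "let me double-check", "alternatively,")
COMPLETION_CUES = ("therefore, the answer is", "final answer:")

def detect_pause_point(generated_text: str, window: int = 200) -> bool:
    """Return True if the most recent text contains a cue that should
    trigger a pause-and-verify step."""
    tail = generated_text[-window:].lower()
    return any(cue in tail for cue in HESITATION_CUES + COMPLETION_CUES)
```

In the actual pipeline, a positive detection only pauses generation provisionally; a lightweight log-probability assessment then decides whether to invoke the verifier.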

Verification is performed by FlexiVe, a generative verifier that applies a two-tier strategy. First, it uses inexpensive “fast thinking” runs that rapidly assess the entire reasoning trace for errors. If a high consensus is achieved (measured by an agreement ratio surpassing a preset threshold), the fast verification result is accepted. Otherwise, the process escalates to more expensive “slow thinking” runs for deeper, point-by-point error diagnosis. The dynamic allocation of verification budget is key—FlexiVe uses only as much compute as required for reliable error detection.
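The two-tier strategy might be sketched as below. The sample count, agreement threshold, and callable interfaces are assumptions for illustration; each verifier is taken to return the index of the first erroneous step, or -1 if the trace looks correct.

```python
from collections import Counter

def flexive_verify(trace, fast_verify, slow_verify,
                   n_fast=4, agree_threshold=0.75):
    """Two-tier verification sketch: run cheap fast-thinking checks first,
    escalate to slow thinking only when they disagree.
    n_fast and agree_threshold are illustrative, not the paper's values."""
    votes = [fast_verify(trace) for _ in range(n_fast)]
    verdict, count = Counter(votes).most_common(1)[0]
    if count / n_fast >= agree_threshold:
        return verdict            # consensus reached: accept fast result
    return slow_verify(trace)     # escalate to deliberate verification
```

The key property is that the expensive slow-thinking path is only paid for on the fraction of traces where fast verification is uncertain.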

FlexiVe is trained using reinforcement learning via Group Relative Policy Optimization (GRPO). Its training objective encourages precise identification of the index of the first error within the reasoning trace while penalizing unnecessary token generation. The paper provides a detailed description of the reward structure, where correctness and response length are both optimized. With RL training, FlexiVe is shown to generalize well even when trained on relatively few samples compared to baseline models trained via supervised fine-tuning.
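A reward of this shape could be written as the following sketch. The weighting and the linear length penalty are assumptions, not the paper's exact scheme; the convention of -1 for "no error" is also illustrative.

```python
def verification_reward(predicted_first_error: int,
                        true_first_error: int,
                        response_tokens: int,
                        length_penalty: float = 1e-4) -> float:
    """Illustrative GRPO-style reward for the verifier: +1 for
    pinpointing the exact first-error index (with -1 denoting
    'no error'), 0 otherwise, minus a small per-token penalty that
    discourages unnecessarily long verification responses."""
    correctness = 1.0 if predicted_first_error == true_first_error else 0.0
    return correctness - length_penalty * response_tokens
```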

The Solve-Detect-Verify pipeline has three sequential stages:

  1. Solve: The base LLM generates a candidate solution step by step.
  2. Detect: During generation, the output is continuously monitored for specific cues (e.g., hesitation phrases) that may indicate either the presence of an implicit final answer or overthinking. At these points, a lightweight assessment (based on log probabilities) is used to decide whether to pause generation and trigger verification.
  3. Verify and Refine: FlexiVe is engaged to evaluate the candidate trace and pinpoint the error location. If the verifier finds an error, its feedback guides the solver in generating an alternative, refined solution. In some cases, multiple candidate solutions are generated (a best-of-N strategy) to combine verification feedback with majority voting.
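The Verify-and-Refine stage with a best-of-N fallback could be sketched as follows. Here `solver`, `verifier`, and `refine` are hypothetical callables standing in for the pipeline components; the verifier returns -1 when the trace passes, else the first-error index.

```python
from collections import Counter

def solve_with_verification(problem, solver, verifier, refine, n=4):
    """Sketch of Verify-and-Refine with a best-of-N fallback.
    refine(problem, trace, error_index) regenerates a (trace, answer)
    pair using the verifier's error feedback; all three callables
    are illustrative stand-ins for the actual pipeline components."""
    trace, answer = solver(problem)
    first_error = verifier(trace)
    if first_error == -1:
        return answer                     # verified on the first pass
    # Otherwise generate refined candidates and take a majority vote.
    candidates = [refine(problem, trace, first_error)[1] for _ in range(n)]
    return Counter(candidates).most_common(1)[0][0]
```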

The paper evaluates FlexiVe and the full pipeline on several challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, CNMO, and the ProcessBench datasets, including GSM8K, MATH, OlympiadBench, and OmniMATH). Experimental results show that FlexiVe achieves higher error-detection F1 scores while drastically reducing token generation cost compared to naive multi-sample or stepwise verification baselines. In particular, the "slow thinking" (deliberative) mode of FlexiVe, while more compute-intensive, yields better accuracy than comparable baselines augmented with code execution. Furthermore, the integrated Solve-Detect-Verify framework demonstrates substantial gains in both reasoning accuracy and inference efficiency versus standard approaches like self-consistency.

The paper also discusses trade-offs observed during experiments. While the "NoThinking" variant of FlexiVe uses far fewer tokens, it typically yields lower accuracy, especially on more complex problems. The adaptive strategy of starting with fast verification and escalating only when necessary helps navigate this trade-off. In addition, the scaling experiments show that combining increased solver compute (via generating multiple candidate solutions) with FlexiVe's verification leads to substantial performance improvements.

Finally, the authors acknowledge limitations in terms of verification generalization across domains beyond mathematical reasoning (such as program synthesis or commonsense question answering) and the need for further optimization of hyperparameters that control the adaptive verification budget. They also suggest that future work should investigate optimized implementations (e.g., using advanced inference engines) to mitigate the overhead from dynamic mode switching.

Overall, the paper provides a comprehensive framework for dynamically adapting verification at inference time. By flexibly balancing computational resources, it offers a scalable and effective approach to improve the reliability of LLM reasoning in practical, resource-constrained deployment settings.
