Generative Verifiers: Reward Modeling as Next-Token Prediction
Introduction
The paper "Generative Verifiers: Reward Modeling as Next-Token Prediction" by Zhang, Hosseini, Bansal, Kazemi, Kumar, and Agarwal presents a novel framework for reward modeling in the context of LLMs. The primary contribution is the introduction of Generative Verifiers (GVs) which pivot the verification task from traditional discriminative approaches to utilizing next-token prediction. This approach leverages the inherent generation capabilities of LLMs, offering several key advantages such as instruction tuning, chain-of-thought reasoning, and optimized performance via majority voting.
Traditional Approaches and Limitations
Historically, reward models or verifiers have been trained as discriminative classifiers that assign scores to candidate solutions generated by LLMs. While effective to an extent, this formulation does not capitalize on the generative nature of pretrained LLMs: discriminative verifiers cannot exploit instruction tuning or produce chain-of-thought rationales, which leads to suboptimal performance.
Generative Verifiers: Methodology
The authors propose training GVs with the standard next-token prediction objective. Under this formulation, the verifier produces a token (e.g., 'Yes' or 'No') indicating whether a candidate solution in a given context is correct, and the probability assigned to the 'Yes' token serves as the correctness score. The GV framework incorporates several components:
- Direct Verifier: Scores a solution directly from the next-token probability of a correctness token.
- Unified Generation and Verification: GVs are trained jointly on generating solutions and verifying them, facilitating knowledge transfer between these tasks.
- Chain-of-Thought (CoT) Verifiers: Generate a detailed CoT rationale reasoning about the solution before emitting the final verification decision.
- Majority Voting: Multiple CoT rationales are sampled and their correctness scores are averaged, trading additional inference-time computation for higher verification accuracy (a minimal scoring sketch follows this list).
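To make the mechanics concrete, the Python sketch below shows how a generative verifier could score a candidate solution: the direct variant reads the probability of a 'Yes' token from the next-token distribution, and the CoT variant samples several verification rationales and averages the resulting scores (majority voting). The model checkpoint, prompt templates, and function names here are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch of generative-verifier scoring; the checkpoint name and
# prompt templates are assumptions rather than the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-9b-it"  # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Token id used as the correctness token; the exact string ("Yes" vs " Yes")
# depends on the tokenizer.
YES_ID = tokenizer.encode("Yes", add_special_tokens=False)[0]


def direct_verify(question: str, solution: str) -> float:
    """Direct verifier: correctness score = p('Yes') for the next token."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        "Is the solution correct? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    return torch.softmax(logits, dim=-1)[YES_ID].item()


def cot_verify(question: str, solution: str, num_votes: int = 8) -> float:
    """CoT verifier with majority voting: sample several verification
    rationales and average p('Yes') across them."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        "Let's verify the solution step by step.\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    scores = []
    for _ in range(num_votes):
        # Sample one verification rationale.
        out = model.generate(
            **inputs, do_sample=True, temperature=0.7, max_new_tokens=256
        )
        rationale = tokenizer.decode(out[0], skip_special_tokens=True)
        # Read the final Yes/No verdict after the sampled rationale.
        verdict = tokenizer(
            rationale + "\nIs the solution correct? Answer:", return_tensors="pt"
        )
        with torch.no_grad():
            logits = model(**verdict).logits[0, -1]
        scores.append(torch.softmax(logits, dim=-1)[YES_ID].item())
    return sum(scores) / len(scores)  # averaged correctness score
```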
Experimental Results
The experimental results span several reasoning tasks including algorithmic string manipulation (Last Letter Concatenation and Word Sorting) and grade-school math problems (GSM8K). On these tasks, GVs, particularly those utilizing CoT reasoning with majority voting, significantly outperform conventional discriminative verifiers.
The results demonstrate:
- Best-of-N Performance: GVs showed a substantial improvement in Best-of-N performance across all tasks when compared to discriminative verifiers. On GSM8K, a Gemma-9B CoT-based GV demonstrated a clear improvement over the discriminative baseline, and on the algorithmic tasks GVs nearly matched oracle-verifier performance (a Best-of-N selection sketch follows this list).
- Unified Training Benefits: Training jointly on solution generation and verification improved the overall effectiveness of the GVs, yielding better verification and generation performance via positive knowledge transfer.
- Scalability: Both model capacity and dataset size positively impact GV performance. Using larger models and more extensive training datasets led to consistent performance enhancements.
- Synthetic Rationales: For GSM8K, training CoT verifiers on LLM-generated rationales (obtained via reference-guided grading) proved effective. Despite being noisier than human-written rationales, these synthetic rationales were sufficient to train GVs that outperform discriminative verifiers.
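As a companion sketch, Best-of-N selection with a verifier reduces to scoring each sampled candidate and keeping the highest-scoring one. The `generate_solutions` callable and the default N below are hypothetical placeholders; `cot_verify` from the earlier sketch could serve as the `verify` argument.

```python
from typing import Callable, List


def best_of_n(
    question: str,
    generate_solutions: Callable[[str, int], List[str]],  # hypothetical sampler
    verify: Callable[[str, str], float],  # e.g. cot_verify from the sketch above
    n: int = 32,
) -> str:
    """Sample n candidate solutions and return the one with the highest
    verifier score (Best-of-N selection)."""
    candidates = generate_solutions(question, n)
    scores = [verify(question, sol) for sol in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]
```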
Implications and Future Directions
The shift towards using next-token prediction for verification opens multiple avenues for future research. This approach not only enhances current verification models but also paves the way for integrating generative verification into broader AI systems. Here are key implications and directions:
- Broad Task Applicability: Extending GVs to other domains such as code verification, alignment tasks, and various open-ended response evaluations.
- Enhanced Self-Improvement Algorithms: GVs can be integrated into existing LLM self-improvement frameworks, leveraging their unified approach for better performance in iterative tasks.
- Tool Integration: Applying advanced techniques like retrieval-augmented generation, multi-stage prompting, and external tool use within the GV framework to further enhance verification robustness.
By framing reward modeling as next-token prediction, the authors present a compelling case for rethinking how LLMs are utilized for verification tasks. The demonstrated improvements in accuracy and scalability highlight the potential of GVs to significantly impact the domain of automated reasoning and beyond.