Generative Verifiers: Reward Modeling as Next-Token Prediction
Introduction
The paper "Generative Verifiers: Reward Modeling as Next-Token Prediction" by Zhang, Hosseini, Bansal, Kazemi, Kumar, and Agarwal presents a novel framework for reward modeling in the context of LLMs. The primary contribution is the introduction of Generative Verifiers (GVs) which pivot the verification task from traditional discriminative approaches to utilizing next-token prediction. This approach leverages the inherent generation capabilities of LLMs, offering several key advantages such as instruction tuning, chain-of-thought reasoning, and optimized performance via majority voting.
Traditional Approaches and Limitations
Historically, reward models or verifiers have been trained as discriminative classifiers that assign scores to candidate solutions generated by LLMs. While effective to an extent, this formulation does not capitalize on the generative nature of pretrained LLMs: discriminative verifiers cannot exploit instruction tuning or produce chain-of-thought rationales, which leads to suboptimal performance.
Generative Verifiers: Methodology
The authors propose training GVs with the standard next-token prediction objective. Under this formulation, the verifier produces a token (e.g., 'Yes' or 'No') indicating whether a candidate solution in a given context is correct, and the probability assigned to the 'Yes' token serves as the correctness score. The GV framework incorporates several components:
- Direct Verifier: Scores a solution directly from the next-token probability of a correctness token.
- Unified Generation and Verification: GVs are trained jointly on generating solutions and verifying them, facilitating knowledge transfer between these tasks.
- Chain-of-Thought (CoT) Verifiers: Generate a detailed CoT rationale reasoning about the solution before emitting the final verification decision.
- Majority Voting: Multiple CoT rationales are sampled and their correctness scores are averaged, trading additional inference-time computation for higher verification accuracy (a minimal scoring sketch follows this list).
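To make the mechanics concrete, the Python sketch below shows how a generative verifier could score a candidate solution: the direct variant reads the probability of a 'Yes' token from the next-token distribution, and the CoT variant samples several verification rationales and averages the resulting scores (majority voting). The model checkpoint, prompt templates, and function names here are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch of generative-verifier scoring; the checkpoint name and
# prompt templates are assumptions rather than the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-9b-it"  # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Token id used as the correctness token; the exact string ("Yes" vs " Yes")
# depends on the tokenizer.
YES_ID = tokenizer.encode("Yes", add_special_tokens=False)[0]


def direct_verify(question: str, solution: str) -> float:
    """Direct verifier: correctness score = p('Yes') for the next token."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        "Is the solution correct? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    return torch.softmax(logits, dim=-1)[YES_ID].item()


def cot_verify(question: str, solution: str, num_votes: int = 8) -> float:
    """CoT verifier with majority voting: sample several verification
    rationales and average p('Yes') across them."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        "Let's verify the solution step by step.\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    scores = []
    for _ in range(num_votes):
        # Sample one verification rationale.
        out = model.generate(
            **inputs, do_sample=True, temperature=0.7, max_new_tokens=256
        )
        rationale = tokenizer.decode(out[0], skip_special_tokens=True)
        # Read the final Yes/No verdict after the sampled rationale.
        verdict = tokenizer(
            rationale + "\nIs the solution correct? Answer:", return_tensors="pt"
        )
        with torch.no_grad():
            logits = model(**verdict).logits[0, -1]
        scores.append(torch.softmax(logits, dim=-1)[YES_ID].item())
    return sum(scores) / len(scores)  # averaged correctness score
```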
Experimental Results
The experimental results span several reasoning tasks including algorithmic string manipulation (Last Letter Concatenation and Word Sorting) and grade-school math problems (GSM8K). On these tasks, GVs, particularly those utilizing CoT reasoning with majority voting, significantly outperform conventional discriminative verifiers.
The results demonstrate:
- Best-of-N Performance: GVs showed a substantial improvement in Best-of-N performance across all tasks when compared to discriminative verifiers. On GSM8K, a Gemma-9B CoT-based GV demonstrated a clear improvement over the discriminative baseline, and on the algorithmic tasks GVs nearly matched oracle-verifier performance (a Best-of-N selection sketch follows this list).
- Unified Training Benefits: Training jointly on solution generation and verification improved the overall effectiveness of the GVs, yielding better verification and generation performance via positive knowledge transfer.
- Scalability: Both model capacity and dataset size positively impact GV performance. Using larger models and more extensive training datasets led to consistent performance enhancements.
- Synthetic Rationales: For GSM8K, training CoT verifiers on LLM-generated rationales (obtained via reference-guided grading) proved effective. Despite being noisier than human-written rationales, these synthetic rationales were sufficient to train GVs that outperform discriminative verifiers.
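As a companion sketch, Best-of-N selection with a verifier reduces to scoring each sampled candidate and keeping the highest-scoring one. The `generate_solutions` callable and the default N below are hypothetical placeholders; `cot_verify` from the earlier sketch could serve as the `verify` argument.

```python
from typing import Callable, List


def best_of_n(
    question: str,
    generate_solutions: Callable[[str, int], List[str]],  # hypothetical sampler
    verify: Callable[[str, str], float],  # e.g. cot_verify from the sketch above
    n: int = 32,
) -> str:
    """Sample n candidate solutions and return the one with the highest
    verifier score (Best-of-N selection)."""
    candidates = generate_solutions(question, n)
    scores = [verify(question, sol) for sol in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]
```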
Implications and Future Directions
The shift towards using next-token prediction for verification opens multiple avenues for future research. This approach not only enhances current verification models but also paves the way for integrating generative verification into broader AI systems. Here are key implications and directions:
- Broad Task Applicability: Extending GVs to other domains such as code verification, alignment tasks, and various open-ended response evaluations.
- Enhanced Self-Improvement Algorithms: GVs can be integrated into existing LLM self-improvement frameworks, leveraging their unified approach for better performance in iterative tasks.
- Tool Integration: Applying advanced techniques like retrieval-augmented generation, multi-stage prompting, and external tool use within the GV framework to further enhance verification robustness.
By framing reward modeling as next-token prediction, the authors present a compelling case for rethinking how LLMs are utilized for verification tasks. The demonstrated improvements in accuracy and scalability highlight the potential of GVs to significantly impact the domain of automated reasoning and beyond.