PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament (2501.13007v2)

Published 22 Jan 2025 in cs.CL

Abstract: Best-of-N (BoN) sampling, a common strategy for test-time scaling of LLMs, relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Judge Reward Model (PairJudge RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, given one math problem, PairJudge RM judges two candidate solutions' correctness simultaneously with chain-of-thought reasoning. This approach eliminates the need for scoring and enables cross-validation of solutions through parallel judgment. In the knockout tournament, PairJudge RM conducts pairwise judgment between candidate solutions and iteratively eliminates the incorrect ones. We construct PairJudge-432K, a large-scale dataset of 432K pairwise judgments derived from NuminaMath and annotated using gemini-1.5-flash, and train the PairJudge RM via supervised fine-tuning. Experiments on MATH-500 and OlympiadBench demonstrate significant improvements over baseline reward models, with a 40% to 60% relative improvement on the top 50% most challenging problems.

Summary

  • The paper introduces a novel Pairwise Reward Model that evaluates candidate solutions in pairs to overcome inconsistent absolute scoring.
  • It employs a knockout tournament for Best-of-N sampling, boosting performance by 40% to 60% on challenging math tasks.
  • The approach offers a more reliable selection mechanism with potential applications in diverse AI reasoning domains beyond math.

Overview of Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament

The paper presents a novel approach to improving how candidate solutions generated by LLMs are selected at test time. It introduces a Pairwise Reward Model (Pairwise RM) used in conjunction with a knockout tournament strategy for Best-of-N (BoN) sampling. The method aims to improve on traditional reward models, which often assign inconsistent absolute scores to candidate solutions.

The authors identify challenges associated with conventional outcome and process reward models, which typically offer inconsistent and arbitrary scoring. By contrast, the Pairwise RM assesses pairs of candidate solutions on a relative scale rather than assigning absolute scores. This methodology allows for the simultaneous assessment of the correctness of two solutions, avoiding the pitfalls of erroneous absolute scoring.
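The knockout procedure described above can be sketched in a few lines. Here `judge_pair` is a hypothetical stand-in for the trained Pairwise RM (a real implementation would prompt the fine-tuned model with the problem and both candidates and parse its chain-of-thought verdict); the surrounding tournament loop follows the single-elimination scheme the paper describes:

```python
import random

def judge_pair(problem, sol_a, sol_b):
    """Hypothetical stand-in for the Pairwise RM's judgment call.
    Illustrative only: prefers the longer candidate so the sketch runs."""
    return sol_a if len(sol_a) >= len(sol_b) else sol_b

def knockout(problem, candidates, seed=0):
    """Run single-elimination rounds until one candidate survives."""
    rng = random.Random(seed)
    pool = list(candidates)
    while len(pool) > 1:
        rng.shuffle(pool)           # random bracket each round
        winners = []
        if len(pool) % 2 == 1:      # odd candidate out gets a bye
            winners.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            winners.append(judge_pair(problem, a, b))
        pool = winners
    return pool[0]
```

Because each round halves the pool, selecting from N candidates costs roughly N-1 pairwise judgments rather than N independent absolute scores.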

Construction and Dataset

The authors constructed a comprehensive dataset, Pairwise-443K, comprising 443,000 pairwise comparisons derived from the NuminaMath dataset and annotated using gemini-1.5-flash. This dataset is used to fine-tune the Pairwise RM via supervised learning, training the model to reliably judge the relative correctness of two solutions to the same math problem and thereby better differentiate correct from incorrect solutions.
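A training example for this kind of supervised fine-tuning pairs a judgment prompt with a target verdict. The prompt and response templates below are illustrative assumptions, not the paper's exact formats:

```python
def build_judgment_example(problem, sol_a, sol_b, a_correct, b_correct):
    """Turn one problem and two candidate solutions into a single SFT
    example. Templates are hypothetical stand-ins for the paper's."""
    prompt = (
        "Problem:\n" + problem + "\n\n"
        "Solution A:\n" + sol_a + "\n\n"
        "Solution B:\n" + sol_b + "\n\n"
        "Judge the correctness of both solutions step by step."
    )
    response = (
        f"Solution A is {'correct' if a_correct else 'incorrect'}. "
        f"Solution B is {'correct' if b_correct else 'incorrect'}."
    )
    return {"prompt": prompt, "response": response}
```

Ground-truth labels for each candidate come from the source dataset's answers, so the annotator model only needs to supply the reasoning traces.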

Experimentation and Results

Extensive experiments on the MATH-500 dataset and OlympiadBench demonstrate the effectiveness of the Pairwise RM against traditional reward models. The Pairwise RM achieves a substantial performance gain, with a 40% to 60% relative improvement on the top 50% most challenging problems in the MATH-500 dataset. This highlights the model's capacity to discern correct solutions even on difficult math reasoning tasks.

The Pairwise RM is also shown to outperform the Critic Model, another recent approach that critiques candidate solutions individually. The Pairwise RM provides more consistent and reliable assessments by evaluating solutions in pairs, thus handling the evaluation of math problems more robustly.

Implications and Future Directions

The primary contribution of the paper lies in establishing a system that avoids the common pitfalls of arbitrary scoring by traditional reward models, thus providing a more reliable selection mechanism under BoN sampling. This method holds significant potential for generalization across other reasoning domains beyond mathematical tasks.

The paper anticipates that combining the knockout tournament strategy with the Pairwise RM could yield similar improvements on other types of reasoning tasks where model evaluation plays a crucial role, potentially extending to domains such as code generation and scientific problem-solving.

Further research on optimizing Pairwise RM's inference time and exploring alternative tournament designs is suggested to enhance the efficiency and effectiveness of the model, making it more suitable for a broader range of applications. Additionally, examining alternatives to the conventional knockout approach, such as double-elimination or Swiss-style systems, could offer interesting avenues for future exploration.
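As a rough illustration of one such alternative (a simplification of my own, not from the paper): a Swiss-style round pairs candidates with equal running scores instead of eliminating losers outright, which lets every candidate be judged multiple times. The pairing step might look like:

```python
from collections import defaultdict

def swiss_round(pool, scores):
    """Pair candidates that have equal scores for the next round.
    Any candidate left unpaired within its score group gets a bye.
    Returns (pairs, byes). A sketch of Swiss-style pairing, not the
    paper's method."""
    by_score = defaultdict(list)
    for cand in pool:
        by_score[scores[cand]].append(cand)
    pairs, byes = [], []
    for group in by_score.values():
        while len(group) >= 2:
            pairs.append((group.pop(), group.pop()))
        byes.extend(group)
    return pairs, byes
```

Unlike single elimination, a Swiss design trades more pairwise judgments for robustness to an occasional wrong verdict, since one lost comparison no longer removes a candidate from contention.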

In summary, this work provides significant insights into improving candidate solution selection for LLMs by introducing a new system that contrasts with traditional models, showcasing enhanced interpretability, consistency, and performance.
