- The paper introduces a pointwise generative reward modeling approach augmented by Self-Principled Critique Tuning (SPCT) to enhance model flexibility and inference-time scaling.
- It employs parallel sampling with voting and a meta reward model to aggregate diverse critiques, significantly boosting the accuracy of response ranking.
- Empirical results show that a smaller SPCT-GRM-27B model, using inference-time scaling, can outperform much larger baselines, highlighting compute efficiency as key to performance improvement.
The challenge of developing accurate and broadly applicable reward models (RMs) for LLMs is central to advancing reinforcement learning from human feedback (RLHF) and related alignment techniques. Traditional RMs often struggle with flexibility across diverse input formats or lack mechanisms to effectively leverage increased computational resources at inference time. The research presented in "Inference-Time Scaling for Generalist Reward Modeling" (2504.02495) introduces methods to enhance generalist reward modeling by exploiting inference-time compute, proposing that effective learning strategies can unlock significant performance gains without solely relying on larger model sizes (training-time scaling).
Pointwise Generative Reward Modeling (GRM) for Flexibility and Scalability
To address the limitations of existing RM paradigms, the work adopts a pointwise Generative Reward Modeling (GRM) approach. Unlike pairwise RMs, which are typically restricted to comparing two responses, or scalar RMs, which produce a single deterministic score limiting inference-time enhancement, pointwise GRMs offer greater flexibility and inherent potential for scaling.
The core mechanism involves generating a textual critique (C) conditioned on the input query (x) and a set of candidate responses ({y_1, ..., y_n}). From this generated critique, numerical scores (S_i) are extracted for each response y_i. This generative formulation allows the model to handle single, paired, or multiple responses within a unified framework. Crucially, the generative nature permits sampling multiple distinct critiques for the same input, forming the basis for inference-time scaling strategies. The quality of the generated critique, and thus the extracted scores, can be guided by principles (P) relevant to the query and response quality, which can be provided as input or, more dynamically, generated by the model itself.
The format for the GRM output typically follows a structured pattern like:
```
Principles: <Generated or Provided Principles P>
Critique: <Generated Critique C comparing responses based on P>
Scores:
y_1: S_1
y_2: S_2
...
y_n: S_n
```

where S_i are numerical values extracted from the critique text.
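As a concrete illustration, a minimal parsing sketch is shown below, assuming the generation ends with a "Scores:" section in exactly the format above; the function name and regular expression are illustrative, not from the paper.

```python
import re
from typing import Dict

def extract_scores(critique_text: str) -> Dict[str, float]:
    """Parse per-response scores from a generation that ends in a 'Scores:'
    section with one 'y_<i>: <number>' line per response."""
    scores: Dict[str, float] = {}
    # Only look at text after the last 'Scores:' marker so numbers mentioned
    # inside the critique body are not picked up by mistake.
    _, _, tail = critique_text.rpartition("Scores:")
    for m in re.finditer(r"y_(\d+)\s*:\s*(-?\d+(?:\.\d+)?)", tail):
        scores[f"y_{m.group(1)}"] = float(m.group(2))
    return scores

# e.g. extract_scores("Principles: ...\nCritique: ...\nScores:\ny_1: 7\ny_2: 9")
# -> {'y_1': 7.0, 'y_2': 9.0}
```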
Self-Principled Critique Tuning (SPCT) for Enabling Scalable Behavior
A key contribution is the Self-Principled Critique Tuning (SPCT) learning method, designed specifically to train pointwise GRMs to exhibit behaviors that benefit from inference-time scaling. SPCT aims to make the generation of principles (P) an adaptive part of the reward modeling process, rather than relying on static, predefined principles. This allows the RM to tailor its evaluation criteria dynamically based on the specific input context.
SPCT employs a two-phase training procedure:
- Rejective Fine-Tuning (RFT): This initial phase adapts a pre-trained LLM to the task of generating critiques and scores in the desired format. It uses supervised fine-tuning on trajectories potentially generated by an initial GRM. Rejection sampling is applied based on the correctness of the extracted scores against ground-truth preference data (e.g., ensuring the score for the preferred response is higher). This phase establishes the basic capability of generating structured critique-score outputs. Both hinted (using ground-truth information during sampling) and non-hinted rejection sampling can be employed, with non-hinted sampling found to be more effective empirically for fostering scalable behaviors.
- Rule-Based Online RL: Following RFT, the model is further fine-tuned with reinforcement learning, specifically GRPO (Group Relative Policy Optimization), using rule-based rewards. In this phase, the model generates both the principles (P) and the critique (C) for a given input (x, {y_i}). The reward signal is determined by whether the extracted scores (S_i) correctly rank the responses according to ground-truth preferences: for example, if y_j is preferred over y_k in the ground truth, the model receives a positive reward only if S_j > S_k (a minimal sketch of this rule follows the list). This online RL phase directly optimizes the model's ability to generate principles and critiques that lead to accurate reward signals, reinforcing evaluation logic that is consistent and useful.
This SPCT process results in DeepSeek-GRM models when applied to foundation models like Gemma-2 or DeepSeek variants. The SPCT training explicitly encourages the model to explore diverse reasoning paths (via principle generation) for evaluating responses, making it amenable to benefiting from multiple inference passes.
Inference-Time Scaling Mechanisms
To capitalize on the scalable behaviors instilled by SPCT, the paper proposes techniques to utilize increased compute during inference:
- Parallel Sampling and Voting: Instead of generating a single critique via greedy decoding, k independent samples of (principle, critique, score) tuples are generated in parallel for the same input query and responses. To enhance diversity, the order of responses {y_i} can be shuffled for each sample generation. The final score for each response y_i is then computed by aggregating the scores S_i^{(j)} obtained from each of the k samples (e.g., by summation: S_i^{final} = Σ_{j=1}^{k} S_i^{(j)}). This approach allows the model to leverage different generated principles and evaluation perspectives, effectively broadening the reward signal computation and improving accuracy through ensemble-like effects. The computational cost scales linearly with k (see the voting sketch after this list).
- Meta Reward Modeling (Meta RM) Guided Voting: Generating k samples and aggregating them can be computationally expensive and may include low-quality or outlier critiques. To mitigate this, a meta RM is introduced: a separate, typically simpler, pointwise scalar RM trained to predict the quality or correctness of a full principle-critique generation produced by the main SPCT-GRM.
- Training Meta RM: The meta RM is trained on data consisting of (input, responses, generated principle-critique) tuples from the SPCT-GRM, labeled with whether the extracted scores correctly reflect ground-truth preferences.
- Inference with Meta RM: During inference, the SPCT-GRM generates k samples, and the meta RM scores each of them by predicted quality. Only the top k_{meta} samples (where k_{meta} ≤ k) according to the meta RM scores are selected, and the final response scores are aggregated (voted) using only these filtered k_{meta} samples. This guided voting improves the efficiency and robustness of inference-time scaling by focusing the aggregation on the most promising critique samples, potentially achieving better performance than simple voting with fewer total samples (k_{meta} vs. k).
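Both strategies reduce to a simple aggregation loop. The sketch below assumes hypothetical `sample_generation` and `meta_rm_score` callables as stand-ins for the SPCT-GRM and the meta RM; it illustrates the voting logic under those assumptions rather than reproducing the paper's implementation.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins (not from the paper):
#   sample_generation(query, responses) -> (critique_text, scores) for ONE sampled
#       principle+critique, with scores keyed 'y_1'..'y_n' in the order given.
#   meta_rm_score(query, responses, critique_text) -> predicted quality of that generation.

def vote_scores(
    query: str,
    responses: List[str],
    sample_generation: Callable[[str, List[str]], Tuple[str, Dict[str, float]]],
    k: int = 8,
    meta_rm_score: Optional[Callable[[str, List[str], str], float]] = None,
    k_meta: Optional[int] = None,
) -> Dict[str, float]:
    """Parallel-sampling voting over k critiques, optionally filtered by a meta RM."""
    samples: List[Tuple[str, Dict[str, float]]] = []
    for _ in range(k):                               # conceptually parallel; sequential here for clarity
        order = list(range(len(responses)))
        random.shuffle(order)                        # shuffle response order to diversify critiques
        shuffled = [responses[i] for i in order]
        text, raw = sample_generation(query, shuffled)
        # Map scores back to the original response indices.
        scores = {f"y_{order[j] + 1}": raw[f"y_{j + 1}"] for j in range(len(responses))}
        samples.append((text, scores))

    if meta_rm_score is not None and k_meta is not None:
        # Meta-RM-guided voting: keep only the k_meta highest-quality generations.
        samples.sort(key=lambda ts: meta_rm_score(query, responses, ts[0]), reverse=True)
        samples = samples[:k_meta]

    totals: Dict[str, float] = defaultdict(float)
    for _, scores in samples:
        for rid, s in scores.items():
            totals[rid] += s                         # S_i^{final} = sum over selected samples of S_i^{(j)}
    return dict(totals)
```

Calling `vote_scores(query, responses, sample_generation, k=8)` gives plain voting; passing `meta_rm_score` and `k_meta` switches to meta-RM-guided voting over the same k generations.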
Empirical Validation and DeepSeek-GRM
The effectiveness of SPCT and inference-time scaling is demonstrated through extensive experiments using DeepSeek-GRM models, particularly SPCT-GRM-27B (based on Gemma-2-27B).
- Performance: SPCT-GRM significantly outperforms baseline methods (LLM-as-a-Judge, BT-RM, CLoud-RM) and existing RMs on benchmarks like Reward Bench, PPE, and RMB.
- Scalability vs. Model Size: A notable finding is that inference-time scaling applied to SPCT-GRM-27B allows it to achieve performance comparable to or exceeding much larger models (up to DeepSeek-V3 671B) evaluated with standard greedy decoding. This strongly suggests that investing compute at inference time can be more effective than solely increasing model size during training for improving reward modeling accuracy. For example, SPCT-GRM-27B with k = 32 samples significantly outperforms a 671B-parameter baseline using greedy decoding, and Meta RM guided voting further enhances this effect, achieving high accuracy with fewer effective samples (k_{meta}).
- Reduced Bias: SPCT-GRM demonstrated more balanced performance across different domains compared to scalar or semi-scalar RMs, which often showed strong performance on verifiable tasks (like math) but struggled with general instructions or creative writing.
- Ablation Studies: These confirmed the positive contributions of key components of SPCT, including adaptive principle generation, the non-hinted rejective sampling strategy in RFT, the online RL phase, and the inclusion of general instruction data during training.
Implementation Considerations and Trade-offs
Implementing inference-time scaling for GRMs involves several practical considerations:
- Computational Cost: The primary trade-off is between performance gain and increased inference cost (latency and compute). Parallel sampling multiplies the cost by k, while Meta RM guided voting adds the cost of running the Meta RM on top of generating the k samples, partially offset by voting over a smaller k_{meta}. The choice of k or k_{meta} depends on the available compute budget and desired accuracy.
- Meta RM Training: Requires curating a labeled dataset and training a separate model, adding complexity to the overall pipeline. The quality of the Meta RM directly impacts the effectiveness of guided voting (a data-curation sketch follows this list).
- Sampling Parameters: Tuning sampling parameters like temperature for the GRM is crucial to ensure diversity and quality in the generated critiques.
- Score Extraction: Robust methods are needed to parse the generated text and reliably extract numerical scores for each response. Errors in extraction can negate the benefits of improved critique generation.
- Training Data: SPCT requires preference datasets (pairwise or listwise rankings) to train both the RFT and online RL phases effectively. The quality and diversity of this data are critical.
- System Complexity: Implementing parallel sampling, voting logic, and potentially a meta RM adds significant complexity compared to deploying a standard scalar RM with greedy decoding.
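As a rough illustration of the Meta RM data curation described above (and in the "Training Meta RM" bullet earlier), the sketch below labels each SPCT-GRM generation by whether its extracted scores agree with the ground-truth preference. The `sample_and_score` callable, the dataset schema, and `samples_per_item` are assumptions for the sake of the example, not the paper's pipeline.

```python
from typing import Callable, Dict, List, Tuple

# One labeled example: (query, responses, full principle+critique text, binary label).
MetaExample = Tuple[str, List[str], str, int]

def build_meta_rm_dataset(
    preference_data: List[Tuple[str, List[str], List[str]]],  # (query, responses, response ids best-first)
    sample_and_score: Callable[[str, List[str]], Tuple[str, Dict[str, float]]],
    samples_per_item: int = 4,
) -> List[MetaExample]:
    """Label each sampled GRM generation 1 if its extracted scores rank the
    responses consistently with the ground-truth preference, else 0."""
    dataset: List[MetaExample] = []
    for query, responses, pref_order in preference_data:
        for _ in range(samples_per_item):
            critique_text, scores = sample_and_score(query, responses)
            ranked_ok = all(
                scores.get(a, float("-inf")) > scores.get(b, float("-inf"))
                for a, b in zip(pref_order, pref_order[1:])
            )
            dataset.append((query, responses, critique_text, int(ranked_ok)))
    return dataset
```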
Conclusion
The research on inference-time scaling for generalist reward modeling demonstrates a viable path towards more accurate and robust reward signals for LLMs. By employing pointwise GRMs trained with SPCT, models become capable of generating adaptive principles and critiques, enabling effective performance improvements through increased inference-time compute via parallel sampling and meta RM guided voting. The empirical results strongly support the claim that inference-time scaling can be a highly effective alternative or complement to training-time scaling, allowing smaller models to achieve state-of-the-art RM performance. Implementing these techniques requires careful consideration of computational trade-offs and system complexity but offers a promising direction for developing advanced generalist reward systems.