- The paper demonstrates that scaling laws apply to preference modeling by showing power-law improvements in adversarial and objective tasks with larger models.
- The methodology trains reward models with a Bradley-Terry loss on 15M StackExchange preference pairs and observes a "moment of epiphany" around 12.6M training samples, marked by a sudden loss drop and gradient spike.
- WorldPM serves as a strong initialization for RLHF fine-tuning, significantly boosting alignment performance and generalization across benchmarks.
The paper "WorldPM: Scaling Human Preference Modeling" (2505.10527) investigates whether scaling laws observed in LLMing also apply to preference modeling. The authors propose "World Preference Modeling" (WorldPM) as a paradigm to capture a unified representation of human preferences by leveraging large-scale, naturally occurring preference data from public forums.
Core Idea and Motivation:
The research is motivated by the success of scaling laws in LLMs, where performance improves predictably with increased model size, data size, and compute. The authors hypothesize that similar scaling properties exist for preference modeling. A key challenge is the scarcity and high cost of manually annotated preference data. To address this, they propose using large-scale preference signals available in public forums like StackExchange, Reddit, and Quora, which aggregate user opinions through voting mechanisms.
Data Collection and Preparation:
Preference data is collected from these forums by sampling pairs of responses to a given post (prompt) based on their net votes (upvotes minus downvotes); the response with the higher net votes is treated as the preferred response. After analyzing data quality across sources, StackExchange is selected as the primary source because of its higher data quality and better generalization, with models trained on it approaching or even surpassing existing open-source preference models. The dataset comprises approximately 15 million preference pairs from StackExchange. Detailed analysis reveals that preference signals from StackExchange generalize well across different topics (e.g., Math StackExchange vs. StackOverflow), suggesting domain-agnostic human preference patterns.
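A minimal sketch of how such pairs might be constructed from forum posts follows; the field names, the `min_margin` vote threshold, and pairing every answer combination are illustrative assumptions, not the paper's exact pipeline:

```python
from itertools import combinations

def build_preference_pairs(post, min_margin=5):
    """Turn one forum post into (prompt, chosen, rejected) preference pairs.

    `post` is assumed to be a dict with a "question" string and a list of
    "answers", each carrying a "text" and a precomputed "net_votes" value
    (upvotes minus downvotes). The vote-margin threshold is an illustrative
    choice, not the paper's setting.
    """
    pairs = []
    for a, b in combinations(post["answers"], 2):
        margin = a["net_votes"] - b["net_votes"]
        if abs(margin) < min_margin:
            continue  # skip pairs whose preference signal is too weak
        chosen, rejected = (a, b) if margin > 0 else (b, a)
        pairs.append({
            "prompt": post["question"],
            "chosen": chosen["text"],
            "rejected": rejected["text"],
        })
    return pairs
```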
Modeling World Preference:
WorldPM is implemented using the standard preference modeling framework, training a reward model (RM) on the collected pairwise comparison data using the Bradley-Terry (BT) loss objective:
$\mathcal{L}_{\text{BT}} = -\mathbb{E}_{(x, y_0, y_1, Y) \sim \mathcal{D}}\left[\log P(Y \mid x, y_0, y_1)\right]$
where $P(Y = 0 \mid x, y_0, y_1) = \mathrm{sigmoid}\left(r_\theta(x, y_0) - r_\theta(x, y_1)\right)$.
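A minimal PyTorch sketch of this training objective, assuming a `reward_model` callable that maps tokenized (prompt, response) batches to scalar scores (the function and variable names here are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen_inputs, rejected_inputs):
    """Pairwise Bradley-Terry loss: -log sigmoid(r(x, y_chosen) - r(x, y_rejected)).

    `chosen_inputs` / `rejected_inputs` are assumed to be batches of tokenized
    (prompt, response) pairs that `reward_model` maps to rewards of shape (batch,).
    """
    r_chosen = reward_model(**chosen_inputs)
    r_rejected = reward_model(**rejected_inputs)
    # -log P(chosen preferred over rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```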
Experiments are conducted using Qwen2.5 base models ranging from 1.5B to 72B parameters, trained on the 15M StackExchange dataset. Consistent hyperparameters are used, with a batch size of 10K and a learning rate of 3e-6. The authors observe a distinct "moment of epiphany" during large-scale training around 12.6M samples, characterized by a sudden loss drop and gradient spike, suggesting the model discovers a more general preference representation.
Evaluation Methods and Scaling Trends:
WorldPM models are evaluated on a diverse suite of RM benchmarks (PPE, RMB, RM-Bench, RewardBench, Offset Bias, HelpSteer2). These benchmarks are broadly categorized based on the capabilities they assess: adversarial (identifying subtle flaws), objective (knowledge with ground truth answers), and subjective (human/AI subjective preferences). Evaluation is performed using BT loss on test sets.
Key findings regarding scaling trends include:
- Adversarial Metrics: Test losses consistently decrease following a power law with increasing training data and model size. This indicates an improved ability to detect intentionally erroneous, irrelevant, or incomplete responses, suggesting that large-scale training helps mitigate these vulnerabilities (a simple power-law fitting sketch follows this list).
- Objective Metrics: An emergent scaling phenomenon is observed. Larger models (72B) show consistent power law reduction in test losses across various objective tasks (coding, math, QA, instruction following), while smaller models show limited or no improvement. This suggests that preference modeling for objective knowledge is a challenging task that benefits significantly from increased model scale.
- Subjective Metrics: No clear scaling trends are observed. Test losses quickly converge or even increase with increased training data and model size.
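To illustrate the kind of power-law trend reported for the adversarial and objective metrics, the sketch below fits $L(n) = a\,n^{-b} + c$ to a set of (training samples, test loss) points; the data values and the `power_law` helper are placeholders for illustration, not the paper's measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # L(n) = a * n^(-b) + c, where c acts as an irreducible-loss floor
    return a * np.power(n, -b) + c

# Placeholder points: training samples seen vs. adversarial test loss.
samples = np.array([1e5, 3e5, 1e6, 3e6, 1e7])
test_loss = np.array([0.72, 0.66, 0.61, 0.57, 0.54])

(a, b, c), _ = curve_fit(power_law, samples, test_loss, p0=[1.0, 0.1, 0.3], maxfev=10000)
print(f"fitted power-law exponent b = {b:.3f}, estimated loss floor c = {c:.3f}")
```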
Style Impact Analysis on Subjective Evaluation:
The lack of scaling in subjective metrics is hypothesized to be partly due to conflicts between WorldPM's learned preferences and biases present in subjective evaluation datasets. The authors investigate style preference, a quantifiable aspect known to influence LLM evaluation. They propose a method to separate style from content evaluation by linearly combining the model's score difference $D$ and a computed style difference $Z$ based on features such as length and markdown usage: $R = D^{T}\alpha + Z^{T}\beta$, where $\alpha$ and $\beta$ are fit to minimize the BT loss on the evaluation set.
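A hedged sketch of this style-controlled evaluation, treating $R = D^{T}\alpha + Z^{T}\beta$ as a logistic-regression fit over the reward-score difference and hand-crafted style-difference features; the `style_features` helper, its feature choices, and the use of scikit-learn's `LogisticRegression` are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_features(text):
    # Simple style descriptors; length and markdown counts are illustrative proxies.
    return np.array([
        len(text),                                # response length
        text.count("\n- ") + text.count("\n* "),  # markdown bullets
        text.count("**"),                         # bold markers
        text.count("#"),                          # headers
    ], dtype=float)

def style_controlled_fit(score_diffs, texts_a, texts_b, labels):
    """Fit alpha (reward weight) and beta (style weights) by minimizing the BT/logistic loss.

    score_diffs: r(x, y_a) - r(x, y_b) from the preference model, shape (n,)
    labels:      1 if y_a was preferred in the benchmark annotation, else 0
    """
    Z = np.stack([style_features(a) - style_features(b) for a, b in zip(texts_a, texts_b)])
    X = np.concatenate([np.asarray(score_diffs)[:, None], Z], axis=1)
    clf = LogisticRegression(fit_intercept=False).fit(X, labels)
    alpha, beta = clf.coef_[0, 0], clf.coef_[0, 1:]
    return alpha, beta
```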
Analysis shows that:
- Subjective evaluations without careful annotation (crowdsourced, AI annotated) are highly sensitive to style factors, while well-controlled datasets (expert annotated) are more stable.
- The gap between style-controlled and uncontrolled evaluation performance widens with increased WorldPM training scale and model size, as WorldPM gradually reduces its style preference.
- During training, models initially over-rely on stylistic features (especially length), but this dependence decreases with scaling, although it remains higher than the inherent correlation between human labels and style. An asymmetric learning dynamic is observed: models quickly learn majority-style preferences but spend more time learning from minority-style instances.
WorldPM as a Foundation for Preference Fine-Tuning:
The paper demonstrates WorldPM's effectiveness as an initialization for preference model fine-tuning on diverse human preference datasets (HelpSteer2, UltraFeedback, RLHFlow) of varying scales (7K to 800K samples). Results show that models initialized with WorldPM achieve broad generalization improvements across various evaluation categories (subjective, objective, adversarial, safety) compared to training from scratch, especially when the fine-tuning data is limited. The benefits from fine-tuning are also shown to correlate positively with the scale of the WorldPM model used for initialization.
Application to RLHF:
Integrating WorldPM into an internal RLHF pipeline (using GRPO optimization) shows significant improvements in alignment performance on both in-house and public evaluation benchmarks (e.g., Arena Hard, Alpaca Eval) compared to baselines without WorldPM initialization.
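A minimal sketch of the group-relative reward normalization used in GRPO-style optimization, where each prompt's G sampled responses are scored by the (WorldPM-initialized) reward model and standardized within the group; the function name and tensor layout are illustrative, and the KL penalty against the reference policy is omitted here:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO-style RL.

    rewards: shape (num_prompts, G), reward-model scores for G sampled
             responses per prompt.
    Returns (r - group mean) / group std, matching the (r - mu) / sigma term
    in the RL objective.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```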
Discussion and Limitations:
The paper concludes that scaling laws indeed apply to preference modeling, particularly in objective and adversarial domains. However, subjective evaluation remains challenging due to inherent biases like style preference, which can conflict with the preferences learned by WorldPM. The authors suggest that future work should focus on integrating RMs with other reward signals (rule-based, retrieval-augmented) for objective tasks and on developing better annotation strategies and modeling frameworks for subjective preferences that move beyond surface-level cues.
Limitations include the relatively modest scale of the WorldPM dataset (15M pairs, ~30G tokens) compared to next-token prediction pre-training, and the difficulty in capturing the full complexity of subjective preferences and biases beyond simple style features. Denoising experiments applying existing RMs to filter forum data showed limited benefit for high-quality sources like StackExchange and were seen as potentially introducing the filtering RM's own biases rather than capturing broader "world" preferences.
The LaTeX formulas used in the paper are:
- $P(Y = 0 \mid x, y_0, y_1) = \mathrm{sigmoid}\left(r_\theta(x, y_0) - r_\theta(x, y_1)\right)$
- $\mathcal{L}_{\text{BT}} = -\mathbb{E}_{(x, y_0, y_1, Y) \sim \mathcal{D}}\left[\log P(Y \mid x, y_0, y_1)\right]$
- $\max_{\phi} \mathbb{E}_{x \sim \mathcal{D}_{\text{prompt}},\, \{y_i\}_{i=1}^G \sim \pi_\phi(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{r_{\theta}(x, y_i) - \mu}{\sigma} - \beta\, D_{\text{KL}}\left(\pi_{\phi}(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right) \right]$ (RL objective example, not the core BT loss)
- $R = D^{T}\alpha + Z^{T}\beta$
- $\hat{\alpha}, \hat{\beta} = \arg\min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^{S}} \frac{1}{n} \sum_{i=1}^{n} -\left(Y_i \log\left(\mathrm{sigmoid}(R_i)\right) + (1 - Y_i)\log\left(1 - \mathrm{sigmoid}(R_i)\right)\right)$
- $\phi(i,j) = \frac{n_{11}n_{00} - n_{10}n_{01}}{\sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}}$