- The paper introduces R.I.P. (Rejecting Instruction Preferences), a data selection method that filters instruction-tuning prompts based on the quality of the rejected response and the chosen-rejected reward gap, in order to improve model performance.
- Experimental results demonstrate that R.I.P. significantly improves language model performance, raising the AlpacaEval2 length-controlled win rate from 48.4% to 57.8% (Llama 3.1-8B-Instruct) and the Arena-Hard score from 70.5 to 82.9 (Llama 3.3-70B-Instruct).
- R.I.P. proves robust across settings and extends to generating higher-quality synthetic data (Self-RIP) than standard methods or even curated human data.
The paper introduces a data selection methodology for instruction tuning that leverages the quality signals contained in paired responses to a given prompt. The approach, termed Rejecting Instruction Preferences (RIP), is based on two key hypotheses:
- Low-quality prompts yield uniformly low-quality or highly variable responses.
- Prompt quality can be indirectly measured by scrutinizing the properties of the “rejected” response in a preference pair as well as the reward gap between the chosen and rejected responses.
In the proposed framework, each training prompt is associated with a pair of responses: one designated as the "chosen" response (highest reward) and one as the "rejected" response (lowest reward or, in alternative experiments, selected by other pairing strategies). The method defines three metrics for a prompt x with responses y_w (winning/chosen) and y_l (losing/rejected):
- Rejected response reward: r(y_l | x)
- Rejected response length: len(y_l)
- Reward gap: r(y_w | x) − r(y_l | x)
Thresholds are then applied: a prompt is retained only if its rejected response clears lower bounds on both reward and length and the chosen-rejected reward gap falls below an upper bound. This filtering step drastically reduces the size of the training set (e.g., from 20k to around 4.5k prompts in one WildChat setup) while yielding significant improvements in downstream performance.
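To make the procedure concrete, here is a minimal sketch of the filtering step, assuming N responses have already been sampled per prompt; the `reward_model.score(prompt, response)` interface and the threshold values are placeholders rather than the paper's exact implementation:

```python
def rip_filter(prompts, responses_per_prompt, reward_model,
               min_rejected_reward=0.0, min_rejected_len=200,
               max_reward_gap=0.5):
    """Keep only prompts whose rejected response is reasonably long and
    well-rewarded, and whose chosen/rejected reward gap is small.
    (Sketch only: reward_model and thresholds are assumed interfaces.)"""
    kept = []
    for prompt, responses in zip(prompts, responses_per_prompt):
        # Score all N sampled responses for this prompt and sort by reward.
        scored = sorted((reward_model.score(prompt, r), r) for r in responses)
        r_rejected, y_rejected = scored[0]    # lowest-reward response
        r_chosen, y_chosen = scored[-1]       # highest-reward response

        # RIP metrics: rejected reward, rejected length, reward gap.
        reward_gap = r_chosen - r_rejected
        if (r_rejected >= min_rejected_reward
                and len(y_rejected) >= min_rejected_len
                and reward_gap <= max_reward_gap):
            kept.append({"prompt": prompt,
                         "chosen": y_chosen,
                         "rejected": y_rejected})
    return kept
```

This sketch uses the default best-versus-worst pairing described above to form the chosen/rejected pair; the alternative pairing strategies discussed in the ablations would only change how `y_rejected` is selected.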
Key experimental findings include:
- Human-Written Prompts: When training on WildChat data with RIP filtering applied, models based on Llama 3.1-8B-Instruct improved the AlpacaEval2 length-controlled (LC) win rate from 48.4% (no filtering) to 57.8% and achieved comparable gains on the Arena-Hard and WildBench benchmarks. Similarly, a Llama 3.3-70B-Instruct model trained with RIP-filtered prompts improved its Arena-Hard score from 70.5 to 82.9, moving it significantly higher on the leaderboard.
- Synthetic Data Generation (Self-RIP): The method extends naturally to synthetic prompt generation. A seed pool is initialized with RIP-curated human instructions, new prompts are generated via few-shot prompting, and a second round of RIP filtering is applied; the resulting data (termed Self-RIP) outperforms both standard Self-Instruct data and even human-written instructions under several evaluation metrics (see the sketch after this list).
- Robustness and Ablation Studies: The authors conducted extensive ablations over variations in rejection sampling (e.g., best-versus-worst compared to best-versus-bottom-percentile pairings) and demonstrated that, while alternative pairing strategies yield only marginal improvements, the separation of high-quality prompts via RIP is the primary driver of gains. The method remains robust across different numbers of sampled responses per prompt (from N=8 to N=64).
- Data Analysis: Using t-SNE visualizations of prompt embeddings, the paper shows that RIP effectively removes clusters that correspond to ambiguous, unsafe, or highly noisy prompts. Additionally, an evaluation using a GPT-4 based prompt judge indicates that the filtered set contains no extremely low-quality or unsafe prompts, thereby addressing both the alignment and safety aspects of training data.
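Building on the `rip_filter` sketch above, the Self-RIP loop from the synthetic-data bullet can be outlined as follows; `generate_fn` and `sample_responses_fn` are assumed wrappers around the instruction model, and the few-shot template and default parameter values are illustrative rather than the paper's exact setup:

```python
import random

def self_rip(seed_prompts, generate_fn, sample_responses_fn, reward_model,
             num_new_prompts=20000, k_shot=8, **rip_thresholds):
    """Sketch of Self-RIP: synthesize new prompts from a RIP-curated seed
    pool via few-shot prompting, then apply RIP filtering a second time.
    (generate_fn / sample_responses_fn are assumed model wrappers.)"""
    synthetic = []
    for _ in range(num_new_prompts):
        # Condition generation on k randomly drawn RIP-curated seed prompts.
        shots = random.sample(seed_prompts, k_shot)
        context = "\n\n".join(f"Instruction: {s}" for s in shots) + "\n\nInstruction:"
        synthetic.append(generate_fn(context))

    # Sample N responses per synthetic prompt, then filter with RIP again.
    responses_per_prompt = [sample_responses_fn(p) for p in synthetic]
    return rip_filter(synthetic, responses_per_prompt, reward_model,
                      **rip_thresholds)
```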
The presented technique is positioned as a general filtering approach that can be seamlessly integrated into various preference optimization or reinforcement learning from human feedback methods, including Direct Preference Optimization (DPO). The combination of quantitatively derived metrics with thorough ablation experiments and qualitative analyses underlines the viability of using rejected response quality and reward gap as proxy measures for prompt effectiveness. This work ultimately demonstrates that careful curation of the training prompt space—whether for human-written or synthetic instructions—can yield significant and robust performance improvements in LLMs.
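Since the paper positions RIP as a drop-in filter for preference optimization pipelines such as DPO, the following minimal PyTorch sketch shows how the retained chosen/rejected pairs would enter a standard DPO objective; the summed per-pair log-probabilities are assumed to be computed elsewhere, and nothing here is specific to the paper's training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over a batch of RIP-retained preference pairs.
    Each argument is a 1-D tensor of summed token log-probabilities."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to widen the margin between chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```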