- The paper introduces R.I.P. (Rejecting Instruction Preferences), a data selection method that filters instruction-tuning prompts based on the quality of the rejected response and the chosen-rejected reward gap, in order to improve model performance.
- Experimental results demonstrate that R.I.P. significantly improves language model performance, raising the AlpacaEval2 length-controlled win rate from 48.4% to 57.8% (Llama 3.1-8B-Instruct) and the Arena-Hard score from 70.5 to 82.9 (Llama 3.3-70B-Instruct).
- R.I.P. proves robust across settings and extends to generating higher-quality synthetic data (Self-RIP) than standard methods or even curated human data.
The paper introduces a data selection methodology for instruction tuning that leverages the quality signals contained in paired responses to a given prompt. The approach, termed Rejecting Instruction Preferences (RIP), is based on two key hypotheses:
- Low-quality prompts yield uniformly low-quality or highly variable responses.
- Prompt quality can be indirectly measured by scrutinizing the properties of the “rejected” response in a preference pair as well as the reward gap between the chosen and rejected responses.
In the proposed framework, each training prompt is associated with a pair of responses: one designated as the "chosen" response (highest reward) and one as the "rejected" response (lowest reward or, in alternative experiments, selected by other pairing strategies). The method defines three metrics for a prompt x with responses y_w (winning/chosen) and y_l (losing/rejected):
- Rejected response reward: r(y_l | x)
- Rejected response length: len(y_l)
- Reward gap: r(y_w | x) − r(y_l | x)
Thresholds are then applied: a prompt is retained only if its rejected response clears lower bounds on both reward and length and the chosen-rejected reward gap falls below an upper bound. This filtering step drastically reduces the size of the training set (e.g., from 20k to around 4.5k prompts in one WildChat setup) while yielding significant improvements in downstream performance.
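To make the procedure concrete, here is a minimal sketch of the filtering step, assuming N responses have already been sampled per prompt; the `reward_model.score(prompt, response)` interface and the threshold values are placeholders rather than the paper's exact implementation:

```python
def rip_filter(prompts, responses_per_prompt, reward_model,
               min_rejected_reward=0.0, min_rejected_len=200,
               max_reward_gap=0.5):
    """Keep only prompts whose rejected response is reasonably long and
    well-rewarded, and whose chosen/rejected reward gap is small.
    (Sketch only: reward_model and thresholds are assumed interfaces.)"""
    kept = []
    for prompt, responses in zip(prompts, responses_per_prompt):
        # Score all N sampled responses for this prompt and sort by reward.
        scored = sorted((reward_model.score(prompt, r), r) for r in responses)
        r_rejected, y_rejected = scored[0]    # lowest-reward response
        r_chosen, y_chosen = scored[-1]       # highest-reward response

        # RIP metrics: rejected reward, rejected length, reward gap.
        reward_gap = r_chosen - r_rejected
        if (r_rejected >= min_rejected_reward
                and len(y_rejected) >= min_rejected_len
                and reward_gap <= max_reward_gap):
            kept.append({"prompt": prompt,
                         "chosen": y_chosen,
                         "rejected": y_rejected})
    return kept
```

This sketch uses the default best-versus-worst pairing described above to form the chosen/rejected pair; the alternative pairing strategies discussed in the ablations would only change how `y_rejected` is selected.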
Key experimental findings include:
- Human-Written Prompts: When training on WildChat data with RIP filtering applied, models based on Llama 3.1-8B-Instruct improved the AlpacaEval2 length-controlled (LC) win rate from 48.4% (no filtering) to 57.8% and achieved comparable gains on the Arena-Hard and WildBench benchmarks. Similarly, a Llama 3.3-70B-Instruct model trained with RIP-filtered prompts improved its Arena-Hard score from 70.5 to 82.9, moving it significantly higher on the leaderboard.
- Synthetic Data Generation (Self-RIP): The method extends naturally to synthetic prompt generation. A seed pool is initialized with RIP-curated human instructions, new prompts are generated via few-shot prompting, and a second round of RIP filtering is applied; the resulting data (termed Self-RIP) outperforms both standard Self-Instruct data and even human-written instructions under several evaluation metrics (see the sketch after this list).
- Robustness and Ablation Studies: The authors conducted extensive ablations over variations in rejection sampling (e.g., best-versus-worst compared to best-versus-bottom-percentile pairings) and demonstrated that, while alternative pairing strategies yield only marginal improvements, the separation of high-quality prompts via RIP is the primary driver of gains. The method remains robust across different numbers of sampled responses per prompt (from N=8 to N=64).
- Data Analysis: Using t-SNE visualizations of prompt embeddings, the paper shows that RIP effectively removes clusters that correspond to ambiguous, unsafe, or highly noisy prompts. Additionally, an evaluation using a GPT-4 based prompt judge indicates that the filtered set contains no extremely low-quality or unsafe prompts, thereby addressing both the alignment and safety aspects of training data.
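Building on the `rip_filter` sketch above, the Self-RIP loop from the synthetic-data bullet can be outlined as follows; `generate_fn` and `sample_responses_fn` are assumed wrappers around the instruction model, and the few-shot template and default parameter values are illustrative rather than the paper's exact setup:

```python
import random

def self_rip(seed_prompts, generate_fn, sample_responses_fn, reward_model,
             num_new_prompts=20000, k_shot=8, **rip_thresholds):
    """Sketch of Self-RIP: synthesize new prompts from a RIP-curated seed
    pool via few-shot prompting, then apply RIP filtering a second time.
    (generate_fn / sample_responses_fn are assumed model wrappers.)"""
    synthetic = []
    for _ in range(num_new_prompts):
        # Condition generation on k randomly drawn RIP-curated seed prompts.
        shots = random.sample(seed_prompts, k_shot)
        context = "\n\n".join(f"Instruction: {s}" for s in shots) + "\n\nInstruction:"
        synthetic.append(generate_fn(context))

    # Sample N responses per synthetic prompt, then filter with RIP again.
    responses_per_prompt = [sample_responses_fn(p) for p in synthetic]
    return rip_filter(synthetic, responses_per_prompt, reward_model,
                      **rip_thresholds)
```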
The presented technique is positioned as a general filtering approach that can be seamlessly integrated into various preference optimization or reinforcement learning from human feedback methods, including Direct Preference Optimization (DPO). The combination of quantitatively derived metrics with thorough ablation experiments and qualitative analyses underlines the viability of using rejected response quality and reward gap as proxy measures for prompt effectiveness. This work ultimately demonstrates that careful curation of the training prompt space—whether for human-written or synthetic instructions—can yield significant and robust performance improvements in LLMs.
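Since the paper positions RIP as a drop-in filter for preference optimization pipelines such as DPO, the following minimal PyTorch sketch shows how the retained chosen/rejected pairs would enter a standard DPO objective; the summed per-pair log-probabilities are assumed to be computed elsewhere, and nothing here is specific to the paper's training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over a batch of RIP-retained preference pairs.
    Each argument is a 1-D tensor of summed token log-probabilities."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to widen the margin between chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```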