Self-Boosting Large Language Models with Synthetic Preference Data (2410.06961v1)

Published 9 Oct 2024 in cs.CL and cs.AI

Abstract: Through alignment with human preferences, LLMs have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.

Self-Boosting LLMs with Synthetic Preference Data

The paper introduces SynPO, an approach that enhances LLMs with synthetic preference data. Instead of relying on costly human-curated datasets, the model self-generates its own high-quality training data, aligning its outputs with human preferences without large-scale annotation.

Key Methodological Contributions

SynPO uniquely addresses the challenge of high-quality preference data scarcity through a self-boosting paradigm. The process involves two primary components: a self-prompt generator and a response improver.

  1. Self-Prompt Generator: SynPO trains its own generator to produce diverse prompts from keyword inputs, removing the need for direct human intervention or assistance from external models. The generator is trained on a small set of seed data and synthesizes high-quality prompts by combining randomly sampled keywords.
  2. Response Improver: The response improver uses the model itself to enhance responses iteratively. It refines the model's initial completions, and the refined versions serve as the chosen responses in synthetic preference pairs, with the original completions as the rejected counterparts. This strategy capitalizes on the model's ability to recognize the gap between its outputs and the target distribution and to adjust its responses incrementally.
  3. Iterative Training and Preference Optimization: SynPO employs an iterative approach in which synthetic prompts and refined responses guide the training process. This loop substantially improves the model's instruction-following capabilities and overall task performance without requiring large annotated preference datasets (a sketch of one such iteration, and of a possible preference-optimization objective, follows this list).
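
As a rough illustration of the loop described above, the sketch below shows one SynPO-style iteration. The interfaces prompt_generator.generate, model.generate, model.improve, and optimize_preferences are hypothetical names introduced here for clarity; they only approximate the pipeline reported in the paper.

```python
import random

NUM_PROMPTS = 1000  # synthetic prompts per iteration (illustrative value)

def synpo_iteration(model, prompt_generator, keyword_pool, optimize_preferences):
    """One self-boosting round: synthesize prompts, build preference pairs
    from the model's own outputs, then run preference optimization."""
    preference_pairs = []
    for _ in range(NUM_PROMPTS):
        # Self-prompt generator: combine a few random keywords into a prompt.
        keywords = random.sample(keyword_pool, k=3)
        prompt = prompt_generator.generate(keywords)

        # Initial completion from the current policy model (treated as "rejected").
        draft = model.generate(prompt)

        # Response improver: the model refines its own draft; the refinement
        # becomes the "chosen" response of the synthetic preference pair.
        improved = model.improve(prompt, draft)

        preference_pairs.append(
            {"prompt": prompt, "chosen": improved, "rejected": draft}
        )

    # Preference optimization on the synthetic pairs yields the policy
    # used in the next iteration.
    return optimize_preferences(model, preference_pairs)
```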

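The preference-optimization step in item 3 can use any off-the-shelf pairwise objective; the exact choice is not detailed in this summary. As one common option, a DPO-style loss over the synthetic (chosen, rejected) pairs is sketched below in PyTorch, assuming log-probabilities summed over response tokens.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style objective on a batch of synthetic preference pairs.

    Each argument is a 1-D tensor of summed response log-probabilities
    under the current policy or the frozen reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the improved ("chosen") response over its own
    # original draft ("rejected"), relative to the reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```
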
Experimental Results

The paper reports substantial improvements in model performance on benchmarks such as AlpacaEval 2.0 and Arena-Hard. After four iterations, the SynPO-enhanced Mistral-7B and Llama3-8B models achieve win-rate improvements of over 22% compared with the initial models. Additionally, average scores on the Open LLM Leaderboard increase by 3.2 to 5.0 points.

Notably, this self-boosting mechanism addresses the "alignment tax" issue prevalent in other LLM alignment strategies. By refining model outputs internally, SynPO reduces the model's dependency on external feedback, allowing it to maintain and even enhance its generalist capabilities across diverse tasks.

Implications and Future Directions

The results suggest that SynPO can meaningfully change how LLMs are trained, providing a cost-effective route to continual model improvement without extensive human intervention. The approach gives models a pathway to adapt to new tasks autonomously by learning from synthetic preferences, reducing dependence on human-annotated alignment data.

Future research might explore integrating SynPO with on-policy optimization techniques to further refine model performance. Additionally, examining the extent to which SynPO can influence various downstream applications could illuminate broader implications for AI development.

Overall, SynPO represents a significant advancement in the field of LLM alignment and exemplifies the potential of synthetic data to drive AI innovation.

Authors (5)
  1. Qingxiu Dong (39 papers)
  2. Li Dong (154 papers)
  3. Xingxing Zhang (65 papers)
  4. Zhifang Sui (89 papers)
  5. Furu Wei (291 papers)