Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game (2311.08045v4)
Abstract: Human preference alignment is essential to improve the interaction quality of LLMs. Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However, continuously updating LLMs for alignment creates a distribution gap between model-generated samples and human-annotated responses, which hinders training effectiveness. To mitigate this issue, previous methods require additional preference annotation on newly generated samples to adapt to the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an Adversarial Preference Optimization (APO) framework, in which the LLM and the reward model update alternately via a min-max game. Through adversarial training, the reward model can adapt to the shifted generation distribution of the LLM without any additional annotation. With comprehensive experiments, we find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness. The code is at https://github.com/Linear95/APO.
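The linked repository contains the authors' implementation; the snippet below is only a minimal, self-contained sketch of the alternating min-max update described in the abstract. Toy linear modules stand in for the LLM policy and the reward model, and the data generator is a placeholder for annotated preference pairs; all names, dimensions, and losses here are illustrative assumptions rather than the paper's actual code.

```python
# Sketch of an APO-style alternating update (assumptions: toy linear modules stand in
# for the LLM and the reward model; random vectors stand in for annotated responses).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM = 16  # toy feature dimension standing in for (query, response) representations

reward_model = nn.Linear(DIM, 1)   # scores a response representation
policy = nn.Linear(DIM, DIM)       # toy "LLM": maps query features to response features
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def sample_batch(n=32):
    """Placeholder data: query features plus human-preferred ('gold') response features."""
    queries = torch.randn(n, DIM)
    gold_responses = queries + 0.1 * torch.randn(n, DIM)
    return queries, gold_responses

for step in range(200):
    queries, gold = sample_batch()

    # RM step of the min-max game: rank annotated responses above fresh LLM samples
    # (pairwise Bradley-Terry-style loss), so the RM tracks the LLM's shifting
    # generation distribution without new annotation.
    with torch.no_grad():
        generated = policy(queries)
    rm_loss = -F.logsigmoid(reward_model(gold) - reward_model(generated)).mean()
    rm_opt.zero_grad()
    rm_loss.backward()
    rm_opt.step()

    # LLM step: update the policy to increase the reward of its own samples.
    generated = policy(queries)
    pi_loss = -reward_model(generated).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
```

The key point the sketch tries to convey is that the reward-model step always compares the fixed human-annotated responses against samples drawn from the *current* policy, which is how adversarial training lets the RM adapt to the shifted generation distribution without extra labels.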
Authors: Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Tianhao Hu, Peixin Cao, Nan Du, Xiaolong Li