Extensive Self-Contrast Enables Feedback-Free Language Model Alignment
Abstract: Reinforcement learning from human feedback (RLHF) has been a central technique in recent LLM alignment. However, its heavy dependence on costly human or LLM-as-Judge preference feedback can stymie its wider application. In this work, we introduce Self-Contrast, a feedback-free LLM alignment method that exploits extensive self-generated negatives. With only supervised fine-tuning (SFT) targets, Self-Contrast leverages the LLM itself to generate a large, diverse pool of candidate responses, and harnesses a pre-trained embedding model to filter multiple negatives according to text similarity. Theoretically, we show that in this setting, merely scaling up the number of negative responses can still effectively approximate settings with more balanced positive and negative preference annotations. Our experiments with direct preference optimization (DPO) on three datasets show that Self-Contrast consistently outperforms SFT and standard DPO training by large margins, and that its performance continues to improve as the number of self-generated negatives grows. Code and data are available at https://github.com/THUDM/Self-Contrast.
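The abstract describes a pipeline in which the SFT target serves as the positive response and self-generated candidates are filtered by embedding similarity to obtain negatives for DPO. Below is a minimal sketch of that filtering step, assuming (these details are not specified in the abstract) a sentence-transformers embedding model, an illustrative similarity threshold, and that the candidates least similar to the SFT target are kept as negatives.

```python
# Hedged sketch of the negative-selection step; model name, threshold, and
# number of negatives per prompt are illustrative assumptions, not the
# paper's exact configuration.
from dataclasses import dataclass

import numpy as np
from sentence_transformers import SentenceTransformer


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the SFT target acts as the positive response
    rejected: str  # a self-generated candidate kept as a negative


def select_negatives(
    prompt: str,
    sft_target: str,
    candidates: list[str],
    embedder: SentenceTransformer,
    max_similarity: float = 0.85,   # assumed threshold
    num_negatives: int = 8,         # assumed count per prompt
) -> list[PreferencePair]:
    """Filter self-generated candidates into DPO negatives by text similarity.

    Candidates whose embeddings are too close to the SFT target are discarded,
    since they may be near-duplicates of the positive; the remaining, most
    dissimilar candidates are paired against the SFT target.
    """
    texts = [sft_target] + candidates
    emb = embedder.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
    target_emb, cand_embs = emb[0], emb[1:]

    # Cosine similarity of each candidate to the SFT target (unit-norm embeddings).
    sims = cand_embs @ target_emb

    # Keep candidates below the similarity threshold, most dissimilar first.
    order = np.argsort(sims)
    kept = [candidates[i] for i in order if sims[i] < max_similarity][:num_negatives]

    return [PreferencePair(prompt, sft_target, neg) for neg in kept]


if __name__ == "__main__":
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
    pairs = select_negatives(
        prompt="Explain why the sky is blue.",
        sft_target="Air molecules scatter short (blue) wavelengths of sunlight "
                   "more strongly than long (red) ones, so the sky looks blue.",
        candidates=[
            "The sky is blue because it reflects the ocean.",
            "Rayleigh scattering favors shorter wavelengths, so blue light dominates.",
            "Because nitrogen gas is blue in color.",
        ],
        embedder=embedder,
    )
    for p in pairs:
        print(p.rejected)
```

The resulting (chosen, rejected) pairs could then be fed to any standard DPO trainer; the sketch only covers the similarity-based filtering the abstract refers to.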