Understanding the Learning Dynamics of Alignment with Human Feedback (2403.18742v5)
Abstract: Aligning LLMs with human intentions has become a critical task for safely deploying models in real-world systems. While existing alignment approaches have seen empirical success, theoretically understanding how these methods affect model behavior remains an open question. Our work provides an initial attempt to theoretically analyze the learning dynamics of human preference alignment. We formally show how the distribution of preference datasets influences the rate of model updates and provide rigorous guarantees on the training accuracy. Our theory also reveals an intricate phenomenon where the optimization is prone to prioritizing certain behaviors with higher preference distinguishability. We empirically validate our findings on contemporary LLMs and alignment tasks, reinforcing our theoretical insights and shedding light on considerations for future alignment approaches. Disclaimer: This paper contains potentially offensive text; reader discretion is advised.
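To make the setting concrete, below is a minimal sketch of the kind of preference-alignment objective the abstract refers to: a Bradley-Terry-style preference probability optimized via a DPO-style loss. This is an illustrative assumption, not the paper's exact formulation; the function name `dpo_loss`, the hyperparameter `beta`, and the toy inputs are all hypothetical stand-ins for per-response log-probabilities from a policy and a frozen reference model.

```python
# Hedged sketch of a DPO-style preference loss (Bradley-Terry under the hood).
# Assumes summed log-probabilities of the chosen/rejected responses are already
# computed for both the trainable policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * ((policy chosen-rejected margin) - (reference margin)))."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: random log-probabilities stand in for model outputs on a batch of 8 pairs.
torch.manual_seed(0)
lp_c, lp_r = torch.randn(8), torch.randn(8)
rlp_c, rlp_r = torch.randn(8), torch.randn(8)
print(dpo_loss(lp_c, lp_r, rlp_c, rlp_r))
```

Under this formulation, the gap between the policy and reference margins is what the paper's notion of "preference distinguishability" acts on: pairs whose chosen and rejected responses are easier to tell apart drive larger updates, which is the intuition the theory formalizes.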