
Theoretical guarantees on the best-of-n alignment policy (2401.01879v1)

Published 3 Jan 2024 in cs.LG, cs.CL, cs.IT, and math.IT

Abstract: A simple and effective method for the alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a base policy, ranked based on a reward function, and the highest-ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the base policy is equal to $\log (n) - (n-1)/n.$ We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes. Finally, we propose a new estimator for the KL divergence and empirically show that it provides a tight approximation through a few examples.
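The abstract describes the best-of-$n$ policy and the analytical expression $\log(n) - (n-1)/n$ that the paper shows is only an upper bound on the true KL divergence. The sketch below, which is an illustration and not the authors' code, implements best-of-$n$ selection over a toy uniform base policy with an arbitrary reward, estimates the resulting KL divergence empirically, and compares it against that upper bound. The vocabulary, reward function, and Monte Carlo estimate are all assumptions made for this example.

```python
# Minimal sketch (not from the paper): best-of-n sampling over a toy discrete
# base policy, compared against the analytical KL upper bound log(n) - (n-1)/n.
import math
import numpy as np

rng = np.random.default_rng(0)

# Toy uniform base policy over a small "vocabulary" of responses (assumption).
vocab = np.arange(10)
base_probs = np.full(10, 0.1)

def reward(x):
    # Illustrative reward: higher index means higher reward.
    return float(x)

def best_of_n_sample(n):
    """Draw n i.i.d. samples from the base policy and keep the highest-reward one."""
    samples = rng.choice(vocab, size=n, p=base_probs)
    return max(samples, key=reward)

def kl_upper_bound(n):
    """Commonly cited expression; the paper shows it is an upper bound, not an equality."""
    return math.log(n) - (n - 1) / n

# Empirical distribution of the best-of-n policy and its KL divergence to the base policy.
n = 4
counts = np.bincount([best_of_n_sample(n) for _ in range(200_000)], minlength=10)
bon_probs = counts / counts.sum()
kl = sum(p * math.log(p / q) for p, q in zip(bon_probs, base_probs) if p > 0)

print(f"estimated KL(best-of-{n} || base) = {kl:.4f}")
print(f"upper bound log(n) - (n-1)/n      = {kl_upper_bound(n):.4f}")
```

In this discrete toy setting the estimated KL comes out below $\log(n) - (n-1)/n$, consistent with the paper's claim that the expression is an upper bound rather than an equality.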

Authors (7)
  1. Ahmad Beirami (86 papers)
  2. Alekh Agarwal (99 papers)
  3. Jonathan Berant (107 papers)
  4. Alexander D'Amour (37 papers)
  5. Jacob Eisenstein (73 papers)
  6. Chirag Nagpal (25 papers)
  7. Ananda Theertha Suresh (73 papers)
Citations (24)