Optimizing Language Models for Human Preferences is a Causal Inference Problem (2402.14979v2)
Abstract: As LLMs see greater use in academic and commercial settings, there is increasing interest in methods that allow LLMs to generate texts aligned with human preferences. In this paper, we present an initial exploration of LLM optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that LLM optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method--causal preference optimization (CPO)--that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions.
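To make the abstract's surrogate-objective idea concrete, here is a minimal sketch in standard off-policy evaluation notation; the symbols below ($W$ for a text, $Y$ for its observed outcome, $\pi_0$ for the logging policy that generated the data, $\pi_\theta$ for the LLM policy being optimized, $\hat{\mu}$ for a fitted outcome model) are our own illustrative choices, not necessarily the paper's exact formulation. An inverse-propensity-weighted (IPW) surrogate estimates the value of $\pi_\theta$ from logged texts, and a doubly robust (DR) variant adds an outcome-model baseline to reduce variance:

$$
\hat{V}_{\mathrm{IPW}}(\pi_\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_\theta(W_i)}{\pi_0(W_i)}\, Y_i,
\qquad
\hat{V}_{\mathrm{DR}}(\pi_\theta) = \mathbb{E}_{W\sim\pi_\theta}\!\left[\hat{\mu}(W)\right] + \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_\theta(W_i)}{\pi_0(W_i)}\left(Y_i - \hat{\mu}(W_i)\right).
$$

When $\hat{\mu}$ predicts $Y$ well, the residuals $Y_i - \hat{\mu}(W_i)$ shrink and so does the variance of the weighted term; when $\hat{\mu}$ is poor but $\pi_0$ is correctly specified, the weighted residuals still correct the baseline. This is the classical doubly robust guarantee on bias that the abstract alludes to.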