
The History and Risks of Reinforcement Learning and Human Feedback (2310.13595v2)

Published 20 Oct 2023 in cs.CY

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make LLMs easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of human preferences that acts as a reward function for optimization. This approach, which operates at the intersection of many stakeholders and academic disciplines, remains poorly understood. RLHF reward models are often cited as being central to achieving performance, yet very few descriptors of capabilities, evaluations, training methods, or open-source models exist. Given this lack of information, further study and transparency is needed for learned RLHF reward models. In this paper, we illustrate the complex history of optimizing preferences, and articulate lines of inquiry to understand the sociotechnical context of reward models. In particular, we highlight the ontological differences between costs, rewards, and preferences at stake in RLHF's foundations, related methodological tensions, and possible research directions to improve general understanding of how reward models function.

A Systematic Examination of Reinforcement Learning from Human Feedback (RLHF)

The paper "The History and Risks of Reinforcement Learning and Human Feedback" presents a comprehensive analysis of the theoretical foundations and practical implementations of Reinforcement Learning from Human Feedback (RLHF). The authors, Nathan Lambert, Thomas Krendl Gilbert, and Tom Zick, delve into the historical and intellectual lineage that informs RLHF, critically addressing the assumptions and presumptions inherent in the process of modeling and optimizing human preferences.

The paper is significant as it emphasizes the lack of transparency and understanding surrounding RLHF reward models, which are central to the performance of LLMs equipped with human-like interaction capabilities, such as OpenAI's ChatGPT and Anthropic's Claude. The authors argue for improved clarity and methodological inquiry into the design and deployment of these models, foregrounding the sociotechnical context involved.

The core contribution of the paper lies in its historical tracing of RLHF's intellectual ancestry, linking the evolution of preference quantification to modern reinforcement learning. The authors detail the historical convergences that have shaped the existing RLHF framework, identifying key assumptions that underpin current methodologies. Among these assumptions are the quantifiability of human preferences, the presupposition that optimal solutions exist for stated optimization problems, and the notion that the reward signal accurately captures user preferences without compromising their complexity and variability.
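To make the "stated optimization problems" concrete: most RLHF implementations optimize a KL-regularized objective in which a policy is pushed to maximize a learned reward while being penalized for drifting from a reference model. The formulation below uses the conventional notation of the RLHF literature (π_θ for the tuned policy, r_φ for the learned reward model, π_ref for the reference policy, β for the penalty weight); it is a standard rendering, not a formula reproduced from this paper.

\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[\, r_\phi(x, y) \,\big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

Each assumption the authors flag (that preferences are quantifiable, that this problem has a meaningful optimum, and that r_φ faithfully stands in for human preferences) is baked into this single expression.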

The authors underscore that although RLHF draws on mature fields such as control theory and behavioral economics, it struggles with human preference modeling because preferences are inherently contextual, temporal, and often ambiguous. This raises critical questions about whether aggregating binary pairwise preferences into a single reward model can truly reflect human values across diverse contexts and populations.
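To ground what aggregating pairwise preferences into a single reward model typically involves, the sketch below shows the Bradley-Terry-style loss that is standard in the RLHF literature: each human comparison contributes -log σ(r(x, y_chosen) - r(x, y_rejected)), so every judgment is compressed into a scalar margin. The reward_model interface and tensor shapes are illustrative assumptions, not code from the paper.

import torch.nn.functional as F

def pairwise_reward_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry-style loss for training a scalar reward model from binary
    human preferences. `reward_model` is assumed to map a batch of
    (prompt, completion) pairs to one scalar score each; illustrative only."""
    r_chosen = reward_model(prompts, chosen)      # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected): each comparison is reduced to a
    # single scalar margin, which is exactly where the aggregation critique bites.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

Everything about who the annotators were, what context they judged in, and how strongly they felt is collapsed into that margin, which is precisely the loss of complexity the authors question.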

Moreover, the paper offers a scholarly discussion of specific assumptions and presumptions, ranging from the representational adequacy of pairwise preferences and the conflation of human values with reward functions to the methodological tensions among reinforcement learning algorithms with disparate disciplinary origins. Implicit biases in data collection and model training further complicate the use of RLHF, requiring scrutiny of the demographics and contexts of the data annotation process.

In terms of methodological development, the authors propose a series of questions about model training, data curation, and optimization practices, urging researchers to systematically evaluate reward models' capabilities and potential hazards. They highlight clear documentation, rigorous testing protocols, and careful attention to context as ingredients of more robust model evaluation frameworks.
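One of the simplest evaluations these questions point toward is measuring how often a trained reward model agrees with held-out human comparisons. A minimal sketch, assuming a reward_model callable that returns a scalar score for a (prompt, completion) pair and an iterable of (prompt, chosen, rejected) examples (both interfaces are illustrative, not from the paper):

def pairwise_accuracy(reward_model, comparisons):
    """Fraction of held-out human comparisons on which the reward model scores
    the human-preferred completion higher. Interfaces are illustrative."""
    correct, total = 0, 0
    for prompt, chosen, rejected in comparisons:
        if reward_model(prompt, chosen) > reward_model(prompt, rejected):
            correct += 1
        total += 1
    return correct / max(total, 1)

Reporting such accuracy sliced by domain, annotator group, or prompt type is one concrete way to act on the documentation and testing practices the authors call for.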

The paper also surveys RLHF's speculative future directions, highlighting alternatives such as direct preference optimization and synthetic preference data while cautioning about the challenges these nascent methods may harbor. The authors further discuss the stability of RLHF-trained LLMs, emphasize the need to understand societal impacts, and suggest measures such as red-teaming of reward models for safety assurance.
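Direct preference optimization, one of the alternatives mentioned, removes the separately trained reward model by reparameterizing the implicit reward as a policy-to-reference log-probability ratio. A minimal sketch of the DPO loss (Rafailov et al., 2023), assuming the summed token log-probabilities of each completion have already been computed under the tuned policy and a frozen reference model (variable names are illustrative):

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct preference optimization loss. Inputs are per-sequence log-probs of
    the chosen/rejected completions under the tuned policy and a frozen
    reference policy; beta scales the implicit KL penalty."""
    # Implicit rewards are the policy-to-reference log-ratios.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # Same Bradley-Terry form as reward-model training, applied to the log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

Note that this shifts rather than removes the assumptions the paper scrutinizes: the same pairwise preference data, and the same Bradley-Terry form, still define what counts as a better response.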

In conclusion, the paper advocates a nuanced understanding of RLHF systems, calling for comprehensive evaluation of, and discussion around, the implicit assumptions in current deployments of reward models. By critically examining RLHF from its intellectual roots to contemporary practice, the authors provide valuable insights that promote responsible and technically sound advances in human-centered AI. This contribution could inform policy and development decisions, supporting the ethical deployment of AI technologies and better aligning models with genuine human values.

Authors (3)
  1. Nathan Lambert (37 papers)
  2. Thomas Krendl Gilbert (16 papers)
  3. Tom Zick (31 papers)
Citations (36)