RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs (2404.08555v2)

Published 12 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: State-of-the-art LLMs have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of an LLM. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.

Comprehensive Analysis of Reinforcement Learning from Human Feedback in LLMs

Introduction to RLHF and Its Importance

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for aligning LLMs with human intentions and preferences. The method extends standard reinforcement learning frameworks by incorporating human evaluative feedback directly into the learning process. Research on RLHF has concentrated primarily on improving LLMs' behavior in settings where human-like conduct, trustworthiness, and safety are paramount.

Theoretical Underpinnings and Practical Implications

Foundations of RLHF:

RLHF introduces a unique method of fine-tuning LLMs that leverages human feedback to directly shape the model’s outputs. The approach is underpinned by three primary components:

  • Feedback Collection: Gathering human evaluations of model outputs, such as ranking them or providing constructive natural-language feedback.
  • Reward Model Training: Developing a model that predicts how well an output aligns with human preferences, based on the collected feedback.
  • Model Fine-Tuning: Using reinforcement learning to adjust the LLM’s parameters so that outputs better aligned with human preferences become more likely (a minimal sketch of the reward-modeling and fine-tuning steps follows this list).
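
To make the reward-modeling and fine-tuning steps concrete, the sketch below assumes PyTorch, a hypothetical scalar-output reward_model, and log-probabilities from the policy and a frozen reference model. It shows the widely used Bradley-Terry pairwise loss and a KL-penalized reward as a minimal illustration, not the paper's exact implementation.

```python
# Minimal sketch (not the paper's implementation), assuming PyTorch and a
# hypothetical reward_model that maps a batch of tokenized sequences to one
# scalar score per sequence.
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Bradley-Terry pairwise loss on (preferred, dispreferred) responses."""
    r_chosen = reward_model(chosen)      # shape: (batch,)
    r_rejected = reward_model(rejected)  # shape: (batch,)
    # Maximize log sigma(r_chosen - r_rejected): the probability that the
    # reward model ranks the human-preferred response higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def kl_shaped_reward(learned_reward, logp_policy, logp_reference, beta=0.1):
    """Reward used during RL fine-tuning: the learned reward minus a
    KL-style penalty that keeps the policy close to the reference model.
    The coefficient beta is an illustrative choice."""
    return learned_reward - beta * (logp_policy - logp_reference)
```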

Challenges and Limitations:

The paper meticulously discusses several significant challenges associated with RLHF:

  1. Incorrect Generalization: Performance degrades when the model faces novel inputs not covered by the training data.
  2. Reward Sparsity: Feedback is available only for the completed output rather than throughout the generation process, which complicates credit assignment and training dynamics (illustrated in the sketch after this list).
  3. Reward Model Generalization: Ensuring that the reward model generalizes effectively from its training data to unseen examples is critical yet challenging, often requiring iterative refinement and extensive validation against human judgment.
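
To illustrate the sparsity issue in item 2, the toy function below (a hypothetical sketch, not code from the paper) spreads a single sequence-level score from the reward model over a generated response: every token except the last receives zero reward, leaving credit assignment entirely to the RL algorithm.

```python
def per_token_rewards(sequence_score: float, num_tokens: int) -> list[float]:
    """Assign a single sequence-level reward to the final token only.

    The reward model scores the finished output, so intermediate tokens
    receive no direct signal during RL fine-tuning.
    """
    rewards = [0.0] * num_tokens
    rewards[-1] = sequence_score
    return rewards

# Example: a 6-token response scored 0.9 by the reward model.
print(per_token_rewards(0.9, 6))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.9]
```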

Future Directions in RLHF Research

The future of RLHF promises several intriguing research avenues. One critical area involves refining reward models to address issues like incorrect generalization and to integrate more nuanced forms of feedback that capture a broader range of human preferences. Moreover, exploring methodologies that reduce the dependency on extensive human feedback, for example through unsupervised or semi-supervised techniques, could broaden the applicability and efficiency of RLHF.

Another prospective development is the incorporation of multi-objective optimization frameworks that allow simultaneous tuning of multiple aspects of model outputs, such as factual accuracy and user engagement, without compromising one for the other.
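
The simplest instantiation of such a framework is a weighted scalarization of several reward signals; Pareto-style approaches go further, but the hypothetical sketch below, whose objective names and weights are illustrative assumptions rather than the paper's proposal, conveys the basic mechanics.

```python
def combined_reward(rewards: dict[str, float], weights: dict[str, float]) -> float:
    """Scalarize several reward objectives into a single training signal."""
    return sum(weights[name] * value for name, value in rewards.items())

# Example: trade off factual accuracy against user engagement.
score = combined_reward(
    rewards={"factual_accuracy": 0.8, "user_engagement": 0.6},
    weights={"factual_accuracy": 0.7, "user_engagement": 0.3},
)
print(score)  # 0.74
```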

Conclusion

This paper offers an enriched understanding of the RLHF process, elucidating its contribution to the development of more human-aligned LLMs. Not only does it highlight current achievements and limitations, but it also paves the way for future research that could revolutionize how we fine-tune and deploy LLMs in real-world applications. Given the complexity of human language and communication, the journey of refining RLHF is poised to be both challenging and rewarding, with substantial implications for AI's role in society.

Authors (8)
  1. Shreyas Chaudhari
  2. Pranjal Aggarwal
  3. Vishvak Murahari
  4. Tanmay Rajpurohit
  5. Ashwin Kalyan
  6. Karthik Narasimhan
  7. Ameet Deshpande
  8. Bruno Castro da Silva