The RL/LLM Taxonomy Tree: Reviewing Synergies Between Reinforcement Learning and Large Language Models (2402.01874v1)

Published 2 Feb 2024 in cs.CL, cs.AI, cs.LG, and cs.RO

Abstract: In this work, we review research studies that combine Reinforcement Learning (RL) and LLMs, two areas that owe their momentum to the development of deep neural networks. We propose a novel taxonomy of three main classes based on the way the two model types interact with each other. The first class, RL4LLM, includes studies where RL is leveraged to improve the performance of LLMs on tasks related to Natural Language Processing; RL4LLM is divided into two sub-categories depending on whether RL is used to directly fine-tune an existing LLM or to improve its prompt. In the second class, LLM4RL, an LLM assists the training of an RL model that performs a task not inherently related to natural language. We further break down LLM4RL based on the component of the RL training framework that the LLM assists or replaces: reward shaping, goal generation, or the policy function. Finally, in the third class, RL+LLM, an LLM and an RL agent are embedded in a common planning framework without either contributing to the training or fine-tuning of the other; we further branch this class to distinguish between studies with and without natural language feedback. We use this taxonomy to explore the motivations behind the synergy of LLMs and RL, explain the reasons for its success, pinpoint potential shortcomings and areas where further research is needed, and discuss alternative methodologies that serve the same goal.
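
To make the LLM4RL class concrete, the sketch below illustrates its reward-shaping variant: an LLM scores how well a natural-language description of the agent's state matches a natural-language goal, and that score is mixed into the environment's reward. This is a minimal illustration, not a method from any of the surveyed papers; the llm_score and shaped_reward names and the word-overlap stand-in for a real model call are assumptions made for the example.

```python
# Minimal sketch of the LLM4RL "reward shaping" pattern. All names here
# (llm_score, shaped_reward) are hypothetical illustrations, not APIs
# from the papers covered by this survey.

def llm_score(goal: str, state_description: str) -> float:
    """Stand-in for an LLM call that rates how well a natural-language
    state description matches a natural-language goal, on a 0.0-1.0 scale.
    A real system would query an actual model; this toy uses word overlap."""
    goal_words = set(goal.lower().split())
    state_words = set(state_description.lower().split())
    return len(goal_words & state_words) / max(len(goal_words), 1)

def shaped_reward(env_reward: float, goal: str, state_description: str,
                  weight: float = 0.5) -> float:
    """Combine the environment's (often sparse) task reward with the
    LLM-derived shaping term -- the core idea of the reward-shaping sub-class."""
    return env_reward + weight * llm_score(goal, state_description)

# Toy usage: the agent receives dense feedback before ever completing the task.
print(shaped_reward(0.0, "pick up the red key",
                    "the agent stands next to a red key"))
```

In practice the shaping term would come from an actual model call, with the weight tuned so the language signal guides exploration without overwhelming the task reward; the wiring, however, stays the same.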

Authors (7)
  1. Moschoula Pternea
  2. Prerna Singh
  3. Abir Chakraborty
  4. Yagna Oruganti
  5. Mirco Milletari
  6. Sayli Bapat
  7. Kebei Jiang
Citations (3)