How Can LLM Guide RL? A Value-Based Approach (2402.16181v1)

Published 25 Feb 2024 in cs.LG and cs.AI

Abstract: Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful feedback for improvement. On the other hand, recent developments in LLMs have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities for planning tasks, lacking the ability to autonomously refine their responses based on feedback. Therefore, in this paper, we study how the policy prior provided by the LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning, particularly when the difference between the ideal policy and the LLM-informed policy is small, which suggests that the initial policy is close to optimal, reducing the need for further exploration. Additionally, we present a practical algorithm SLINVIT that simplifies the construction of the value function and employs subgoals to reduce the search complexity. Our experiments across three interactive environments ALFWorld, InterCode, and BlocksWorld demonstrate that our method achieves state-of-the-art success rates and also surpasses previous RL and LLM approaches in terms of sample efficiency. Our code is available at https://github.com/agentification/Language-Integrated-VI.
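
The abstract describes LINVIT as incorporating LLM guidance as a regularization factor in value-based RL. As a rough illustration of that idea (not the authors' implementation), the sketch below runs tabular value iteration whose Bellman backup is KL-regularized toward an LLM-provided policy prior; the transition tensor `P`, reward table `R`, prior `llm_prior`, and temperature `eta` are all assumed names introduced for this example.

```python
# A minimal, hypothetical sketch of "LLM guidance as a regularization factor
# in value-based RL": tabular value iteration whose Bellman backup is
# KL-regularized toward an LLM-provided policy prior. The names P, R,
# llm_prior, and eta are illustrative assumptions, not the paper's code.
import numpy as np

def regularized_backup(Q, llm_prior, eta):
    """Soft backup under a KL penalty toward the LLM prior.

    V(s)    = (1/eta) * log sum_a pi_LLM(a|s) * exp(eta * Q(s, a))
    pi(a|s) is proportional to pi_LLM(a|s) * exp(eta * Q(s, a))
    """
    logits = eta * Q + np.log(llm_prior + 1e-12)             # shape (S, A)
    m = logits.max(axis=1, keepdims=True)                    # for numerical stability
    V = (m.squeeze(1) + np.log(np.exp(logits - m).sum(axis=1))) / eta
    pi = np.exp(logits - m)
    pi /= pi.sum(axis=1, keepdims=True)
    return V, pi

def llm_regularized_value_iteration(P, R, llm_prior, eta=5.0, gamma=0.99, iters=200):
    """Tabular iteration with transitions P[s, a, s'], rewards R[s, a],
    and an LLM policy prior llm_prior[s, a]."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V, _ = regularized_backup(Q, llm_prior, eta)
        Q = R + gamma * (P @ V)                               # contract over next state s'
    return regularized_backup(Q, llm_prior, eta)
```

When the prior already concentrates on near-optimal actions, the softmax backup is dominated by those actions and little additional exploration is needed, which matches the intuition in the abstract that sample complexity shrinks as the gap between the LLM-informed policy and the ideal policy shrinks.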

Authors (9)
  1. Shenao Zhang (16 papers)
  2. Sirui Zheng (5 papers)
  3. Shuqi Ke (5 papers)
  4. Zhihan Liu (22 papers)
  5. Wanxin Jin (25 papers)
  6. Jianbo Yuan (33 papers)
  7. Yingxiang Yang (14 papers)
  8. Hongxia Yang (130 papers)
  9. Zhaoran Wang (164 papers)
Citations (5)