Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning (2405.15194v2)

Published 24 May 2024 in cs.LG and cs.AI

Abstract: Reinforcement Learning (RL) suffers from sample inefficiency in sparse-reward domains, and the problem is further pronounced when transitions are stochastic. To improve sample efficiency, reward shaping is a well-studied approach that introduces intrinsic rewards to help the RL agent converge to an optimal policy faster. However, designing a useful reward shaping function for all desirable states in the Markov Decision Process (MDP) is challenging, even for domain experts. Given that LLMs have demonstrated impressive performance across a multitude of natural language tasks, we aim to answer the following question: 'Can we obtain heuristics using LLMs for constructing a reward shaping function that can boost an RL agent's sample efficiency?' To this end, we leverage off-the-shelf LLMs to generate a plan for an abstraction of the underlying MDP, and we use this LLM-generated plan as a heuristic to construct the reward shaping signal for the downstream RL agent. By characterizing the type of abstraction based on the MDP horizon length, we analyze the quality of the heuristics the LLM generates, with and without a verifier in the loop. Our experiments across multiple domains with varying horizon lengths and numbers of sub-goals, drawn from the BabyAI environment suite and the Household, Mario, and Minecraft domains, show 1) the advantages and limitations of querying LLMs with and without a verifier to generate a reward shaping heuristic, and 2) a significant improvement in the sample efficiency of PPO, A2C, and Q-learning when guided by the LLM-generated heuristics.
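
To make the shaping mechanism concrete, below is a minimal Python sketch of how an LLM-produced subgoal plan over an abstract MDP could be turned into a potential-based shaping bonus added to the sparse environment reward. The plan contents, the abstract state-mapping function, and the potential values are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative assumptions): convert an LLM-generated subgoal
# plan into a potential-based shaping term F(s, s') = gamma * phi(s') - phi(s),
# which is added to the sparse environment reward.

GAMMA = 0.99

# Hypothetical plan an LLM might return for an abstraction of the task.
llm_plan = ["pick_up_key", "open_door", "reach_goal"]

def abstract(state):
    """Map a concrete state to the index of the last plan subgoal it satisfies
    (-1 if none). Placeholder: a real abstraction would inspect the state."""
    return state.get("subgoals_done", -1)

def potential(state):
    """Potential phi(s): number of plan steps already achieved in state s."""
    return float(abstract(state) + 1)

def shaped_reward(env_reward, state, next_state):
    """Sparse environment reward plus the potential-based shaping bonus."""
    return env_reward + GAMMA * potential(next_state) - potential(state)

if __name__ == "__main__":
    s = {"subgoals_done": 0}       # key already picked up
    s_next = {"subgoals_done": 1}  # door opened: one more plan step achieved
    print(shaped_reward(0.0, s, s_next))  # small positive bonus for progress
```

Because the bonus takes the potential-based form, it is known not to alter the optimal policy of the underlying MDP, so in principle a heuristic like this affects sample efficiency rather than the policy the agent ultimately converges to.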

Authors (6)
  1. Siddhant Bhambri (16 papers)
  2. Amrita Bhattacharjee (24 papers)
  3. Huan Liu (283 papers)
  4. Subbarao Kambhampati (126 papers)
  5. Durgesh Kalwar (6 papers)
  6. Lin Guan (25 papers)