
MENTOR: Guiding Hierarchical Reinforcement Learning with Human Feedback and Dynamic Distance Constraint (2402.14244v2)

Published 22 Feb 2024 in cs.AI, cs.HC, and cs.LG

Abstract: Hierarchical reinforcement learning (HRL) is a promising approach for intelligent agents facing complex tasks with sparse rewards: it uses a hierarchical framework that divides a task into subgoals and completes them sequentially. However, current methods struggle to find suitable subgoals that ensure a stable learning process. Without additional guidance, relying solely on exploration or heuristic methods to determine subgoals in a large goal space is impractical. To address this issue, we propose MENTOR, a general hierarchical reinforcement learning framework that incorporates human feedback and dynamic distance constraints. MENTOR acts as a "mentor", incorporating human feedback into high-level policy learning to find better subgoals. For the low-level policy, MENTOR uses a dual policy that decouples exploration from exploitation to stabilize training. Furthermore, although humans can easily break a task down into subgoals that guide learning in the right direction, subgoals that are too difficult or too easy can still hinder downstream learning efficiency. We therefore propose the Dynamic Distance Constraint (DDC) mechanism, which dynamically adjusts the space of candidate subgoals so that MENTOR generates subgoals matching the low-level policy's learning progress, from easy to hard. Extensive experiments demonstrate that MENTOR achieves significant improvement on complex sparse-reward tasks using only a small amount of human feedback.
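The abstract describes three interacting pieces: a high-level policy shaped by human preference feedback, a low-level dual policy that decouples exploration from exploitation, and a Dynamic Distance Constraint (DDC) that limits how far a proposed subgoal may lie from the current state, loosening as the low-level policy improves. The sketch below is a minimal, hypothetical illustration of how subgoal proposal under a distance budget might look; the function names (`propose_subgoal`, `distance`, `train_step`) and the Euclidean goal-space metric are placeholders, not taken from the paper, and the Bradley-Terry preference likelihood is the standard formulation used in preference-based RL rather than the authors' exact reward-model update.

```python
import numpy as np


def distance(a, b):
    # Placeholder goal-space metric; the paper's setting may use a learned
    # distance, here we simply use Euclidean distance for illustration.
    return np.linalg.norm(a - b)


def propose_subgoal(state, goal, max_dist, rng):
    # High-level policy stub: sample a candidate subgoal on the way to the
    # final goal, then clip it so it stays within the current distance budget.
    candidate = state + (goal - state) * rng.uniform(0.0, 1.0)
    d = distance(candidate, state)
    if d > max_dist:
        candidate = state + (candidate - state) * (max_dist / d)
    return candidate


def preference_logprob(r_a, r_b):
    # Bradley-Terry log-likelihood that segment A is preferred over segment B,
    # given scalar reward-model scores r_a and r_b (standard in preference RL).
    return r_a - np.logaddexp(r_a, r_b)


def train_step(state, goal, low_level_success_rate, rng, base_dist=0.5):
    # DDC idea: allow harder (more distant) subgoals as the low-level policy
    # becomes more competent, producing an easy-to-hard curriculum.
    max_dist = base_dist * (1.0 + low_level_success_rate)
    return propose_subgoal(state, goal, max_dist, rng)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state, goal = np.zeros(2), np.array([5.0, 5.0])
    for success in (0.1, 0.5, 0.9):
        sg = train_step(state, goal, success, rng)
        print(f"success={success:.1f} -> subgoal {sg} (dist={distance(sg, state):.2f})")
```

The point of the sketch is only that the subgoal search space expands gradually with low-level competence, which is the curriculum effect the abstract attributes to DDC; the actual MENTOR architecture and training details are in the paper.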

Authors (4)
  1. Xinglin Zhou (1 paper)
  2. Yifu Yuan (19 papers)
  3. Shaofu Yang (5 papers)
  4. Jianye Hao (185 papers)

