
Backward Learning for Goal-Conditioned Policies (2312.05044v2)

Published 8 Dec 2023 in cs.LG and cs.AI

Abstract: Can we learn policies in reinforcement learning without rewards? Can we learn a policy just by trying to reach a goal state? We answer these questions positively by proposing a multi-step procedure that first learns a world model that goes backward in time, second generates goal-reaching backward trajectories, third improves those sequences with shortest-path-finding algorithms, and finally trains a neural network policy by imitation learning. We evaluate our method on a deterministic maze environment where the observations are $64\times 64$-pixel bird's-eye images, and show that it consistently reaches several goals.
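
To make the four-stage procedure concrete, the sketch below walks through it on a tabular stand-in. It is an illustration under stated assumptions, not the authors' implementation: every name (train_backward_world_model, generate_backward_trajectories, shorten_with_dijkstra) is hypothetical, a transition table replaces the learned neural backward world model over $64\times 64$-pixel observations, and actions are omitted, so the final imitation-learning stage is only indicated in a comment.

```python
import heapq
import random

def train_backward_world_model(transitions):
    # Stage 1: fit a backward dynamics model p(s_{t-1} | s_t).
    # The paper learns this with a neural network from pixel observations;
    # a simple predecessor table stands in here (assumption).
    backward = {}
    for s_prev, s_next in transitions:
        backward.setdefault(s_next, []).append(s_prev)
    return backward

def generate_backward_trajectories(backward_model, goal, num_rollouts=100, horizon=50):
    # Stage 2: roll the backward model out from the goal; each rollout,
    # read in reverse, is a trajectory that ends at the goal.
    trajectories = []
    for _ in range(num_rollouts):
        traj, state = [goal], goal
        for _ in range(horizon):
            predecessors = backward_model.get(state)
            if not predecessors:
                break
            state = random.choice(predecessors)
            traj.append(state)
        trajectories.append(traj[::-1])  # reorder as start -> ... -> goal
    return trajectories

def shorten_with_dijkstra(trajectories, goal):
    # Stage 3: pool all generated transitions into one unit-cost graph and
    # replace each trajectory by the shortest path from its start to the goal.
    graph = {}
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            graph.setdefault(a, set()).add(b)

    def shortest_path(start):
        dist, prev = {start: 0}, {}
        queue = [(0, 0, start)]  # (distance, tiebreak, state)
        counter = 1
        while queue:
            d, _, u = heapq.heappop(queue)
            if u == goal:
                path = [u]
                while u in prev:
                    u = prev[u]
                    path.append(u)
                return path[::-1]
            for v in graph.get(u, ()):
                if v not in dist or d + 1 < dist[v]:
                    dist[v], prev[v] = d + 1, u
                    heapq.heappush(queue, (d + 1, counter, v))
                    counter += 1
        return None  # goal unreachable from this start in the pooled graph

    shortened = (shortest_path(t[0]) for t in trajectories)
    return [p for p in shortened if p is not None]

# Stage 4 (not shown): train a goal-conditioned policy pi(a | s, goal) by
# imitation learning, i.e. supervised learning on the actions along the
# shortened paths. Action labels are omitted from this state-only sketch.
```

The point of the shortest-path stage is that raw backward rollouts can wander; pooling their transitions into a single graph and re-planning with Dijkstra's algorithm turns them into shorter demonstrations for the imitation-learning stage.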
