Backward Learning for Goal-Conditioned Policies (2312.05044v2)
Published 8 Dec 2023 in cs.LG and cs.AI
Abstract: Can we learn policies in reinforcement learning without rewards? Can we learn a policy just by trying to reach a goal state? We answer these questions positively by proposing a multi-step procedure that first learns a world model that goes backward in time, secondly generates goal-reaching backward trajectories, thirdly improves those sequences using shortest path finding algorithms, and finally trains a neural network policy by imitation learning. We evaluate our method on a deterministic maze environment where the observations are $64\times 64$ pixel bird's-eye images, and show that it consistently reaches several goals.
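The four steps named in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: it assumes a small hypothetical grid maze with coordinate states instead of pixel observations, it replaces the learned backward world model with an exact inverse of the known dynamics, it uses Dijkstra's algorithm as the shortest-path step, and it stands in for the imitation-trained neural network with a lookup table. All names (`MAZE`, `backward_step`, `collect_backward_edges`, `shortest_path_policy`) are invented for illustration.

```python
import heapq
import random

# Hypothetical 5x5 maze: '#' is a wall, '.' is free space.
MAZE = [
    ".....",
    ".###.",
    "...#.",
    ".#.#.",
    ".#...",
]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (4, 4)


def is_free(cell):
    r, c = cell
    return 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] == "."


def step(state, action):
    """Deterministic forward dynamics; moving into a wall leaves the state unchanged."""
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if is_free(nxt) else state


def backward_step(state, rng):
    """Step 1 stand-in (assumption): an exact inverse model instead of a learned one.
    Samples a (previous_state, action) pair such that step(previous_state, action) == state."""
    candidates = []
    for action, (dr, dc) in ACTIONS.items():
        prev = (state[0] - dr, state[1] - dc)
        if is_free(prev):
            candidates.append((prev, action))
    return rng.choice(candidates)


def collect_backward_edges(goal, num_rollouts, horizon, seed=0):
    """Step 2: roll the backward model out from the goal and record every transition."""
    rng = random.Random(seed)
    edges = {}  # state -> set of (previous_state, action) pairs that lead into it
    for _ in range(num_rollouts):
        state = goal
        for _ in range(horizon):
            prev, action = backward_step(state, rng)
            edges.setdefault(state, set()).add((prev, action))
            state = prev
    return edges


def shortest_path_policy(edges, goal):
    """Step 3: Dijkstra from the goal over the recorded transitions (unit edge cost).
    Returns, for each visited state, the first action of a shortest recorded path."""
    dist = {goal: 0}
    policy = {}
    queue = [(0, goal)]
    while queue:
        d, state = heapq.heappop(queue)
        if d > dist[state]:
            continue  # stale queue entry
        for prev, action in sorted(edges.get(state, ())):
            if d + 1 < dist.get(prev, float("inf")):
                dist[prev] = d + 1
                policy[prev] = action
                heapq.heappush(queue, (d + 1, prev))
    return policy, dist


def reaches_goal(start, policy, max_steps=50):
    """Follow the policy forward in the real dynamics and report whether the goal is hit."""
    state = start
    for _ in range(max_steps):
        if state == GOAL:
            return True
        state = step(state, policy[state])
    return state == GOAL


edges = collect_backward_edges(GOAL, num_rollouts=50, horizon=30)
# Step 4 stand-in (assumption): the table below plays the role of the imitation targets
# on which a neural network policy would be trained.
policy, dist = shortest_path_policy(edges, GOAL)
print(f"states with a goal-reaching action: {len(policy)}")
```

Because every recorded edge is a true transition and each policy action moves to a state one step closer to the goal, following the table from any state it covers reaches the goal. States never visited by a backward rollout have no entry, which is one reason a generalizing policy network is useful in the pixel-based setting.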