Bilinear value networks (2204.13695v3)
Abstract: The dominant framework for off-policy multi-goal reinforcement learning involves estimating a goal-conditioned Q-value function. When learning to achieve multiple goals, data efficiency is intimately connected with the generalization of the Q-function to new goals. The de facto paradigm is to approximate Q(s, a, g) using monolithic neural networks. To improve the generalization of the Q-function, we propose a bilinear decomposition that represents the Q-value via a low-rank approximation in the form of a dot product between two vector fields. The first vector field, f(s, a), captures the environment's local dynamics at the state s, whereas the second component, φ(s, g), captures the global relationship between the current state and the goal. We show that our bilinear decomposition scheme substantially improves data efficiency and transfers better to out-of-distribution goals than prior methods. Empirical evidence is provided on the simulated Fetch robot task suite and dexterous manipulation with a Shadow hand.
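A minimal sketch of the bilinear decomposition described in the abstract, written in PyTorch. The encoder architectures, layer sizes, and embedding dimension here are illustrative assumptions, not the authors' exact design; the only property taken from the abstract is that Q(s, a, g) is the dot product of two learned vector fields f(s, a) and φ(s, g).

```python
import torch
import torch.nn as nn

class BilinearQNetwork(nn.Module):
    """Q(s, a, g) = <f(s, a), phi(s, g)> -- a low-rank, goal-conditioned critic."""

    def __init__(self, state_dim, action_dim, goal_dim, embed_dim=64, hidden=256):
        super().__init__()
        # f(s, a): intended to capture local dynamics around the state s.
        self.f = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )
        # phi(s, g): intended to capture the global state-goal relationship.
        self.phi = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, state, action, goal):
        f_sa = self.f(torch.cat([state, action], dim=-1))
        phi_sg = self.phi(torch.cat([state, goal], dim=-1))
        # Bilinear (dot-product) combination of the two vector fields.
        return (f_sa * phi_sg).sum(dim=-1, keepdim=True)
```

In practice such a critic would replace the monolithic Q(s, a, g) network inside a standard off-policy goal-conditioned algorithm (e.g., DDPG with hindsight relabeling), leaving the rest of the training loop unchanged.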