Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras (2404.14064v2)
Abstract: The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real world, preventing access to all of the cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras during training to learn a policy that is robust to a reduction in the number of cameras and generalises to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks, whereas our approach, benefiting from multiple cameras during training, is able to solve the task using only the same single third-person camera.
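The shared/private split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear encoder, the dimensions, the mean-squared alignment loss, and all names below (`encode`, `split`, `alignment_loss`) are assumptions chosen for illustration, and the sketch assumes the policy consumes only the shared part of the embedding.

```python
import random


def encode(image, weights):
    """Toy linear encoder (hypothetical): one output feature per weight row."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]


def split(embedding, shared_dim):
    """Split an embedding into its (shared, private) parts."""
    return embedding[:shared_dim], embedding[shared_dim:]


def alignment_loss(shared_a, shared_b):
    """Mean squared distance between two shared parts.

    An illustrative stand-in for an alignment objective; the paper's
    actual auxiliary loss may differ.
    """
    return sum((a - b) ** 2 for a, b in zip(shared_a, shared_b)) / len(shared_a)


random.seed(0)
IMAGE_DIM, EMBED_DIM, SHARED_DIM = 8, 6, 3  # assumed toy sizes
weights = [[random.gauss(0, 1) for _ in range(IMAGE_DIM)] for _ in range(EMBED_DIM)]

# Two hypothetical camera observations of the same timestep.
first_person = [random.random() for _ in range(IMAGE_DIM)]
third_person = [random.random() for _ in range(IMAGE_DIM)]

shared_fp, private_fp = split(encode(first_person, weights), SHARED_DIM)
shared_tp, private_tp = split(encode(third_person, weights), SHARED_DIM)

# Training: pull the shared parts of different cameras together,
# leaving the private parts free to hold camera-specific detail.
loss = alignment_loss(shared_fp, shared_tp)

# Deployment: only the third-person camera is available, so the policy
# input is computed from that single camera's shared part.
policy_input = shared_tp
```

Because the policy input depends only on the shared part, any single training camera can in principle supply it once the shared parts are aligned, which is the intuition behind generalising from many cameras to one.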
Authors: Mhairi Dunion, Stefano V. Albrecht