DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning (2402.18137v2)
Abstract: Multimodal pretraining is an effective strategy for the trinity of goals of representation learning in autonomous robots: 1) extracting both local and global task progressions; 2) enforcing temporal consistency of visual representation; 3) capturing trajectory-level language grounding. Most existing methods approach these via separate objectives, which often reach sub-optimal solutions. In this paper, we propose a universal unified objective that can simultaneously extract meaningful task progression information from image sequences and seamlessly align them with language instructions. We discover that via implicit preferences, where a visual trajectory inherently aligns better with its corresponding language instruction than with mismatched ones, the popular Bradley-Terry model can be transformed into a representation learning objective through proper reward reparameterizations. The resulting framework, DecisionNCE, mirrors an InfoNCE-style objective but is distinctively tailored for decision-making tasks, providing an embodied representation learning framework that elegantly extracts both local and global task progression features, with temporal consistency enforced through implicit time contrastive learning, while ensuring trajectory-level instruction grounding via multimodal joint encoding. Evaluation on both simulated and real robots demonstrates that DecisionNCE effectively facilitates diverse downstream policy learning tasks, offering a versatile solution for unified representation and reward learning. Project Page: https://2toinf.github.io/DecisionNCE/
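As a rough illustration of how a Bradley-Terry preference over matched versus mismatched trajectory-instruction pairs can take an InfoNCE-like form, the sketch below scores a trajectory by the similarity between its change in image embedding and the instruction embedding. This scoring rule, the helper names (`trajectory_score`, `bradley_terry_prob`, `infonce_style_loss`), and the toy embeddings are assumptions made for illustration only; they are not taken from the abstract and should not be read as the paper's exact objective.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def trajectory_score(start_emb, end_emb, text_emb):
    """Hypothetical trajectory-level score (an assumption, not the paper's
    stated reward): similarity between the change in image embedding
    (end minus start) and the instruction embedding."""
    delta = [e - s for s, e in zip(start_emb, end_emb)]
    return dot(delta, text_emb)


def bradley_terry_prob(score_preferred, score_other):
    """Bradley-Terry probability that the first item is preferred."""
    return 1.0 / (1.0 + math.exp(score_other - score_preferred))


def infonce_style_loss(starts, ends, texts):
    """Average cross-entropy in which trajectory i's own instruction is the
    positive and every other instruction in the batch is a negative."""
    n = len(texts)
    total = 0.0
    for i in range(n):
        scores = [trajectory_score(starts[i], ends[i], texts[j]) for j in range(n)]
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / n


# Toy 2-D embeddings (made up): each trajectory moves along "its" instruction.
starts = [[0.0, 0.0], [0.0, 0.0]]
ends = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = infonce_style_loss(starts, ends, texts)
mismatched_loss = infonce_style_loss(starts, ends, texts[::-1])
print(aligned_loss, mismatched_loss)
```

With a batch of two, each per-trajectory term reduces to the negative log of the Bradley-Terry probability that the matched instruction is preferred over the mismatched one, which is the sense in which the pairwise preference model and the contrastive loss coincide in this toy setting.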