MuTT: A Multimodal Trajectory Transformer for Robot Skills (2407.15660v2)
Abstract: High-level robot skills represent an increasingly popular paradigm in robot programming. However, configuring the skills' parameters for a specific task remains a manual and time-consuming endeavor. Existing approaches for learning or optimizing these parameters often require numerous real-world executions or do not work in dynamic environments. To address these challenges, we propose MuTT, a novel encoder-decoder transformer architecture designed to predict environment-aware executions of robot skills by integrating vision, trajectory, and robot skill parameters. Notably, we pioneer the fusion of vision and trajectory, introducing a novel trajectory projection. Furthermore, we illustrate MuTT's efficacy as a predictor when combined with a model-based robot skill optimizer. This approach facilitates the optimization of robot skill parameters for the current environment, without the need for real-world executions during optimization. Designed for compatibility with any representation of robot skills, MuTT demonstrates its versatility across three comprehensive experiments, showcasing superior performance across two different skill representations.