
MuTT: A Multimodal Trajectory Transformer for Robot Skills (2407.15660v2)

Published 22 Jul 2024 in cs.RO and cs.LG

Abstract: High-level robot skills represent an increasingly popular paradigm in robot programming. However, configuring the skills' parameters for a specific task remains a manual and time-consuming endeavor. Existing approaches for learning or optimizing these parameters often require numerous real-world executions or do not work in dynamic environments. To address these challenges, we propose MuTT, a novel encoder-decoder transformer architecture designed to predict environment-aware executions of robot skills by integrating vision, trajectory, and robot skill parameters. Notably, we pioneer the fusion of vision and trajectory, introducing a novel trajectory projection. Furthermore, we illustrate MuTT's efficacy as a predictor when combined with a model-based robot skill optimizer. This approach facilitates the optimization of robot skill parameters for the current environment, without the need for real-world executions during optimization. Designed for compatibility with any representation of robot skills, MuTT demonstrates its versatility across three comprehensive experiments, showcasing superior performance across two different skill representations.
