Object-Centric Instruction Augmentation for Robotic Manipulation (2401.02814v2)

Published 5 Jan 2024 in cs.RO and cs.CV

Abstract: Humans interpret scenes by recognizing both the identities and positions of objects in their observations. For a robot to perform tasks such as "pick and place", understanding both what the objects are and where they are located is crucial. While the former has been extensively discussed in the literature on using LLMs to enrich text descriptions, the latter remains underexplored. In this work, we introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic, information-dense language instructions with position cues. We utilize a Multi-modal LLM (MLLM) to weave knowledge of object locations into natural language instructions, thus aiding the policy network in mastering actions for versatile manipulation. Additionally, we present a feature reuse mechanism to integrate the vision-language features from an off-the-shelf pre-trained MLLM into policy networks. Through a series of simulated and real-world robotic tasks, we demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
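The two mechanisms described in the abstract can be illustrated with short sketches. First, a minimal Python sketch of the instruction-augmentation idea, assuming an off-the-shelf MLLM returns object names with normalized image locations that are then woven into the original instruction as coarse position cues. The names `ObjectCue`, `coarse_position`, and `augment_instruction`, the cue format, and the wording of the cues are illustrative assumptions, not the paper's actual prompts or outputs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectCue:
    """Hypothetical result of querying an off-the-shelf MLLM for grounding:
    an object name plus its normalized bounding-box center in the image."""
    name: str
    center: Tuple[float, float]  # (x, y), each in [0, 1]


def coarse_position(center: Tuple[float, float]) -> str:
    """Map a normalized image coordinate to a coarse textual position cue."""
    x, y = center
    horiz = "left" if x < 0.33 else "right" if x > 0.66 else "center"
    vert = "top" if y < 0.33 else "bottom" if y > 0.66 else "middle"
    return f"{vert}-{horiz}"


def augment_instruction(instruction: str, cues: List[ObjectCue]) -> str:
    """Weave object location cues into the original language instruction."""
    if not cues:
        return instruction
    clauses = [
        f"the {c.name} is at the {coarse_position(c.center)} of the image"
        for c in cues
    ]
    return f"{instruction} ({'; '.join(clauses)})"


if __name__ == "__main__":
    cues = [ObjectCue("red block", (0.2, 0.7)), ObjectCue("bowl", (0.8, 0.5))]
    print(augment_instruction("pick up the red block and place it in the bowl", cues))
    # -> "... (the red block is at the bottom-left of the image;
    #      the bowl is at the middle-right of the image)"
```

Second, a hedged sketch of the feature-reuse idea: vision-language features from a frozen, pre-trained MLLM are concatenated with robot proprioception and passed to a small policy head. The module name, feature dimensions, and MLP structure below are assumptions for illustration and do not reproduce the paper's actual policy network.

```python
import torch
import torch.nn as nn


class FeatureReusePolicy(nn.Module):
    """Illustrative policy head that reuses frozen MLLM vision-language
    features alongside proprioception (a sketch, not the paper's model)."""

    def __init__(self, vl_dim: int = 768, proprio_dim: int = 7, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vl_dim + proprio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, vl_features: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        # The MLLM features are treated as fixed inputs (no gradient flows back).
        return self.mlp(torch.cat([vl_features.detach(), proprio], dim=-1))
```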

Authors (11)
  1. Junjie Wen (19 papers)
  2. Yichen Zhu (51 papers)
  3. Minjie Zhu (14 papers)
  4. Jinming Li (20 papers)
  5. Zhiyuan Xu (47 papers)
  6. Zhengping Che (41 papers)
  7. Chaomin Shen (25 papers)
  8. Yaxin Peng (22 papers)
  9. Dong Liu (267 papers)
  10. Feifei Feng (23 papers)
  11. Jian Tang (327 papers)
Citations (13)