SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction (2402.11792v2)

Published 19 Feb 2024 in cs.RO

Abstract: Linguistic ambiguity is ubiquitous in our daily lives. Previous works have adopted interaction between robots and humans for language disambiguation. Nevertheless, when interactive robots are deployed in everyday environments, natural human-robot interaction faces significant challenges stemming from complex and unpredictable visual inputs, open-ended interaction, and diverse user demands. In this paper, we present SInViG, a self-evolving interactive visual agent for natural-language human-robot interaction that aims to resolve language ambiguity, when present, through multi-turn visual-language dialogues. It continuously and automatically learns from unlabeled images and LLMs, without human intervention, to become more robust to visual and linguistic complexity. Benefiting from self-evolution, it sets a new state of the art on several interactive visual grounding benchmarks. Moreover, our human-robot interaction experiments show that the evolved models consistently win more preference from human users. We also deployed our model on a Franka robot for interactive manipulation tasks. Results demonstrate that our model can follow diverse user instructions and interact naturally with humans in natural language, despite the complexity of and disturbances in the environment.
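
The abstract outlines a multi-turn disambiguation loop: the agent grounds the user's instruction in the image and, while the referent remains ambiguous, asks clarifying questions before acting. Below is a minimal sketch of such a loop, assuming hypothetical `propose_candidates` and `ask_user` callables; it illustrates the general idea only and is not SInViG's actual interface or training procedure.

```python
# Hypothetical sketch of an interactive visual-grounding loop of the kind the
# abstract describes. The model interfaces (propose_candidates, ask_user) are
# illustrative stand-ins, not SInViG's real API.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    box: tuple          # (x1, y1, x2, y2) region proposed for the referred object
    description: str    # natural-language description of that region


def interactive_grounding(
    image,
    instruction: str,
    propose_candidates: Callable[[object, str, List[str]], List[Candidate]],
    ask_user: Callable[[str], str],
    max_turns: int = 5,
) -> Optional[Candidate]:
    """Resolve an ambiguous instruction through multi-turn dialogue.

    Each turn, the agent re-grounds the instruction given the dialogue so far.
    If more than one candidate remains, it asks a clarifying question and folds
    the answer back into the dialogue history.
    """
    dialogue: List[str] = [instruction]
    candidates: List[Candidate] = []
    for _ in range(max_turns):
        candidates = propose_candidates(image, instruction, dialogue)
        if len(candidates) <= 1:
            # Unambiguous (or nothing found): stop asking and act.
            return candidates[0] if candidates else None
        # Ask about whichever remaining candidates still conflict.
        question = (
            f"Do you mean {candidates[0].description} "
            f"or {candidates[1].description}?"
        )
        answer = ask_user(question)
        dialogue.extend([question, answer])
    # Question budget exhausted: fall back to the top-ranked candidate.
    return candidates[0] if candidates else None
```

The self-evolving component described in the abstract would sit outside a loop like this: dialogues of this form are generated automatically on unlabeled images, filtered or labeled with the help of LLMs, and used to further fine-tune the agent, with no human annotation in the loop.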
