EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models (2311.15596v2)

Published 27 Nov 2023 in cs.CV and cs.CL

Abstract: Vision-Language Models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
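The evaluation protocol described in the abstract, using GPT-4 as an automatic judge to perform single-answer grading of open-ended responses, can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' released evaluation code: the judging prompt, the 0-to-1 scoring scale, and the judge_answer helper are hypothetical, and it assumes the OpenAI Python client with an OPENAI_API_KEY set in the environment.

```python
# Minimal sketch of GPT-4 single-answer grading for open-ended VQA responses.
# The prompt template and 0-1 scale are illustrative assumptions; EgoThink's
# actual judging prompt and scale may differ.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer to a first-person visual question.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer's correctness on a scale from 0 to 1 and reply with
only the number."""


def judge_answer(question: str, reference: str, candidate: str) -> float:
    """Ask GPT-4 to score one open-ended answer against the reference."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate),
        }],
        temperature=0,  # deterministic grading
    )
    text = response.choices[0].message.content.strip()
    match = re.search(r"\d+(\.\d+)?", text)  # tolerate extra wording
    return float(match.group()) if match else 0.0


# Example usage: average per-question scores to obtain a dimension-level score.
# scores = [judge_answer(q, ref, out) for q, ref, out in eval_set]
# print(sum(scores) / len(scores))
```

In this sketch, per-question scores would be averaged within each of the benchmark's dimensions to produce the per-dimension results reported for each model.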

Authors (7)
  1. Sijie Cheng (23 papers)
  2. Zhicheng Guo (18 papers)
  3. Jingwen Wu (73 papers)
  4. Kechen Fang (2 papers)
  5. Peng Li (390 papers)
  6. Huaping Liu (97 papers)
  7. Yang Liu (2253 papers)