EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models (2311.15596v2)
Abstract: Vision-language models (VLMs) have recently shown promising results on traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed from selected clips of egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as an automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still have considerable room for improvement on first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research on embodied artificial intelligence and robotics.
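To make the GPT-4-as-judge protocol mentioned above concrete, the sketch below shows one way single-answer grading could be implemented: the judge model receives the question, a reference answer, and the candidate answer, and returns a numeric score. This is a minimal illustration only; the prompt wording, the 1–5 scale, and the `score_answer` helper are assumptions for this sketch, not the paper's exact grading setup.

```python
# Minimal sketch of single-answer grading with GPT-4 as the judge.
# The prompt text, the 1-5 scale, and the helper name are illustrative
# assumptions; they are not taken verbatim from the EgoThink paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an answer to a first-person visual question.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer from 1 (wrong) to 5 (fully correct) and reply with the number only."""


def score_answer(question: str, reference: str, candidate: str) -> int:
    """Ask the judge model for a single-answer grade and parse the score."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    return int(response.choices[0].message.content.strip())


if __name__ == "__main__":
    print(score_answer(
        "What am I holding in my right hand?",
        "A red coffee mug.",
        "You are holding a mug.",
    ))
```

Dimension-level results would then be averages of such per-question scores over the annotated question-answer pairs.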
Authors: Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu