
OSCaR: Object State Captioning and State Change Representation (2402.17128v4)

Published 27 Feb 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.
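To make the task format concrete, the sketch below iterates over OSCaR-style segment annotations and renders each one as an object caption with a before/after state change. This is a minimal illustration only: the file name and every field name (`object`, `state_before`, `state_after`) are hypothetical placeholders, not the actual schema used in the linked repository.

```python
# Hypothetical sketch of reading OSCaR-style annotations.
# File name and JSON field names are illustrative assumptions,
# not the repository's actual layout.
import json

def load_segments(path: str) -> list[dict]:
    """Load annotated video segments from a JSON file (assumed layout)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def describe(segment: dict) -> str:
    """Render one segment as an object caption plus its state change."""
    obj = segment.get("object", "unknown object")
    before = segment.get("state_before", "?")
    after = segment.get("state_after", "?")
    return f"{obj}: {before} -> {after}"

if __name__ == "__main__":
    # e.g. prints "onion: whole -> chopped" for a cooking segment
    for seg in load_segments("oscar_annotations.json"):
        print(describe(seg))
```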
