SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge (2405.09713v2)

Published 15 May 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Learning commonsense reasoning from visual contexts and scenes in the real world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks remain inadequate: they were mainly designed for factual or situated reasoning and rarely involve broader real-world knowledge. Our work aims to push reasoning evaluation deeper, specifically into dynamic, open-world, and structured contextual knowledge. We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos. Answering each question requires understanding and applying both situated knowledge and general knowledge. To create the dataset, we propose an automatic and scalable generation method that produces question-answer pairs, knowledge graphs, and rationales by instructing combinations of LLMs and MLLMs. Concretely, we first extract observable situated entities, relations, and processes from the videos, and then extend beyond the visible content to open-world knowledge. Task generation proceeds through multiple dialogue iterations, and the outputs are subsequently corrected and refined with our designed self-promptings and demonstrations. With a corpus of both explicit situated facts and implicit commonsense, we generate the associated question-answer pairs and reasoning processes, followed by manual review for quality assurance. We evaluated recent mainstream large vision-LLMs on the benchmark and drew several insightful conclusions. For more information, please refer to our benchmark at www.bobbywu.com/SOKBench.
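
The abstract describes a multi-stage generation loop: extract situated facts from a video, ask an LLM/MLLM to extend them with open-world commonsense, draft question-answer pairs with rationales over several dialogue turns, and refine the drafts via self-prompting before manual review. The sketch below is a minimal, hypothetical illustration of that loop; the `call_llm` placeholder, the `Situation` data class, and the prompt wording are assumptions made for illustration, not the authors' released pipeline.

```python
# Hypothetical sketch of a SOK-Bench-style QA-generation loop.
# `call_llm` stands in for any chat-completion API, and the situated
# knowledge is assumed to arrive as a simple scene-graph-like record
# extracted from video annotations (entities, relations, actions).

from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM/MLLM chat call (assumed, not a real API)."""
    raise NotImplementedError


@dataclass
class Situation:
    entities: list[str]
    relations: list[str]                      # e.g. "person - holds - cup"
    actions: list[str]                        # e.g. "person pours water"
    open_world_facts: list[str] = field(default_factory=list)


def extend_with_open_world_knowledge(situation: Situation) -> Situation:
    """Ask the LLM for commonsense facts beyond the visible content."""
    prompt = (
        "Given these observed facts from a video:\n"
        + "\n".join(situation.relations + situation.actions)
        + "\nList related commonsense facts that are NOT directly visible."
    )
    situation.open_world_facts = call_llm(prompt).splitlines()
    return situation


def generate_qa(situation: Situation, rounds: int = 3) -> dict:
    """Draft one QA pair with a rationale, then refine it via self-prompting."""
    context = "\n".join(
        situation.relations + situation.actions + situation.open_world_facts
    )
    draft = call_llm(
        f"Write a reasoning question, answer, and rationale about:\n{context}"
    )
    for _ in range(rounds):
        # Self-prompting pass: ask the model to verify and fix its own draft.
        draft = call_llm(
            "Check this QA pair for consistency with the facts below and rewrite "
            f"it if anything is unsupported.\nFacts:\n{context}\nDraft:\n{draft}"
        )
    return {"qa_with_rationale": draft, "needs_manual_review": True}
```

Under this reading, the iterative self-check loop plays the role of the "self-promptings and demonstrations" mentioned in the abstract, and every generated item is still flagged for the manual review stage.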

Authors (8)
  1. Andong Wang (16 papers)
  2. Bo Wu (144 papers)
  3. Sunli Chen (6 papers)
  4. Zhenfang Chen (36 papers)
  5. Haotian Guan (4 papers)
  6. Wei-Ning Lee (6 papers)
  7. Li Erran Li (37 papers)
  8. Chuang Gan (195 papers)