
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes

Published 1 Apr 2024 in cs.CV, cs.AI, cs.CL, and cs.LG (arXiv:2404.01299v2)

Abstract: Causal video question answering (QA) has garnered increasing interest, yet existing datasets often lack depth in causal reasoning. To address this gap, we capitalize on the unique properties of cartoons and construct CausalChaos!, a novel, challenging causal Why-QA dataset built upon the iconic "Tom and Jerry" cartoon series. Cartoons follow principles of animation that let animators create expressive, unambiguous causal relationships between events that form a coherent storyline. Building on these properties, along with thought-provoking questions and multi-level answers (a short answer plus a detailed causal explanation), our questions involve causal chains that interconnect multiple dynamic interactions between characters and visual scenes. These factors demand that models solve more challenging, yet well-defined, causal relationships. We also introduce hard incorrect-answer mining, including a causally confusing version that is even more challenging. While models perform well overall, there remains much room for improvement, especially on open-ended answers. We identify more advanced and explicit causal-relationship modeling, together with joint modeling of vision and language, as the immediate areas for future effort. Alongside other complementary datasets, our new, challenging dataset will pave the way for these developments in the field.
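The abstract mentions hard incorrect-answer mining but gives no implementation details here. A common way to realize this idea is to rank candidate answers from other questions by semantic similarity to the ground-truth answer and keep the closest ones as distractors. Below is a minimal sketch of that approach using the public sentence-transformers library; the model name, the `mine_hard_negatives` helper, and the example answer pool are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of hard incorrect-answer mining: rank answers drawn
# from other questions by cosine similarity to the correct answer and keep
# the top-k as "hard" distractors. Model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def mine_hard_negatives(correct_answer: str, candidate_pool: list[str], k: int = 4) -> list[str]:
    """Return the k pool answers most semantically similar to the correct one."""
    target = model.encode(correct_answer, convert_to_tensor=True)
    pool = model.encode(candidate_pool, convert_to_tensor=True)
    scores = util.cos_sim(target, pool)[0]    # cosine similarity per candidate
    ranked = scores.argsort(descending=True)  # most similar candidates first
    return [candidate_pool[i] for i in ranked[:k]]

# Distractors semantically close to the true cause are harder to reject
# than random answers sampled from unrelated clips.
pool = [
    "Tom slips on a banana peel left by Jerry.",
    "Jerry hides inside the grandfather clock.",
    "Tom is startled by the loud alarm clock.",
    "The dog chases Tom out of the kitchen.",
    "Jerry pulls the rug out from under Tom.",
]
print(mine_hard_negatives("Tom slips because Jerry greased the floor.", pool, k=2))
```

The "causally confusing" variant described in the paper would presumably tighten this selection further, e.g., by preferring distractors that describe plausible but wrong causes within the same scene; that detail is not specified here.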
