FunQA: Towards Surprising Video Comprehension (2306.14899v2)
Abstract: Surprising videos, such as funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) commonsense violations depicted in these videos. We introduce FunQA, a challenging video question-answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks, which focus on less surprising contexts, e.g., cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess a model's capability in counter-intuitive timestamp localization, detailed video description, and reasoning around counter-intuitiveness. We also pose higher-level tasks, such as assigning a fitting and vivid title to the video and scoring the video's creativity. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning a total of 24 video hours. Moreover, we propose FunMentor, an agent designed for Vision-Language Models (VLMs) that uses multi-turn dialogues to enhance models' understanding of counter-intuitiveness. Extensive experiments with existing VLMs demonstrate the effectiveness of FunMentor and reveal significant performance gaps on the FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.
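The abstract does not spell out FunMentor's internals; as a rough illustration only, the sketch below shows one way a multi-turn dialogue agent could wrap a VLM and iteratively refine its answer about a counter-intuitive moment. Every name here (MentorDialogue, vlm_answer, mentor_critique, the toy stubs, the stopping rule, and the example video ID) is an assumption for illustration, not the paper's actual method or data format.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical sketch of a FunMentor-style multi-turn refinement loop.
# (video_id, prompt) -> candidate answer
AnswerFn = Callable[[str, str], str]
# (question, answer) -> feedback text, or "" when the mentor is satisfied
CritiqueFn = Callable[[str, str], str]

@dataclass
class MentorDialogue:
    vlm_answer: AnswerFn
    mentor_critique: CritiqueFn
    max_turns: int = 3
    history: List[str] = field(default_factory=list)

    def run(self, video_id: str, question: str) -> str:
        prompt = question
        answer = self.vlm_answer(video_id, prompt)
        for turn in range(self.max_turns):
            feedback = self.mentor_critique(question, answer)
            self.history.append(f"turn {turn}: {answer!r} | feedback: {feedback!r}")
            if not feedback:  # mentor is satisfied -> stop early
                break
            # Fold the mentor's feedback into the next prompt and re-query the VLM.
            prompt = (f"{question}\nPrevious answer: {answer}\n"
                      f"Mentor feedback: {feedback}")
            answer = self.vlm_answer(video_id, prompt)
        return answer

def toy_vlm(video_id: str, prompt: str) -> str:
    # Stand-in for a real VLM call; answers better once feedback appears in the prompt.
    if "Mentor feedback" in prompt:
        return "The runner trips over an invisible hurdle, violating what we expect."
    return "A person is running on a track."

def toy_mentor(question: str, answer: str) -> str:
    # Stand-in critic: keeps pushing until the answer names the expectation violation.
    return "" if "expect" in answer else "Explain what violates commonsense expectations."

dialogue = MentorDialogue(vlm_answer=toy_vlm, mentor_critique=toy_mentor)
print(dialogue.run(video_id="H_0001", question="Why is this clip funny?"))
```

The point of the sketch is only the control flow the abstract implies: a critic repeatedly challenges the VLM's free-text answer until it explicitly addresses the counter-intuitive content, rather than accepting a surface-level description.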