FunQA: Towards Surprising Video Comprehension (2306.14899v2)

Published 26 Jun 2023 in cs.CV, cs.AI, cs.CL, and cs.MM

Abstract: Surprising videos, such as funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) commonsense violations depicted in these videos. We introduce FunQA, a challenging video question-answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks, which focus on less surprising contexts, e.g., cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess the model's capability in counter-intuitive timestamp localization, detailed video description, and reasoning around counter-intuitiveness. We also pose higher-level tasks, such as attributing a fitting and vivid title to the video and scoring the video's creativity. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning a total of 24 video hours. Moreover, we propose FunMentor, an agent designed for Vision-Language Models (VLMs) that uses multi-turn dialogues to enhance models' understanding of counter-intuitiveness. Extensive experiments with existing VLMs demonstrate the effectiveness of FunMentor and reveal significant performance gaps for the FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.
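To make the task structure described in the abstract concrete, the sketch below shows one hypothetical way a free-text FunQA-style QA pair could be represented and turned into a prompt for a VLM. This is a minimal illustration only: the class name, field names, and task labels are assumptions inferred from the abstract, not the dataset's actual schema.

```python
# Illustrative sketch: field names and task labels are assumptions based on
# the tasks described in the abstract, not the actual FunQA data format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class FunQASample:
    """One hypothetical free-text QA pair tied to a surprising video clip."""
    video_id: str                 # identifier of the source clip
    subset: str                   # "HumorQA", "CreativeQA", or "MagicQA"
    task: str                     # e.g. "localization", "description",
                                  # "reasoning", "title", or "creativity_score"
    question: str                 # natural-language prompt posed to the model
    reference_answer: str         # human-written free-text answer
    timestamps: List[float] = field(default_factory=list)  # counter-intuitive moment(s), seconds


def build_prompt(sample: FunQASample) -> str:
    """Format a QA prompt a VLM could answer; the wording here is illustrative."""
    return (
        f"[{sample.subset} / {sample.task}] "
        f"Watch video {sample.video_id} and answer: {sample.question}"
    )


if __name__ == "__main__":
    demo = FunQASample(
        video_id="humor_0001",
        subset="HumorQA",
        task="reasoning",
        question="Why is the moment at 12.4s unexpected?",
        reference_answer="The performer pretends to trip but lands in a handstand.",
        timestamps=[12.4],
    )
    print(build_prompt(demo))
```

Because the answers are free text, outputs produced from prompts like this would typically be scored against the reference answers with text-similarity or model-based metrics rather than exact matching.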
