
YTCommentQA: Video Question Answerability in Instructional Videos (2401.17343v1)

Published 30 Jan 2024 in cs.CV and cs.AI

Abstract: Instructional videos provide detailed how-to guides for various tasks, with viewers often posing questions regarding the content. Addressing these questions is vital for comprehending the content, yet receiving immediate answers is difficult. While numerous computational models have been developed for Video Question Answering (Video QA) tasks, they are primarily trained on questions generated based on video content, aiming to produce answers from within the content. However, in real-world situations, users may pose questions that go beyond the video's informational boundaries, highlighting the necessity to determine if a video can provide the answer. Discerning whether a question can be answered by video content is challenging due to the multi-modal nature of videos, where visual and verbal information are intertwined. To bridge this gap, we present the YTCommentQA dataset, which contains naturally-generated questions from YouTube, categorized by their answerability and required modality to answer -- visual, script, or both. Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning. The dataset is available at https://github.com/lgresearch/YTCommentQA.
