TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning (2402.19467v4)
Abstract: Models struggle to understand complex, multimodal content such as television clips, in part because video-LLMs often rely on single-modality reasoning and lack interpretability. To address these issues we propose TV-TREES, the first multimodal entailment tree generator. TV-TREES is an approach to video understanding that promotes interpretable joint-modality reasoning by searching for trees of entailment relationships between simple text-video evidence and higher-level conclusions that prove question-answer pairs. We also introduce the task of multimodal entailment tree generation to evaluate reasoning quality. On the challenging TVQA benchmark, our method achieves interpretable, state-of-the-art zero-shot performance on full clips, illustrating that multimodal entailment tree generation can be a best-of-both-worlds alternative to black-box systems.
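The core idea described in the abstract, recursively proving a claim either by grounding it in clip evidence (dialogue or video) or by decomposing it into simpler premises that jointly entail it, can be illustrated with a short sketch. The code below is not the authors' implementation: the `EVIDENCE` and `DECOMPOSITIONS` tables, the `Node` class, and `prove` are hypothetical stand-ins for the retrieval and LLM-based decomposition and entailment components such a system would use.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Toy "clip evidence": claim -> source modality. In a real system these
# would be retrieved from dialogue transcripts and visual frame descriptions.
EVIDENCE: Dict[str, str] = {
    "Monica says she is moving out.": "dialogue",
    "Monica is holding a cardboard box.": "video",
}

# Toy decomposition proposals: claim -> candidate sets of simpler premises.
# In a real system an LLM would generate these on the fly.
DECOMPOSITIONS: Dict[str, List[List[str]]] = {
    "Monica is packing to move out.": [
        ["Monica says she is moving out.",
         "Monica is holding a cardboard box."],
    ],
}

@dataclass
class Node:
    """One node in a (hypothetical) multimodal entailment tree."""
    claim: str
    modality: str = "derived"   # "dialogue", "video", or "derived"
    children: List["Node"] = field(default_factory=list)

def prove(claim: str, depth: int = 3) -> Optional[Node]:
    """Search for an entailment tree proving `claim`: leaves must be
    grounded in clip evidence; internal nodes are claims entailed by
    their children. Returns None if no proof is found."""
    if claim in EVIDENCE:                          # grounded leaf: done
        return Node(claim, modality=EVIDENCE[claim])
    if depth == 0:
        return None
    for premises in DECOMPOSITIONS.get(claim, []):
        subtrees = [prove(p, depth - 1) for p in premises]
        if all(t is not None for t in subtrees):
            # A real system would verify entailment with an NLI model here.
            return Node(claim, children=subtrees)
    return None

tree = prove("Monica is packing to move out.")
assert tree is not None and len(tree.children) == 2
```

In this framing, a question-answer pair would be converted into a root hypothesis claim, and a successfully completed tree constitutes an interpretable proof of the answer, with each leaf traceable to a specific piece of dialogue or video evidence.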
- MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
- Natural language deduction through search over statement compositions. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4871–4883, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Evaluation of text generation: A survey.
- Explainable video entailment with grounded visual evidence. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Explaining answers with entailment trees. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7358–7370, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- e-SNLI-VE: Corrected visual-textual entailment with natural language explanations. arXiv preprint arXiv:2004.03744.
- Graph-based multi-interaction network for video question answering. IEEE Transactions on Image Processing, 30:2758–2770.
- Explainable deep learning for video recognition tasks: A framework & recommendations. arXiv preprint arXiv:1909.05667.
- Ralph H. Johnson and J. Anthony Blair. 1977. Logical self-defense.
- Khushboo Khurana and Umesh Deshpande. 2021. Video question-answering techniques, benchmark datasets and evaluation metrics leveraging video captioning: a comprehensive survey. IEEE Access, 9:43799–43823.
- ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR.
- Large language models are temporal and causal reasoners for video question answering. arXiv preprint arXiv:2310.15747.
- TVQA: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696.
- TVQA+: Spatio-temporal grounding for video question answering. arXiv preprint arXiv:1904.11574.
- Adaptive hierarchical graph reasoning with semantic coherence for video-and-language inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1867–1877.
- MVBench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005.
- HERO: Hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200.
- Scene-text oriented visual entailment: Task, dataset and solution. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5562–5571.
- Towards visually explaining video understanding networks with perturbation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1120–1129.
- Visual instruction tuning.
- VIOLIN: A large-scale dataset for video-and-language inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10900–10910.
- Cross-attentional spatio-temporal semantic graph networks for video question answering. IEEE Transactions on Image Processing, 31:1684–1696.
- Dynamic multistep reasoning based on video scene graph for video question answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3894–3904.
- Automated evaluation of written discourse coherence using GPT-4. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 394–403.
- Entailment tree explanations via iterative retrieval-generation reasoner. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 465–475, Seattle, United States. Association for Computational Linguistics.
- Don’t explain without verifying veracity: An evaluation of explainable AI with video activity recognition. arXiv preprint arXiv:2005.02335.
- Revealing the illusion of joint multimodal understanding in VideoQA models. arXiv preprint arXiv:2306.08889.
- Explainable activity recognition in videos. In IUI Workshops, volume 2.
- Are vision-language transformers learning multimodal representations? A probing perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11248–11257.
- Long-form video-language pre-training with multimodal temporal contrastive learning. Advances in Neural Information Processing Systems, 35:38032–38045.
- Multimodal logical inference system for visual-textual entailment. arXiv preprint arXiv:1906.03952.
- Entailer: Answering questions with faithful and truthful chains of reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2078–2093, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Fine-grained visual entailment. In European Conference on Computer Vision, pages 398–416. Springer.
- DualVGR: A dual-visual graph reasoning unit for video question answering. IEEE Transactions on Multimedia, 24:3369–3380.
- Distilled dual-encoder model for vision-language understanding. arXiv preprint arXiv:2112.08723.
- Enhancing systematic decompositional natural language inference using informal logic. arXiv preprint arXiv:2402.14798.
- Nathaniel Weir and Benjamin Van Durme. 2023. Dynamic generation of grounded logical explanations in a neuro-symbolic expert system.
- Can i trust your answer? visually grounded video question answering. arXiv preprint arXiv:2309.01327.
- Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706.
- Zero-shot video question answering via frozen bidirectional language models. Advances in Neural Information Processing Systems, 35:124–141.
- Self-chained image-language model for video localization and question answering. arXiv preprint arXiv:2305.06988.
- Social-IQ: A question answering benchmark for artificial social intelligence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8807–8817.
- Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology.
- Open-ended video question answering via multi-modal conditional adversarial networks. IEEE Transactions on Image Processing, 29:3859–3870.
- Open-ended long-form video question answering via adaptive hierarchical reinforced networks. In IJCAI, volume 2, page 8.
- Video question answering: Datasets, algorithms and challenges. arXiv preprint arXiv:2203.01225.
- Explainable video action reasoning via prior knowledge and state transitions. In Proceedings of the 27th ACM International Conference on Multimedia, pages 521–529.
- Yeyun Zou and Qiyu Xie. 2020. A survey on VQA: Datasets and approaches. In 2020 2nd International Conference on Information Technology and Computer Application (ITCA), pages 289–297. IEEE.
Authors: Kate Sanders, Nathaniel Weir, and Benjamin Van Durme