Commonsense for Zero-Shot Natural Language Video Localization (2312.17429v2)
Abstract: Zero-shot Natural Language Video Localization (NLVL) methods have exhibited promising results by training NLVL models exclusively on raw video data, dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolutional Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements of up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.
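To make the enhancement module concrete, below is a minimal PyTorch sketch of the general pattern the abstract describes: a GCN encodes commonsense concept embeddings drawn from a knowledge graph, and a cross-attention block then enriches video (or pseudo-query) features with those concepts. The class names (`GCNLayer`, `CommonsenseEnhancer`), the two-layer GCN depth, and the residual connection are illustrative assumptions, not the paper's actual implementation; the video-conditioned extraction of the concept subgraph is taken as a given input here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj_norm):
        # node_feats: (num_nodes, dim); adj_norm: normalized (num_nodes, num_nodes) adjacency
        return F.relu(self.linear(adj_norm @ node_feats))


class CommonsenseEnhancer(nn.Module):
    """Encode knowledge-graph concepts with a GCN, then let video or
    pseudo-query features attend to the concept embeddings (a sketch,
    not the CORONET reference implementation)."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.gcn1 = GCNLayer(dim, dim)
        self.gcn2 = GCNLayer(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats, concept_feats, adj_norm):
        # feats: (batch, seq_len, dim) video or pseudo-query features.
        # concept_feats: (num_nodes, dim) embeddings of concepts that were
        # (in the paper's terms) extracted conditioned on the video.
        concepts = self.gcn2(self.gcn1(concept_feats, adj_norm), adj_norm)
        concepts = concepts.unsqueeze(0).expand(feats.size(0), -1, -1)
        enhanced, _ = self.cross_attn(query=feats, key=concepts, value=concepts)
        return feats + enhanced  # residual keeps the original modality signal
```

As a usage sketch, the same module can enhance both streams before localization: given video features `v` of shape `(B, T, D)` and pseudo-query features `q` of shape `(B, L, D)`, compute `v = enhancer(v, concepts, adj)` and `q = enhancer(q, concepts, adj)`, then feed both to any standard boundary-prediction head.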