
Commonsense for Zero-Shot Natural Language Video Localization (2312.17429v2)

Published 29 Dec 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements of up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.
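The abstract names two building blocks: a GCN that encodes a video-conditioned knowledge-graph subgraph, and cross-attention that enhances video features with those commonsense embeddings. The paper's actual architecture and dimensions are not reproduced here; the following is only a minimal NumPy sketch of those two generic operations, with all shapes, names, and initializations chosen for illustration.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN layer: ReLU(A_hat @ X @ W) with symmetric normalization."""
    a = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 (A+I) D^-1/2
    return np.maximum(a_norm @ feats @ weight, 0.0)         # ReLU

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: queries attend over keys/values."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ values

rng = np.random.default_rng(0)

# Hypothetical commonsense subgraph: 5 concept nodes, 16-d features.
adj = (rng.random((5, 5)) < 0.4).astype(float)
adj = np.maximum(adj, adj.T)                                 # symmetric edges
node_feats = rng.standard_normal((5, 16))
w = rng.standard_normal((16, 16)) * 0.1
commonsense_embed = gcn_layer(adj, node_feats, w)            # (5, 16)

# Hypothetical video representation: 8 clip features, 16-d each.
video_feats = rng.standard_normal((8, 16))
enhanced_video = cross_attention(video_feats, commonsense_embed,
                                 commonsense_embed)          # (8, 16)
```

In CORONET the analogous enhanced representations (for both video and pseudo-query) would then feed the localization head; this sketch only shows the shape-level mechanics of the two named components.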

