SnAG: Scalable and Accurate Video Grounding (2404.02257v2)
Abstract: Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability: they are optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding, on the challenging MAD dataset, while achieving highly competitive results on short videos.
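The scalability argument in the abstract is easy to see in code. Below is a minimal PyTorch sketch (not SnAG's actual implementation; all names, shapes, and the conditioning scheme are hypothetical) of why late fusion scales to many queries: early fusion conditions the video encoder on each query and thus re-encodes the video once per query, while late fusion encodes the video once and fuses each query with the shared representation via a cheap similarity.

```python
# Minimal sketch contrasting early vs. late cross-modal fusion on
# pre-extracted features. Not SnAG's actual code; names are hypothetical.
import torch
import torch.nn as nn

T, Q, D = 512, 64, 256  # video clips, text queries, feature dim

video = torch.randn(1, T, D)   # one long video as a sequence of clip features
queries = torch.randn(Q, D)    # sentence-level embeddings of Q text queries

video_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2,
)

def early_fusion(video, queries):
    # The query is injected before the video encoder, so the video must be
    # re-encoded once per query: O(Q) encoder passes over T clips.
    outs = []
    for q in queries:
        fused = video + q.view(1, 1, D)   # naive query conditioning
        outs.append(video_encoder(fused))
    return torch.stack(outs)              # (Q, 1, T, D)

def late_fusion(video, queries):
    # The video is encoded once and shared by all queries; fusion is a
    # cheap per-clip similarity, so encoder cost is independent of Q.
    enc = video_encoder(video)                          # single pass
    scores = torch.einsum("btd,qd->qbt", enc, queries)  # per-clip relevance
    return scores                                       # (Q, 1, T)

print(early_fusion(video, queries).shape)  # torch.Size([64, 1, 512, 256])
print(late_fusion(video, queries).shape)   # torch.Size([64, 1, 512])
```

Under late fusion, the video-centric sampling described in the abstract becomes natural: a training batch can pair one video with many of its queries, so the single encoder pass over the video is amortized across all of them.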
- Localizing moments in video with natural language. In ICCV, 2017.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Localizing moments in long video via multimodal guidance. In ICCV, 2023.
- Soft-NMS: Improving object detection with one line of code. In ICCV, 2017.
- Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- Temporally grounding natural sentence in video. In EMNLP, 2018.
- Rethinking the bottom-up framework for query-based video localization. In AAAI, 2020.
- Semantic proposal for activity localization in videos via sentence query. In AAAI, 2019.
- Rethinking attention with performers. In ICLR, 2021.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. In CVPR, 2023.
- SlowFast networks for video recognition. In ICCV, 2019.
- TALL: Temporal activity localization via language query. In ICCV, 2017.
- Relation-aware video reading comprehension for temporal language grounding. In EMNLP, 2021.
- ExCL: Extractive clip localization using natural language descriptions. In ACL, 2019.
- Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
- CONE: An efficient coarse-to-fine alignment framework for long video temporal grounding. In ACL, 2023.
- Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- Dense-captioning events in videos. In ICCV, 2017.
- Detecting moments and highlights in videos via natural language queries. In NeurIPS, 2021.
- G2L: Semantically aligned and uniform video grounding via geodesic and game theory. In ICCV, 2023.
- Compositional temporal grounding with structured variational cross-graph correspondence learning. In CVPR, 2022.
- Proposal-free video grounding with contextual pyramid network. In AAAI, 2021.
- UniVTG: Towards unified video-language temporal grounding. In ICCV, 2023.
- Focal loss for dense object detection. In ICCV, 2017.
- Memory-guided semantic learning network for temporal sentence grounding. In AAAI, 2022.
- Adaptive proposal generation network for temporal sentence localization in videos. In EMNLP, 2021.
- Context-aware biaffine localizing network for temporal sentence grounding. In CVPR, 2021.
- Progressively guide to attend: An iterative alignment framework for temporal sentence grounding. In EMNLP, 2021.
- DEBUG: A dense bottom-up grounding approach for natural language video localization. In EMNLP, 2019.
- Learning activity progression in LSTMs for activity detection and early detection. In CVPR, 2016.
- Local-global video-text interactions for temporal grounding. In CVPR, 2020.
- Interventional video grounding with dual contrastive learning. In CVPR, 2021.
- Uncovering hidden challenges in query-based video moment retrieval. In BMVC, 2020.
- Scanning only once: An end-to-end framework for fast temporal grounding in long videos. In ICCV, 2023.
- GloVe: Global vectors for word representation. In EMNLP, 2014.
- Egocentric video-language pretraining. In NeurIPS, 2022.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Grounding action descriptions in videos. TACL, 2013.
- Proposal-free temporal moment localization of a natural-language query in video using guided attention. In WACV, 2020.
- Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR, 2016.
- Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
- A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, 2016.
- MAD: A scalable dataset for language grounding in videos from movie audio descriptions. In CVPR, 2022.
- VLG-net: Video-language graph matching network for video grounding. In ICCVW, 2021.
- FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
- Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
- Attention is all you need. In NeurIPS, 2017.
- Structured multi-level interaction network for video moment localization via language query. In CVPR, 2021.
- Temporally grounding language queries in videos by contextual boundary-aware prediction. In AAAI, 2020.
- Negative sample matters: A renaissance of metric learning for temporal grounding. In AAAI, 2022.
- Explore and match: End-to-end video grounding with transformer. arXiv preprint arXiv:2201.10168, 2022.
- Natural language video localization with learnable moment proposals. In EMNLP, 2021.
- Boundary proposal network for two-stage natural language video localization. In AAAI, 2021.
- Multilevel language and vision integration for text-to-clip retrieval. In AAAI, 2019.
- A closer look at temporal sentence grounding in videos: Dataset and metric. In ACM MM-HUMA Workshop, 2021.
- Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In NeurIPS, 2019.
- To find where you talk: Temporal sentence localization in video with attention based location regression. In AAAI, 2019.
- Dense regression network for video grounding. In CVPR, 2020.
- ActionFormer: Localizing moments of actions with transformers. In ECCV, 2022.
- MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In CVPR, 2019.
- Parallel attention network with sequence matching for video grounding. In Findings of ACL, 2021.
- Span-based localizing network for natural language video localization. In ACL, 2020.
- Natural language video moment localization through query-controlled temporal convolution. In WACV, 2022.
- Multi-stage aggregated transformer network for temporal language localization in videos. In CVPR, 2021.
- Learning 2D temporal adjacent networks for moment localization with natural language. In AAAI, 2020.
- Cascaded prediction network via segment tree for temporal video grounding. In CVPR, 2021.
- Distance-IoU loss: Faster and better learning for bounding box regression. In AAAI, 2020.
- Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. In CVPR, 2021.
- Rethinking the video sampling and reasoning strategies for temporal sentence grounding. In EMNLP Findings, 2022.
Authors: Fangzhou Mu, Sicheng Mo, Yin Li