SnAG: Scalable and Accurate Video Grounding (2404.02257v2)

Published 2 Apr 2024 in cs.CV

Abstract: Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding on the challenging MAD dataset, while achieving highly competitive results on short videos.
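The scalability argument in the abstract hinges on where fusion happens: early fusion re-runs a joint video-text encoder for every (video, query) pair, while late fusion encodes each modality once and combines them only at the scoring step. A minimal numpy sketch of the late-fusion cost structure (the encoders here are hypothetical linear projections standing in for the paper's actual networks; `W_v`, `W_t`, and all shapes are illustrative assumptions, not SnAG's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(clips, W_v):
    """Unimodal video encoder (stand-in): one pass over T clip features."""
    return clips @ W_v                      # (T, d) -> (T, d)

def encode_query(query, W_t):
    """Unimodal text encoder (stand-in): one pass per query."""
    return query @ W_t                      # (d,) -> (d,)

def late_fusion_scores(video_repr, query_reprs):
    """Fuse only at the end: a clip-query similarity map.

    Early fusion would re-run a joint encoder for every (clip window,
    query) pair, i.e. O(T * Q) encoder passes; late fusion needs only
    O(T + Q) passes plus this cheap similarity step, which is why it
    scales to long videos with hundreds of queries.
    """
    return video_repr @ query_reprs.T       # (T, Q) score map

T, Q, d = 128, 16, 32                       # clips, queries, feature dim
clips = rng.standard_normal((T, d))
queries = rng.standard_normal((Q, d))
W_v = rng.standard_normal((d, d))
W_t = rng.standard_normal((d, d))

video_repr = encode_video(clips, W_v)       # computed once, shared by all queries
query_reprs = np.stack([encode_query(q, W_t) for q in queries])
scores = late_fusion_scores(video_repr, query_reprs)
print(scores.shape)                         # (128, 16)
```

The same structure motivates the paper's video-centric sampling: since the video representation is shared across queries, batching many queries against one video amortizes the expensive video encoding during training.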

References (71)
  1. Localizing moments in video with natural language. In ICCV, 2017.
  2. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  3. Localizing moments in long video via multimodal guidance. In ICCV, 2023.
  4. Soft-NMS: Improving object detection with one line of code. In ICCV, 2017.
  5. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  6. Temporally grounding natural sentence in video. In EMNLP, 2018.
  7. Rethinking the bottom-up framework for query-based video localization. In AAAI, 2020.
  8. Semantic proposal for activity localization in videos via sentence query. In AAAI, 2019.
  9. Rethinking attention with performers. In ICLR, 2021.
  10. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  11. You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. In CVPR, 2023.
  12. SlowFast networks for video recognition. In ICCV, 2019.
  13. TALL: Temporal activity localization via language query. In ICCV, 2017.
  14. Relation-aware video reading comprehension for temporal language grounding. In EMNLP, 2021.
  15. ExCL: Extractive clip localization using natural language descriptions. In ACL, 2019.
  16. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
  17. CONE: An efficient coarse-to-fine alignment framework for long video temporal grounding. In ACL, 2023.
  18. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  19. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  20. Dense-captioning events in videos. In ICCV, 2017.
  21. Detecting moments and highlights in videos via natural language queries. NeurIPS, 2021.
  22. G2L: Semantically aligned and uniform video grounding via geodesic and game theory. In ICCV, 2023.
  23. Compositional temporal grounding with structured variational cross-graph correspondence learning. In CVPR, 2022.
  24. Proposal-free video grounding with contextual pyramid network. In AAAI, 2021.
  25. UniVTG: Towards unified video-language temporal grounding. In ICCV, 2023.
  26. Focal loss for dense object detection. In ICCV, 2017.
  27. Memory-guided semantic learning network for temporal sentence grounding. In AAAI, 2022.
  28. Adaptive proposal generation network for temporal sentence localization in videos. In EMNLP, 2021.
  29. Context-aware biaffine localizing network for temporal sentence grounding. In CVPR, 2021.
  30. Progressively guide to attend: An iterative alignment framework for temporal sentence grounding. In EMNLP, 2021.
  31. DEBUG: A dense bottom-up grounding approach for natural language video localization. In EMNLP, 2019.
  32. Learning activity progression in LSTMs for activity detection and early detection. In CVPR, 2016.
  33. Local-global video-text interactions for temporal grounding. In CVPR, 2020.
  34. Interventional video grounding with dual contrastive learning. In CVPR, 2021.
  35. Uncovering hidden challenges in query-based video moment retrieval. In BMVC, 2020.
  36. Scanning only once: An end-to-end framework for fast temporal grounding in long videos. In ICCV, 2023.
  37. GloVe: Global vectors for word representation. In EMNLP, 2014.
  38. Egocentric video-language pretraining. NeurIPS, 2022.
  39. Learning transferable visual models from natural language supervision. In ICML, 2021.
  40. Grounding action descriptions in videos. TACL, 2013.
  41. Proposal-free temporal moment localization of a natural-language query in video using guided attention. In WACV, 2020.
  42. Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR, 2016.
  43. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
  44. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, 2016.
  45. MAD: A scalable dataset for language grounding in videos from movie audio descriptions. In CVPR, 2022.
  46. VLG-net: Video-language graph matching network for video grounding. In ICCVW, 2021.
  47. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
  48. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  49. Attention is all you need. NeurIPS, 2017.
  50. Structured multi-level interaction network for video moment localization via language query. In CVPR, 2021.
  51. Temporally grounding language queries in videos by contextual boundary-aware prediction. In AAAI, 2020.
  52. Negative sample matters: A renaissance of metric learning for temporal grounding. In AAAI, 2022.
  53. Explore and match: End-to-end video grounding with transformer. arXiv preprint arXiv:2201.10168, 2022.
  54. Natural language video localization with learnable moment proposals. In EMNLP, 2021.
  55. Boundary proposal network for two-stage natural language video localization. In AAAI, 2021.
  56. Multilevel language and vision integration for text-to-clip retrieval. In AAAI, 2019.
  57. A closer look at temporal sentence grounding in videos: Dataset and metric. In ACM MM-HUMA Workshop, 2021.
  58. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. NeurIPS, 2019.
  59. To find where you talk: Temporal sentence localization in video with attention based location regression. In AAAI, 2019.
  60. Dense regression network for video grounding. In CVPR, 2020.
  61. ActionFormer: Localizing moments of actions with transformers. In ECCV, 2022.
  62. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In CVPR, 2019.
  63. Parallel attention network with sequence matching for video grounding. In Findings of ACL, 2021.
  64. Span-based localizing network for natural language video localization. In ACL, 2020.
  65. Natural language video moment localization through query-controlled temporal convolution. In WACV, 2022.
  66. Multi-stage aggregated transformer network for temporal language localization in videos. In CVPR, 2021.
  67. Learning 2d temporal adjacent networks for moment localization with natural language. In AAAI, 2020.
  68. Cascaded prediction network via segment tree for temporal video grounding. In CVPR, 2021.
  69. Distance-IoU loss: Faster and better learning for bounding box regression. In AAAI, 2020.
  70. Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. In CVPR, 2021.
  71. Rethinking the video sampling and reasoning strategies for temporal sentence grounding. In EMNLP Findings, 2022.
Authors
  1. Fangzhou Mu
  2. Sicheng Mo
  3. Yin Li
