Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding (2403.11463v2)

Published 18 Mar 2024 in cs.CV

Abstract: Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches rely heavily on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations. Unlike previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns cross-modal feature alignment and temporal coordinate regression without timestamp labels, achieving concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches that learn complementary supervision. An Augmentation Branch directly regresses the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch captures the order-guided feature correspondence for localizing multiple sentences in a normal video. Extensive experiments demonstrate that our paradigm has superior practicality and flexibility for efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
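The Augmentation Branch described above regresses temporal boundaries directly, for which span-regression objectives combining an L1 term with a temporal-IoU term are a common choice in video grounding (analogous to the GIoU losses used in DETR-style detectors). The sketch below is illustrative only; the function names and loss weighting are assumptions, not the paper's actual implementation.

```python
def temporal_iou(pred, gt):
    """IoU between two 1-D temporal segments given as (start, end) in normalized time."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))  # length of the overlap
    union = (pe - ps) + (ge - gs) - inter        # combined extent minus overlap
    return inter / union if union > 0 else 0.0

def boundary_regression_loss(pred, gt, w_l1=1.0, w_iou=1.0):
    """Hypothetical combined objective: L1 on the endpoints plus (1 - temporal IoU)."""
    l1 = abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])
    return w_l1 * l1 + w_iou * (1.0 - temporal_iou(pred, gt))
```

With a perfect prediction the loss is zero; as the predicted segment drifts from the pseudo-video boundary, both terms grow, giving a dense gradient signal without requiring per-sentence timestamp labels.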

Authors (4)
  1. Chaolei Tan (8 papers)
  2. Jianhuang Lai (43 papers)
  3. Wei-Shi Zheng (148 papers)
  4. Jian-Fang Hu (21 papers)
Citations (2)
