Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding (2403.11463v2)
Abstract: Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches rely heavily on a large number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations. Unlike previous weakly-supervised grounding frameworks, which use multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns cross-modal feature alignment and temporal coordinate regression without timestamp labels, achieving concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches that learn complementary supervision. An Augmentation Branch directly regresses the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch captures the order-guided feature correspondence for localizing multiple sentences in a normal video. Extensive experiments demonstrate that our paradigm is practical and flexible, supporting efficient weakly-supervised or semi-supervised learning and outperforming state-of-the-art methods trained with the same or stronger supervision.
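To make the idea of regressing paragraph boundaries "within a pseudo video" concrete, the sketch below shows one plausible way such label-free supervision could arise. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the pseudo video is built, so the insertion scheme, the function names `make_pseudo_video` and `temporal_iou`, and the feature shapes are all hypothetical. The assumption is that clip features of a whole video are spliced into unrelated background features, so the span the paragraph occupies is known by construction.

```python
import numpy as np


def make_pseudo_video(video, background, rng):
    """Splice `video` (T x D clip features) into `background` (B x D).

    Hypothetical augmentation: the whole video is inserted at a random
    offset, so the paragraph's normalized (start, end) span inside the
    pseudo video is known without any human timestamp annotation.
    """
    T, B = video.shape[0], background.shape[0]
    offset = int(rng.integers(0, B + 1))  # insertion point in [0, B]
    pseudo = np.concatenate(
        [background[:offset], video, background[offset:]], axis=0
    )
    total = T + B
    return pseudo, (offset / total, (offset + T) / total)


def temporal_iou(pred, target):
    """Intersection-over-union of two 1-D (start, end) intervals."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


rng = np.random.default_rng(0)
video = np.ones((4, 8))        # stand-in features for the real video
background = np.zeros((6, 8))  # stand-in features for unrelated content
pseudo, span = make_pseudo_video(video, background, rng)
# `span` can serve as a free regression target for the paragraph boundaries.
```

Under this assumption, a branch trained on `pseudo` receives an exact boundary target (`span`), while a weight-sharing branch that sees the unmodified video would have to rely on alignment cues alone, which is the kind of complementary supervision the abstract describes.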
Authors: Chaolei Tan, Jianhuang Lai, Wei-Shi Zheng, Jian-Fang Hu