Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding (2312.13633v1)
Abstract: Temporal Video Grounding (TVG) aims to localize the temporal boundary of a specific segment in an untrimmed video based on a given language query. Since datasets in this domain are often gathered from a limited set of video scenes, models tend to overfit to scene-specific factors, leading to suboptimal performance when encountering new scenes in real-world applications. In a new scene, fine-grained annotations are often insufficient due to high labeling costs, whereas coarse-grained video-query pairs are easier to obtain. To address this issue and enhance model performance on new scenes, we explore the TVG task in an unsupervised domain adaptation (UDA) setting across scenes for the first time, where video-query pairs in the source scene (domain) are labeled with temporal boundaries while those in the target scene are not. Under the UDA setting, we introduce a novel Adversarial Multi-modal Domain Adaptation (AMDA) method to adaptively adjust the model's scene-related knowledge by incorporating insights from the target data. Specifically, we tackle the domain gap by utilizing domain discriminators, which help identify valuable scene-related features effective across both domains. Concurrently, we mitigate the semantic gap between modalities by aligning video-query pairs with related semantics. Furthermore, we employ a mask-reconstruction approach to enhance the understanding of temporal semantics within a scene. Extensive experiments on Charades-STA, ActivityNet Captions, and YouCook2 demonstrate the effectiveness of our proposed method.
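The domain-discriminator component described above follows the standard adversarial adaptation recipe: a discriminator is trained to tell source-scene features from target-scene features, while the feature extractor receives the sign-flipped (gradient-reversed) discriminator gradient, pushing features toward domain invariance. The sketch below is a minimal, hypothetical illustration of that mechanism with a logistic discriminator and manually computed gradients; it is not the paper's actual model, and the function names and the scaling factor `lam` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grl_step(feats, domain_labels, w, lam=0.5, lr=0.1):
    """One adversarial update (illustrative sketch, not the paper's code).

    The linear discriminator `w` descends its binary cross-entropy loss to
    better predict the domain label (0 = source, 1 = target), while the
    features ascend that same loss (gradient reversal, scaled by `lam`),
    making the two domains harder to distinguish.
    """
    logits = feats @ w
    p = sigmoid(logits)
    # gradient of mean BCE loss w.r.t. the logits
    g_logits = (p - domain_labels) / len(feats)
    g_w = feats.T @ g_logits            # gradient w.r.t. discriminator weights
    g_feats = np.outer(g_logits, w)     # gradient w.r.t. the input features
    w_new = w - lr * g_w                # discriminator: gradient descent
    feats_new = feats - lr * (-lam * g_feats)  # extractor: reversed gradient
    return feats_new, w_new
```

After one such step the discriminator's loss on the same features decreases, while repeated feature updates drive the two domains' feature distributions together; in a full model the reversed gradient would flow back into the shared video-query encoder rather than into raw features.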
- Haifeng Huang
- Yang Zhao
- Zehan Wang
- Yan Xia
- Zhou Zhao