Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video (2401.07567v2)
Abstract: Temporal Sentence Grounding in Video (TSGV) suffers from a dataset bias issue, caused by the uneven temporal distribution of target moments across samples with similar semantic components in the input videos or query texts. Existing methods resort to prior knowledge about bias to artificially break this uneven distribution, which removes only a limited number of salient language biases. In this work, we propose the bias-conflict sample synthesis and adversarial removal debias strategy (BSSARD), which dynamically generates bias-conflict samples by explicitly leveraging potentially spurious correlations between single-modality features and the temporal position of the target moments. Through adversarial training, its bias generators continuously introduce biases and generate bias-conflict samples to deceive its grounding model. Meanwhile, the grounding model continuously eliminates the introduced biases, which requires it to model multi-modality alignment information. BSSARD covers most kinds of coupling relationships and disrupts language and visual biases simultaneously. Extensive experiments on Charades-CD and ActivityNet-CD demonstrate the promising debiasing capability of BSSARD. Source codes are available at https://github.com/qzhb/BSSARD.
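The adversarial interplay described above (a bias generator synthesizing bias-conflict samples while the grounding model learns to ignore the single-modality shortcut) can be illustrated with a deliberately tiny toy, not the paper's actual architecture; the 1-D setup and all variable names below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D setup: the true moment center t equals the video cue v,
# but the query cue q also equals t in the training data, i.e. a spurious
# single-modality shortcut. The bias generator learns a shift `delta` that
# builds bias-conflict samples (q shifted, target t unchanged); to stay
# accurate on them, the grounder must push the shortcut weight w_q toward 0
# and rely on the video cue instead.

w_v, w_q = 0.2, 0.8   # grounder weights for video / query cues
delta = 0.1           # bias generator's learnable shift
lr = 0.05

for step in range(4000):
    v = rng.uniform(0.0, 1.0)
    q = t = v                         # spurious correlation in training data

    # Generator step: gradient *ascent* on the grounder's squared error
    # over the synthesized bias-conflict sample (q + delta, same target t).
    err_c = w_v * v + w_q * (q + delta) - t
    delta = float(np.clip(delta + lr * 2.0 * err_c * w_q, -1.0, 1.0))

    # Grounder step: gradient descent on original + bias-conflict samples.
    err_o = w_v * v + w_q * q - t
    err_c = w_v * v + w_q * (q + delta) - t
    w_v -= lr * 2.0 * (err_o + err_c) * v
    w_q -= lr * 2.0 * (err_o * q + err_c * (q + delta))

print(f"w_v={w_v:.2f} w_q={w_q:.2f}")  # shortcut weight w_q shrinks toward 0
```

The alternating min-max updates mirror the paper's high-level idea: the generator is rewarded for deceiving the grounder, so any reliance on the single-modality shortcut is continually exploited until the grounder abandons it.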
- Zhaobo Qi
- Yibo Yuan
- Xiaowen Ruan
- Shuhui Wang
- Weigang Zhang
- Qingming Huang