Weakly-Supervised Video Moment Retrieval via Regularized Two-Branch Proposal Networks with Erasing Mechanism (2311.13946v1)
Abstract: Video moment retrieval aims to identify the target moment in an untrimmed video according to a given sentence. Since temporal boundary annotations are extremely time-consuming to acquire, the weakly-supervised setting, where only video-sentence pairs are available during training, has attracted increasing attention. Most existing weakly-supervised methods adopt a MIL-based framework to develop inter-sample confrontment, but neglect the intra-sample confrontment between moments with similar semantics; as a result, they fail to distinguish the correct moment from plausible negative moments. Furthermore, previous attention models for cross-modal interaction tend to focus excessively on a few dominant words, ignoring the comprehensive video-sentence correspondence. In this paper, we propose a novel Regularized Two-Branch Proposal Network with an Erasing Mechanism that considers inter-sample and intra-sample confrontments simultaneously. Concretely, we first devise a language-aware visual filter to generate an enhanced and a suppressed video stream. We then design a sharable two-branch proposal module that generates positive proposals from the enhanced stream and plausible negative proposals from the suppressed stream, contributing to sufficient confrontment. Besides, we introduce an attention-guided dynamic erasing mechanism in the enhanced branch to discover complementary video-sentence relations. Moreover, we apply two types of proposal regularization to stabilize training and improve model performance. Extensive experiments on the ActivityCaption, Charades-STA, and DiDeMo datasets demonstrate the effectiveness of our method.
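To make the two-branch idea concrete, below is a minimal sketch (not the authors' released code) of how a language-aware visual filter could split a video into enhanced and suppressed streams, with a single weight-shared scorer producing positive proposals from one stream and plausible intra-sample negatives from the other. All module names, dimensions, the sigmoid relevance gate, the mean-pooled proposal features, and the margin-ranking loss are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumptions throughout): language-aware filtering into
# enhanced/suppressed streams + a shared proposal scorer applied to both.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageAwareFilter(nn.Module):
    """Scores each frame's relevance to the sentence, then emits an enhanced
    stream (relevant frames emphasized) and a suppressed stream
    (relevant frames down-weighted)."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, frames, sentence):
        # frames: (B, T, D), sentence: (B, D)
        sent = sentence.unsqueeze(1).expand(-1, frames.size(1), -1)
        rel = torch.sigmoid(self.score(torch.cat([frames, sent], dim=-1)))  # (B, T, 1)
        enhanced = frames * rel            # emphasize query-relevant frames
        suppressed = frames * (1.0 - rel)  # keep mostly query-irrelevant content
        return enhanced, suppressed


class SharedProposalScorer(nn.Module):
    """Weight-tied scorer applied to both branches; a proposal here is simply
    a mean-pooled clip over a candidate segment."""

    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, stream, sentence, segments):
        scores = []
        for s, e in segments:  # candidate temporal proposals (start, end)
            clip = stream[:, s:e].mean(dim=1)                  # (B, D)
            scores.append(self.mlp(torch.cat([clip, sentence], dim=-1)))
        return torch.cat(scores, dim=-1)                       # (B, num_proposals)


if __name__ == "__main__":
    B, T, D = 2, 32, 256
    frames, sentence = torch.randn(B, T, D), torch.randn(B, D)
    filt, scorer = LanguageAwareFilter(D), SharedProposalScorer(D)
    enhanced, suppressed = filt(frames, sentence)
    segments = [(0, 8), (8, 16), (16, 32)]
    pos_scores = scorer(enhanced, sentence, segments)    # positive proposals
    neg_scores = scorer(suppressed, sentence, segments)  # plausible intra-sample negatives
    # Intra-sample confrontment: the best positive proposal should outscore
    # the best suppressed-branch negative by a margin (0.2 chosen arbitrarily).
    loss = F.relu(0.2 + neg_scores.max(dim=-1).values - pos_scores.max(dim=-1).values).mean()
    print(pos_scores.shape, neg_scores.shape, loss.item())
```

In the same spirit, inter-sample confrontment would compare proposals against sentences drawn from other videos in the batch, and the attention-guided erasing mechanism would mask the most-attended words or frames in the enhanced branch to force the model to exploit the remaining, complementary evidence; neither is spelled out in this sketch.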
Authors: Haoyuan Li, Zhou Zhao, Zhu Zhang, Zhijie Lin