Weakly-Supervised Video Moment Retrieval via Regularized Two-Branch Proposal Networks with Erasing Mechanism (2311.13946v1)

Published 23 Nov 2023 in cs.MM

Abstract: Video moment retrieval aims to identify the target moment in an untrimmed video according to a given sentence. Since temporal boundary annotations are extremely time-consuming to acquire, the weakly-supervised setting, in which only video-sentence pairs are available during training, has attracted increasing attention. Most existing weakly-supervised methods adopt a MIL-based framework to develop inter-sample confrontment but neglect the intra-sample confrontment between moments with similar semantics; as a result, they fail to distinguish the correct moment from plausible negative moments. Further, previous attention models for cross-modal interaction tend to focus excessively on a few dominant words, ignoring the comprehensive video-sentence correspondence. In this paper, we propose a novel Regularized Two-Branch Proposal Network with an Erasing Mechanism that considers inter-sample and intra-sample confrontments simultaneously. Concretely, we first devise a language-aware visual filter to generate both enhanced and suppressed video streams. We then design a sharable two-branch proposal module that generates positive proposals from the enhanced branch and plausible negative proposals from the suppressed branch, contributing to sufficient confrontment. Besides, we introduce an attention-guided dynamic erasing mechanism in the enhanced branch to discover complementary video-sentence relations. Moreover, we apply two types of proposal regularization to stabilize training and improve model performance. Extensive experiments on the ActivityCaption, Charades-STA and DiDeMo datasets show the effectiveness of our method.
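The abstract describes the architecture only at a high level. The sketch below is a minimal illustration (not the authors' released code) of the core idea: a language-aware gate splits frame features into an enhanced and a suppressed stream, and a single shared scorer ranks positive proposals from the enhanced branch against plausible negatives from the suppressed branch. All module names, dimensions, and the toy proposal pooling are assumptions made for illustration; the erasing mechanism and proposal regularization are omitted.

```python
# Hedged sketch of the enhanced/suppressed two-branch idea, assuming frame
# features of shape (T, d) and a single sentence embedding of shape (d,).
import torch
import torch.nn as nn

class LanguageAwareFilter(nn.Module):
    """Scores each frame against the sentence and emits enhanced/suppressed streams."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, frames, query):
        # frames: (T, dim), query: (dim,)
        q = query.unsqueeze(0).expand_as(frames)                              # (T, dim)
        gate = torch.sigmoid(self.score(torch.cat([frames, q], dim=-1)))      # (T, 1)
        enhanced = gate * frames          # query-relevant content boosted
        suppressed = (1.0 - gate) * frames  # query-relevant content damped
        return enhanced, suppressed

class SharedProposalScorer(nn.Module):
    """One scorer shared by both branches, so positive and plausible-negative
    proposals are ranked in the same space (the 'sharable' two-branch module)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, proposal_feats):
        # proposal_feats: (num_proposals, dim)
        return self.mlp(proposal_feats).squeeze(-1)

# Toy usage: 32 frames, 256-d features, 8 candidate proposals per branch.
T, d, P = 32, 256, 8
frames, query = torch.randn(T, d), torch.randn(d)
filt, scorer = LanguageAwareFilter(d), SharedProposalScorer(d)
enh, sup = filt(frames, query)

# Fake proposal pooling via strided segment means; the paper enumerates real
# temporal segment proposals instead.
def pool(stream):
    return torch.stack([stream[i::P].mean(0) for i in range(P)])

pos_scores, neg_scores = scorer(pool(enh)), scorer(pool(sup))

# A margin loss of this shape would realize the intra-sample confrontment:
# push the best enhanced-branch proposal above the best suppressed-branch one.
loss = torch.clamp(0.2 - pos_scores.max() + neg_scores.max(), min=0.0)
```

In the actual method, the attention-guided erasing mechanism additionally re-processes the enhanced branch with the most-attended query words masked so that the model cannot rely on a few dominant words, and the hinge term above only hints at the full inter-sample plus intra-sample training objective.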

Authors (4)
  1. Haoyuan Li (62 papers)
  2. Zhou Zhao (219 papers)
  3. Zhu Zhang (39 papers)
  4. Zhijie Lin (30 papers)