SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval (2404.14066v2)
Abstract: The user base of short video apps has grown at an unprecedented rate in recent years, creating significant demand for video content analysis. In particular, text-video retrieval, which aims to find the best-matching videos in a vast corpus given a text description, is an essential function whose primary challenge is bridging the modality gap. However, most existing approaches treat texts merely as sequences of discrete tokens and neglect their syntax structure. Moreover, the abundant spatial and temporal cues in videos are often underutilized because they do not interact with the text. To address these issues, we argue that it is beneficial to use texts as guidance for focusing on relevant temporal frames and spatial regions within videos. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntactic hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance multi-modal interaction and alignment, we also use the syntax hierarchy to guide the similarity calculation. We evaluate our method on four public text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.
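The abstract describes deriving a text syntax hierarchy from the grammatical structure of captions and using it to guide both the visual representations and the similarity calculation; the cited spaCy toolkit (Honnibal et al., 2020) is a natural choice for the required parsing. As a minimal sketch only, assuming a simple sentence → verb-event → noun-phrase-argument layout that is not spelled out in this excerpt, the snippet below shows one way such a hierarchy could be extracted with spaCy's dependency parser; the function name, the three-level structure, and the argument-attachment heuristic are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch (not the authors' code): building a coarse syntax
# hierarchy for a caption with spaCy's dependency parser.
import spacy

# Small English pipeline; assumes `en_core_web_sm` has been downloaded via
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def build_syntax_hierarchy(caption: str) -> dict:
    """Return a nested structure: sentence -> verb events -> noun-phrase arguments."""
    doc = nlp(caption)
    hierarchy = {"sentence": caption, "events": []}
    for token in doc:
        if token.pos_ == "VERB":
            # Attach noun chunks whose syntactic head (or its head, for
            # prepositional objects) is this verb.
            args = [chunk.text for chunk in doc.noun_chunks
                    if chunk.root.head == token or chunk.root.head.head == token]
            hierarchy["events"].append({"verb": token.lemma_, "arguments": args})
    return hierarchy

if __name__ == "__main__":
    print(build_syntax_hierarchy("A man is slicing a tomato in the kitchen."))
    # Expected shape: {'sentence': ..., 'events': [{'verb': 'slice',
    #                  'arguments': ['A man', 'a tomato', 'the kitchen']}]}
```

Each level of such a hierarchy could then be paired with video features at a matching granularity (e.g., whole clip, frame window, spatial region), which is the role the abstract assigns to the syntax hierarchy when guiding visual aggregation and similarity computation.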
- C. Zhu, Q. Jia, W. Chen, Y. Guo, and Y. Liu, “Deep learning for video-text retrieval: a review,” Int. J. Multim. Inf. Retr., vol. 12, no. 1, p. 3, 2023.
- Y. Liu, P. Xiong, L. Xu, S. Cao, and Q. Jin, “Ts2-net: Token shift and selection transformer for text-video retrieval,” in European Conference on Computer Vision, ECCV, Proceedings, Part XIV, ser. Lecture Notes in Computer Science, vol. 13674. Springer, 2022, pp. 319–335.
- H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li, “Clip4clip: An empirical study of CLIP for end to end video clip retrieval and captioning,” Neurocomputing, vol. 508, pp. 293–304, 2022.
- Z. Gao, J. Liu, S. Chen, D. Chang, H. Zhang, and J. Yuan, “CLIP2TV: an empirical study on transformer-based methods for video-text retrieval,” CoRR, vol. abs/2111.05610, 2021.
- Q. Wang, Y. Zhang, Y. Zheng, P. Pan, and X. Hua, “Disentangled representation learning for text-video retrieval,” CoRR, vol. abs/2203.07111, 2022.
- Y. Ma, G. Xu, X. Sun, M. Yan, J. Zhang, and R. Ji, “X-CLIP: end-to-end multi-grained contrastive learning for video-text retrieval,” in MM: The ACM International Conference on Multimedia. ACM, 2022, pp. 638–647.
- R. Liu, J. Huang, G. Li, J. Feng, X. Wu, and T. H. Li, “Revisiting temporal modeling for clip-based image-to-video knowledge transferring,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. IEEE, 2023, pp. 6555–6564.
- P. Wu, X. He, M. Tang, Y. Lv, and J. Liu, “Hanet: Hierarchical alignment networks for video-text retrieval,” in MM ’21: ACM Multimedia Conference. ACM, 2021, pp. 3518–3527.
- M. Wray, G. Csurka, D. Larlus, and D. Damen, “Fine-grained action retrieval through multiple parts-of-speech embeddings,” in IEEE/CVF International Conference on Computer Vision, ICCV. IEEE, 2019, pp. 450–459.
- S. Chen, Y. Zhao, Q. Jin, and Q. Wu, “Fine-grained video-text retrieval with hierarchical graph reasoning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. Computer Vision Foundation / IEEE, 2020, pp. 10635–10644.
- H. Fang, P. Xiong, L. Xu, and Y. Chen, “Clip2video: Mastering video-text retrieval via image CLIP,” CoRR, vol. abs/2106.11097, 2021.
- M. Bain, A. Nagrani, G. Varol, and A. Zisserman, “Frozen in time: A joint video and image encoder for end-to-end retrieval,” in IEEE/CVF International Conference on Computer Vision, ICCV. IEEE, 2021, pp. 1708–1718.
- M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, “spaCy: Industrial-strength Natural Language Processing in Python,” 2020.
- Y. Huo, M. Zhang, G. Liu, H. Lu, Y. Gao, G. Yang, J. Wen, H. Zhang, B. Xu, W. Zheng, Z. Xi, Y. Yang, A. Hu, J. Zhao, R. Li, Y. Zhao, L. Zhang, Y. Song, X. Hong, W. Cui, D. Y. Hou, Y. Li, J. Li, P. Liu, Z. Gong, C. Jin, Y. Sun, S. Chen, Z. Lu, Z. Dou, Q. Jin, Y. Lan, W. X. Zhao, R. Song, and J. Wen, “Wenlan: Bridging vision and language by large-scale multi-modal pre-training,” CoRR, vol. abs/2103.06561, 2021.
- J. Li, R. R. Selvaraju, A. Gotmare, S. R. Joty, C. Xiong, and S. C. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” in Advances in Neural Information Processing Systems: Annual Conference on Neural Information Processing Systems, NeurIPS, 2021, pp. 9694–9705.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the International Conference on Machine Learning, ICML, ser. Proceedings of Machine Learning Research, vol. 139. PMLR, 2021, pp. 8748–8763.
- L. Yuan, D. Chen, Y. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li, C. Liu, M. Liu, Z. Liu, Y. Lu, Y. Shi, L. Wang, J. Wang, B. Xiao, Z. Xiao, J. Yang, M. Zeng, L. Zhou, and P. Zhang, “Florence: A new foundation model for computer vision,” CoRR, vol. abs/2111.11432, 2021.
- C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “Videobert: A joint model for video and language representation learning,” in IEEE/CVF International Conference on Computer Vision, ICCV. IEEE, 2019, pp. 7463–7472.
- L. Li, Y. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, “HERO: hierarchical encoder for video+language omni-representation pre-training,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP. Association for Computational Linguistics, 2020, pp. 2046–2065.
- A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic, “Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,” in IEEE/CVF International Conference on Computer Vision, ICCV. IEEE, 2019, pp. 2630–2640.
- X. Dong, Q. Guo, T. Gan, Q. Wang, J. Wu, X. Ren, Y. Cheng, and W. Chu, “SNP-S3: Shared network pre-training and significant semantic strengthening for various video-text tasks,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 4, pp. 2525–2535, 2024.
- C. Ma, H. Sun, Y. Rao, J. Zhou, and J. Lu, “Video saliency forecasting transformer,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 10, pp. 6850–6862, 2022.
- X. Yang, F. Lv, F. Liu, and G. Lin, “Self-training vision language berts with a unified conditional model,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 3560–3569, 2023.
- W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “VL-BERT: pre-training of generic visual-linguistic representations,” in International Conference on Learning Representations, ICLR. OpenReview.net, 2020.
- G. Li, N. Duan, Y. Fang, M. Gong, and D. Jiang, “Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training,” in The AAAI Conference on Artificial Intelligence, AAAI, The Innovative Applications of Artificial Intelligence Conference, IAAI, The AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI. AAAI Press, 2020, pp. 11336–11344.
- H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, X. Chen, and M. Zhou, “UniVL: A unified video and language pre-training model for multimodal understanding and generation,” CoRR, vol. abs/2002.06353, 2020.
- V. Gabeur, C. Sun, K. Alahari, and C. Schmid, “Multi-modal transformer for video retrieval,” in European Conference on Computer Vision, ECCV, Proceedings, Part IV, ser. Lecture Notes in Computer Science, vol. 12349. Springer, 2020, pp. 214–229.
- K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” Int. J. Comput. Vis., vol. 130, no. 9, pp. 2337–2348, 2022.
- Y. Yang, L. Jiao, X. Liu, F. Liu, S. Yang, L. Li, P. Chen, X. Li, and Z. Huang, “Dual wavelet attention networks for image classification,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 4, pp. 1899–1910, 2023.
- M. Wang, J. Xing, and Y. Liu, “Actionclip: A new paradigm for video action recognition,” CoRR, vol. abs/2109.08472, 2021.
- Y. Chen, H. Ge, Y. Liu, X. Cai, and L. Sun, “AGPN: action granularity pyramid network for video action recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 3912–3923, 2023.
- T. Yu, J. Yu, Z. Yu, Q. Huang, and Q. Tian, “Long-term video question answering via multimodal hierarchical memory attentive networks,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 3, pp. 931–944, 2021.
- B. Li, W. Zhang, M. Tian, G. Zhai, and X. Wang, “Blindly assess quality of in-the-wild videos via quality-aware pre-training and motion perception,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 9, pp. 5944–5958, 2022.
- J. Zhang, J. Shao, R. Cao, L. Gao, X. Xu, and H. T. Shen, “Action-centric relation transformer network for video question answering,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 1, pp. 63–74, 2022.
- W. Zhao, H. Wu, W. He, H. Bi, H. Wang, C. Zhu, T. Xu, and E. Chen, “Hierarchical multi-modal attention network for time-sync comment video recommendation,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 4, pp. 2694–2705, 2024.
- Q. Cao, H. Huang, M. Ren, and C. Yuan, “Concept-enhanced relation network for video visual relation inference,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 5, pp. 2233–2244, 2023.
- F. Zhang, R. Wang, F. Zhou, and Y. Luo, “ERM: energy-based refined-attention mechanism for video question answering,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 3, pp. 1454–1467, 2023.
- Y. Ou, Z. Chen, and F. Wu, “Multimodal local-global attention network for affective video content analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 5, pp. 1901–1914, 2021.
- J. Xu, B. Liu, Y. Chen, M. Cheng, and X. Shi, “MuLTI: Efficient video-and-language understanding with text-guided multiway-sampler and multiple choice modeling,” in AAAI Conference on Artificial Intelligence, AAAI Conference on Innovative Applications of Artificial Intelligence, IAAI Symposium on Educational Advances in Artificial Intelligence. AAAI Press, 2024, pp. 6297–6305.
- J. Dong, Y. Wang, X. Chen, X. Qu, X. Li, Y. He, and X. Wang, “Reading-strategy inspired visual representation learning for text-to-video retrieval,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 8, pp. 5680–5694, 2022.
- Z. Feng, Z. Zeng, C. Guo, and Z. Li, “Temporal multimodal graph transformer with global-local alignment for video-text retrieval,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 3, pp. 1438–1453, 2023.
- H. Xue, Y. Sun, B. Liu, J. Fu, R. Song, H. Li, and J. Luo, “Clip-vip: Adapting pre-trained image-text model to video-language representation alignment,” CoRR, vol. abs/2209.06430, 2022.
- S. Zhao, L. Zhu, X. Wang, and Y. Yang, “Centerclip: Token clustering for efficient text-video retrieval,” in SIGIR: The International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2022, pp. 970–981.
- S. K. Gorti, N. Vouitsis, J. Ma, K. Golestan, M. Volkovs, A. Garg, and G. Yu, “X-pool: Cross-modal language-video attention for text-video retrieval,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. IEEE, 2022, pp. 4996–5005.
- X. Cheng, H. Lin, X. Wu, F. Yang, and D. Shen, “Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss,” CoRR, vol. abs/2109.04290, 2021.
- J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, 2016, pp. 5288–5296.
- L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. C. Russell, “Localizing moments in video with natural language,” in IEEE International Conference on Computer Vision, ICCV. IEEE Computer Society, 2017, pp. 5804–5813.
- F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, 2015, pp. 961–970.
- D. L. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in The Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference. Association for Computational Linguistics, 2011, pp. 190–200.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, ICLR, Conference Track Proceedings, 2015.