Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks (2401.03177v1)
Abstract: Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared to conventional textual retrieval, the main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. Previous works primarily focus on aligning the query and the video by finely aggregating word-frame matching signals. Inspired by the human cognitive process of modularly judging the relevance between text and video, we observe that this judgment requires high-order matching signals due to the consecutive and complex nature of video content. In this paper, we propose chunk-level text-video matching, where query chunks are extracted to describe specific retrieval units and video chunks are segmented into distinct clips. We formulate chunk-level matching as n-ary correlation modeling between the words of the query and the frames of the video, and introduce a multi-modal hypergraph for this purpose. We construct the hypergraph by representing textual units and video frames as nodes and using hyperedges to depict their relationships, so that the query and the video can be aligned in a high-order semantic space. In addition, to enhance the model's generalization ability, the extracted features are fed into a variational inference component that produces variational representations under a Gaussian distribution. The combination of hypergraphs and variational inference allows our model to capture complex, n-ary interactions between textual and visual content. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the text-video retrieval task.
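The n-ary correlation modeling described above can be illustrated with a minimal sketch. Everything below is a toy assumption rather than the paper's actual architecture: the node counts, the incidence pattern, and the random weights are illustrative, the convolution follows the standard HGNN formulation (Feng et al., AAAI 2019), and the variational head is a plain Gaussian reparameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-modal node features: 4 query words + 6 video frames, d = 8.
# All sizes and weights here are illustrative, not from the paper.
n_words, n_frames, d = 4, 6, 8
X = rng.normal(size=(n_words + n_frames, d))

# Incidence matrix H (nodes x hyperedges): each hyperedge groups a text
# chunk with a video clip, modeling an n-ary word-frame correlation.
H = np.zeros((n_words + n_frames, 3))
H[[0, 1, 4, 5, 6], 0] = 1   # chunk 1: words 0-1 with frames 0-2
H[[2, 3, 7, 8], 1] = 1      # chunk 2: words 2-3 with frames 3-4
H[[0, 3, 9], 2] = 1         # chunk 3: a cross-chunk correlation

# One hypergraph convolution layer (HGNN-style):
#   X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta
W = np.eye(H.shape[1])                        # hyperedge weights
Dv = np.diag(1.0 / np.sqrt(H.sum(axis=1)))    # inverse sqrt node degrees
De = np.diag(1.0 / H.sum(axis=0))             # inverse hyperedge degrees
Theta = rng.normal(size=(d, d)) * 0.1         # learnable projection (random here)
X_hyper = Dv @ H @ W @ De @ H.T @ Dv @ X @ Theta

# Variational head: map each node to a Gaussian and sample via the
# reparameterization trick, z = mu + sigma * eps.
W_mu = rng.normal(size=(d, d)) * 0.1
W_logvar = rng.normal(size=(d, d)) * 0.1
mu, logvar = X_hyper @ W_mu, X_hyper @ W_logvar
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * eps           # variational node representation

print(z.shape)  # one Gaussian sample per word/frame node
```

In a full model, `z` for the query-side and video-side nodes would be pooled into chunk representations and scored for retrieval, with a KL term regularizing the Gaussians during training.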
- Qian Li
- Lixin Su
- Jiashu Zhao
- Long Xia
- Hengyi Cai
- Suqi Cheng
- Hengzhu Tang
- Junfeng Wang
- Dawei Yin