Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks (2401.03177v1)

Published 6 Jan 2024 in cs.CV and cs.CL

Abstract: Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared with conventional text retrieval, the main obstacle in text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. Previous works primarily focus on aligning the query and the video by finely aggregating word-frame matching signals. Inspired by the human cognitive process of judging text-video relevance in a modular fashion, we argue that this judgment requires high-order matching signals because video content is consecutive and complex. In this paper, we propose chunk-level text-video matching, where query chunks are extracted to describe specific retrieval units and video chunks are segmented into distinct clips. We formulate chunk-level matching as n-ary correlation modeling between query words and video frames, and introduce a multi-modal hypergraph to capture these correlations: textual units and video frames are represented as nodes, and hyperedges depict their relationships. In this way, the query and the video can be aligned in a high-order semantic space. In addition, to enhance the model's generalization ability, the extracted features are passed through a variational inference component, yielding variational representations under a Gaussian distribution. The incorporation of hypergraphs and variational inference allows our model to capture complex, n-ary interactions between textual and visual content. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the text-video retrieval task.
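
The two components described in the abstract can be illustrated with a brief sketch. The PyTorch code below is not the authors' implementation; the module names, feature dimensions, the incidence-matrix message passing (in the spirit of hypergraph neural networks), and the Gaussian reparameterization head are illustrative assumptions about how n-ary correlation modeling over word and frame nodes and the variational component could be wired together.

# Minimal sketch (not the paper's code) of (i) hypergraph message passing over
# word/frame nodes connected by hyperedges and (ii) a variational head that
# produces a Gaussian latent representation via the reparameterization trick.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypergraphConv(nn.Module):
    """One round of node -> hyperedge -> node message passing.

    H is a binary incidence matrix of shape (num_nodes, num_edges):
    H[v, e] = 1 if node v belongs to hyperedge e.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        edge_deg = H.sum(dim=0).clamp(min=1)             # (E,) degree normalization
        node_deg = H.sum(dim=1).clamp(min=1)             # (V,)
        edge_feat = (H.t() @ x) / edge_deg[:, None]      # aggregate nodes into hyperedges
        node_feat = (H @ edge_feat) / node_deg[:, None]  # scatter back to nodes
        return F.relu(self.proj(node_feat))

class VariationalHead(nn.Module):
    """Maps pooled features to a Gaussian latent via reparameterization."""
    def __init__(self, dim: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, latent_dim)
        self.logvar = nn.Linear(dim, latent_dim)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl

# Toy usage: 5 word nodes + 8 frame nodes, 4 hyperedges (standing in for query
# chunks and video clips), 256-d features, random incidence matrix.
words, frames, dim = torch.randn(5, 256), torch.randn(8, 256), 256
x = torch.cat([words, frames], dim=0)        # (13, 256) node features
H = (torch.rand(13, 4) > 0.5).float()        # (13, 4) incidence matrix
nodes = HypergraphConv(dim)(x, H)
z, kl = VariationalHead(dim, 128)(nodes.mean(dim=0, keepdim=True))
print(z.shape, kl.shape)                     # torch.Size([1, 128]) torch.Size([1])

In a full model, the word and frame features would come from text and video encoders, the hyperedges would be built from extracted query chunks and segmented video clips, and a KL regularizer of the kind computed above would typically be added to the retrieval objective; those details are omitted in this sketch.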

Authors (9)
  1. Qian Li (236 papers)
  2. Lixin Su (15 papers)
  3. Jiashu Zhao (13 papers)
  4. Long Xia (25 papers)
  5. Hengyi Cai (20 papers)
  6. Suqi Cheng (17 papers)
  7. Hengzhu Tang (7 papers)
  8. Junfeng Wang (175 papers)
  9. Dawei Yin (165 papers)
