Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

Published 26 Mar 2024 in cs.CV (arXiv:2403.17998v1)

Abstract: The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single, deterministic text embedding may be insufficiently expressive to capture the video embedding and support retrieval. In this study, we propose a new stochastic text modeling method, T-MASS, i.e., text is modeled as a stochastic embedding, which enriches the text embedding with a flexible and resilient semantic range, yielding a text mass. Specifically, we introduce a similarity-aware radius module to adapt the scale of the text mass to the given text-video pair. In addition, we design a support text regularization to further control the text mass during training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables the determination of precise text embeddings for relevant pairs. Our experimental results show a substantial improvement of T-MASS over the baseline (3% to 6.3% in R@1). T-MASS also achieves state-of-the-art performance on five benchmark datasets: MSRVTT, LSMDC, DiDeMo, VATEX, and Charades.
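The core idea in the abstract can be illustrated with a minimal sketch: instead of a single point, the text is represented as a "mass" of points around its embedding, and at inference each video is scored by the best similarity over samples drawn from that mass. This is a hypothetical NumPy illustration of the general technique, not the paper's actual implementation; the function names, Gaussian parameterization, and max-over-samples scoring are assumptions for exposition (the paper's radius module and regularization are learned, which is omitted here).

```python
import numpy as np

def stochastic_text_embeddings(text_emb, radius, n_samples=8, rng=None):
    """Draw samples from a 'text mass': Gaussian perturbations of the
    deterministic text embedding, scaled by a radius (illustrative only;
    T-MASS learns this radius from the text-video pair)."""
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal((n_samples, text_emb.shape[-1]))
    return text_emb + radius * noise

def retrieve(text_emb, radius, video_embs, n_samples=8, rng=0):
    """Score each candidate video by the best cosine similarity over the
    sampled text embeddings, sketching an inference pipeline that
    exploits the text mass rather than a single point."""
    samples = stochastic_text_embeddings(text_emb, radius, n_samples, rng)
    samples = samples / np.linalg.norm(samples, axis=-1, keepdims=True)
    videos = video_embs / np.linalg.norm(video_embs, axis=-1, keepdims=True)
    sims = samples @ videos.T          # shape: (n_samples, n_videos)
    return sims.max(axis=0)            # best sample per video
```

With a small radius the mass stays close to the original embedding, so rankings match the deterministic baseline; a larger radius lets the text cover more of the video's semantics, at the cost of looser matches.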
