
ProTA: Probabilistic Token Aggregation for Text-Video Retrieval (2404.12216v2)

Published 18 Apr 2024 in cs.CV

Abstract: Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling full spatial-temporal relations. However, since video clips contain more diverse content than their captions, a model that aligns these asymmetric video-text pairs runs a high risk of retrieving false positives. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction under content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimensional and high-dimensional spaces. We further propose token-based probabilistic alignment to generate token-level probabilistic representations and maintain feature diversity. In addition, an adaptive contrastive loss is proposed to learn a compact cross-modal distribution space. In extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).
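To make the abstract's core ideas concrete, below is a minimal PyTorch sketch of what token-level probabilistic alignment combined with a contrastive objective could look like: each token is mapped to a diagonal Gaussian (mean and log-variance heads) rather than a point embedding, samples are drawn via the reparameterization trick, and a symmetric InfoNCE loss aligns the pooled text and video embeddings. All module names, head designs, and pooling choices here are illustrative assumptions, not the authors' implementation; the paper's dual partial-related aggregation and adaptive loss weighting are not reproduced.

```python
# Illustrative sketch only: probabilistic token representations plus a
# symmetric contrastive loss, in the spirit of ProTA's abstract. Module
# names and pooling choices are assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def probabilistic_tokens(features, mu_head, logvar_head, n_samples=4):
    """Map point token features (B, N, D) to n_samples Gaussian draws."""
    mu = mu_head(features)                    # (B, N, D) token means
    logvar = logvar_head(features)            # (B, N, D) token log-variances
    eps = torch.randn(n_samples, *mu.shape)   # (S, B, N, D) noise
    # Reparameterization trick: mu + sigma * eps keeps sampling differentiable.
    return mu.unsqueeze(0) + eps * (0.5 * logvar).exp()


def symmetric_contrastive_loss(text_emb, video_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of pooled text/video embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature  # (B, B) similarities
    labels = torch.arange(len(logits))               # matched pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))


# Toy usage with random features standing in for CLIP token outputs.
B, N_t, N_v, D = 8, 16, 12, 64
mu_head = torch.nn.Linear(D, D)
logvar_head = torch.nn.Linear(D, D)

text_tokens = torch.randn(B, N_t, D)
video_tokens = torch.randn(B, N_v, D)

# Sampled token sets, then mean-pooled over samples and tokens.
text_emb = probabilistic_tokens(text_tokens, mu_head, logvar_head).mean(dim=(0, 2))
video_emb = probabilistic_tokens(video_tokens, mu_head, logvar_head).mean(dim=(0, 2))

loss = symmetric_contrastive_loss(text_emb, video_emb)
print(f"contrastive loss: {loss.item():.4f}")
```

The design intuition, as the abstract frames it, is that representing tokens as distributions rather than points lets the model express uncertainty about ambiguous or caption-irrelevant video content; drawing multiple samples before pooling preserves that representational diversity through the alignment step.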

Authors (8)
  1. Han Fang
  2. Xianghao Zang
  3. Chao Ban
  4. Zerun Feng
  5. Lanxiang Zhou
  6. Zhongjiang He
  7. Yongxiang Li
  8. Hao Sun