ProTA: Probabilistic Token Aggregation for Text-Video Retrieval (2404.12216v2)
Abstract: Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling whole spatial-temporal relations; however, because a video clip contains far more diverse content than its caption, a model that aligns these asymmetric video-text pairs runs a high risk of retrieving many false positives. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction under content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimensional and high-dimensional spaces, and token-based probabilistic alignment to generate token-level probabilistic representations that maintain the diversity of the feature representation. In addition, an adaptive contrastive loss is proposed to learn a compact cross-modal distribution space. In extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).
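The abstract's token-level probabilistic alignment and contrastive objective can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear Gaussian heads, the reparameterized sampling, the mean-pooling, and the symmetric InfoNCE loss are all assumptions standing in for the paper's actual dual partial-related aggregation and adaptive contrastive loss.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_tok, batch = 8, 6, 4

# Hypothetical linear heads mapping token features to Gaussian parameters
# (mean and log-variance); the paper's parameterization may differ.
w_mu = rng.standard_normal((dim, dim)) * 0.2
w_logvar = rng.standard_normal((dim, dim)) * 0.05

def pooled_prob_embedding(tokens):
    """Per-token Gaussian, reparameterized sample, mean-pool, l2-normalize."""
    mu = tokens @ w_mu
    logvar = tokens @ w_logvar
    # Sampling keeps per-token uncertainty in play instead of collapsing
    # each token to a single point in embedding space.
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    v = z.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def info_nce(v_emb, t_emb, temp=0.07):
    """Symmetric InfoNCE: matched video-text pairs are the positives."""
    logits = (v_emb @ t_emb.T) / temp
    idx = np.arange(len(logits))
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

# Toy batch: each video-text pair shares a latent "content" vector plus noise,
# a crude stand-in for the content asymmetry the abstract describes.
shared = rng.standard_normal((batch, dim))
video = np.stack([shared[i] + 0.1 * rng.standard_normal((n_tok, dim))
                  for i in range(batch)])
text = np.stack([shared[i] + 0.1 * rng.standard_normal((n_tok, dim))
                 for i in range(batch)])

v_emb = np.stack([pooled_prob_embedding(v) for v in video])
t_emb = np.stack([pooled_prob_embedding(t) for t in text])
loss = info_nce(v_emb, t_emb)
```

Minimizing a loss of this shape pulls each video's pooled embedding toward its own caption and away from the other captions in the batch; the probabilistic sampling is what distinguishes this family of methods from deterministic point embeddings.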
- Han Fang
- Xianghao Zang
- Chao Ban
- Zerun Feng
- Lanxiang Zhou
- Zhongjiang He
- Yongxiang Li
- Hao Sun