HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval (2103.15049v2)

Published 28 Mar 2021 in cs.CV and cs.AI

Abstract: Video-Text Retrieval has become a hot research topic with the growth of multimedia data on the internet. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) they make limited use of the transformer architecture's property that different layers capture features with different characteristics; 2) the end-to-end training mechanism limits negative sample interactions to a single mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative sample interactions on the fly, which contributes to the generation of more precise and discriminative representations. Experimental results on three major Video-Text Retrieval benchmark datasets demonstrate the advantages of our method.
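The Momentum Cross-modal Contrast described above borrows two mechanics from MoCo: a key encoder updated as a momentum-weighted average of the query encoder, and a FIFO queue that caches encoded keys so negatives are not limited to one mini-batch. The sketch below illustrates just those two mechanics in plain Python; it is not the authors' implementation, and all names, sizes, and constants are illustrative assumptions.

```python
from collections import deque

# Minimal sketch (not the HiT authors' code) of the MoCo-style mechanics
# behind Momentum Cross-modal Contrast: a momentum-updated key encoder
# plus a FIFO queue of cached negative keys. Constants are illustrative.

MOMENTUM = 0.999   # m: how slowly the key encoder tracks the query encoder
QUEUE_SIZE = 4     # K: number of cached negative keys (tiny for the demo)

def momentum_update(key_params, query_params, m=MOMENTUM):
    """theta_k <- m * theta_k + (1 - m) * theta_q, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def make_queue(size=QUEUE_SIZE):
    """FIFO negative queue: appending past `size` evicts the oldest key."""
    return deque(maxlen=size)

def enqueue(queue, new_key):
    queue.append(new_key)

if __name__ == "__main__":
    # One momentum step: each key parameter moves by (1 - m) toward the query.
    theta_k = momentum_update([0.0, 0.0], [1.0, 1.0])
    print(theta_k)              # each entry is approximately 0.001

    negatives = make_queue()
    for step in range(6):       # push more keys than the queue can hold
        enqueue(negatives, [float(step)])
    print(len(negatives))       # length stays capped at QUEUE_SIZE
```

Because the key encoder changes slowly (m close to 1), keys cached in the queue remain approximately consistent with the current encoder, which is what makes a large negative pool usable without recomputing it every step.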

Authors (6)
  1. Song Liu (159 papers)
  2. Haoqi Fan (33 papers)
  3. Shengsheng Qian (13 papers)
  4. Yiru Chen (10 papers)
  5. Wenkui Ding (13 papers)
  6. Zhongyuan Wang (105 papers)
Citations (136)
