HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval (2103.15049v2)

Published 28 Mar 2021 in cs.CV and cs.AI

Abstract: Video-Text Retrieval has become a hot research topic with the growth of multimedia data on the internet. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) they make limited use of the transformer architecture's property that different layers capture features with different characteristics; 2) the end-to-end training mechanism limits negative sample interactions to a single mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative sample interactions on the fly, which contributes to the generation of more precise and discriminative representations. Experimental results on three major Video-Text Retrieval benchmark datasets demonstrate the advantages of our method.
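The Momentum Cross-modal Contrast described above borrows two mechanics from MoCo: a key encoder updated as a momentum-weighted average of the query encoder, and a FIFO queue that caches encoded keys so negatives are not limited to one mini-batch. The sketch below illustrates just those two mechanics in plain Python; it is not the authors' implementation, and all names, sizes, and constants are illustrative assumptions.

```python
from collections import deque

# Minimal sketch (not the HiT authors' code) of the MoCo-style mechanics
# behind Momentum Cross-modal Contrast: a momentum-updated key encoder
# plus a FIFO queue of cached negative keys. Constants are illustrative.

MOMENTUM = 0.999   # m: how slowly the key encoder tracks the query encoder
QUEUE_SIZE = 4     # K: number of cached negative keys (tiny for the demo)

def momentum_update(key_params, query_params, m=MOMENTUM):
    """theta_k <- m * theta_k + (1 - m) * theta_q, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def make_queue(size=QUEUE_SIZE):
    """FIFO negative queue: appending past `size` evicts the oldest key."""
    return deque(maxlen=size)

def enqueue(queue, new_key):
    queue.append(new_key)

if __name__ == "__main__":
    # One momentum step: each key parameter moves by (1 - m) toward the query.
    theta_k = momentum_update([0.0, 0.0], [1.0, 1.0])
    print(theta_k)              # each entry is approximately 0.001

    negatives = make_queue()
    for step in range(6):       # push more keys than the queue can hold
        enqueue(negatives, [float(step)])
    print(len(negatives))       # length stays capped at QUEUE_SIZE
```

Because the key encoder changes slowly (m close to 1), keys cached in the queue remain approximately consistent with the current encoder, which is what makes a large negative pool usable without recomputing it every step.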

Authors (6)
  1. Song Liu (159 papers)
  2. Haoqi Fan (33 papers)
  3. Shengsheng Qian (13 papers)
  4. Yiru Chen (10 papers)
  5. Wenkui Ding (13 papers)
  6. Zhongyuan Wang (105 papers)
Citations (136)
