Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval (2308.13820v1)

Published 26 Aug 2023 in cs.IR

Abstract: Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. Generally, the information from each modality exhibits distinct representations and semantic information, which makes feature tends to be in separate latent spaces encoded with dual-tower architecture and makes it difficult to establish semantic relationships between modalities, resulting in poor retrieval performance. To address this issue, we propose a novel framework for cross-modal retrieval which consists of a cross-modal mixer, a masked autoencoder for pre-training, and a cross-modal retriever for downstream tasks.In specific, we first adopt cross-modal mixer and mask modeling to fuse the original modality and eliminate redundancy. Then, an encoder-decoder architecture is applied to achieve a fuse-then-separate task in the pre-training phase.We feed masked fused representations into the encoder and reconstruct them with the decoder, ultimately separating the original data of two modalities. In downstream tasks, we use the pre-trained encoder to build the cross-modal retrieval method. Extensive experiments on 2 real-world datasets show that our approach outperforms previous state-of-the-art methods in video-audio matching tasks, improving retrieval accuracy by up to 2 times. Furthermore, we prove our model performance by transferring it to other downstream tasks as a universal model.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (6)
  1. Zichen Yuan (1 paper)
  2. Qi Shen (41 papers)
  3. Bingyi Zheng (1 paper)
  4. Yuting Liu (62 papers)
  5. Linying Jiang (7 papers)
  6. Guibing Guo (35 papers)