Cross-Modal Discrete Representation Learning (2106.05438v1)

Published 10 Jun 2021 in cs.CV

Abstract: Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work we present a self-supervised learning framework that is able to learn a representation that captures finer levels of granularity across different modalities such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space such that cross-modal objects/actions localization can be performed without direct supervision. In our experiments we show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.
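
The abstract describes two core components: a single vector-quantized codebook shared across modalities, and a Cross-Modal Code Matching objective that pushes paired views toward similar distributions over the discrete codes. A minimal PyTorch sketch of how these pieces could fit together follows; it is not the authors' released implementation, so the names (`quantize`, `code_distribution`, `cross_modal_code_matching`, `CODEBOOK_SIZE`, `EMBED_DIM`) and the symmetric-KL form of the matching loss are assumptions for illustration, and the standard VQ commitment/codebook losses are omitted.

```python
import torch

# Illustrative sketch only: sizes and function names are assumptions,
# not the paper's released code.
CODEBOOK_SIZE, EMBED_DIM = 512, 256
codebook = torch.nn.Embedding(CODEBOOK_SIZE, EMBED_DIM)  # shared by all modalities

def quantize(z):
    """Snap continuous features z of shape (batch, seq, dim) to their
    nearest shared-codebook entries, with a straight-through estimator
    so gradients still reach the encoder. The usual VQ commitment and
    codebook losses are omitted for brevity."""
    flat = z.reshape(-1, EMBED_DIM)
    dists = torch.cdist(flat, codebook.weight)        # (N, CODEBOOK_SIZE)
    codes = dists.argmin(dim=-1)                      # nearest code per position
    z_q = codebook(codes).reshape(z.shape)
    return z + (z_q - z).detach(), codes.reshape(z.shape[:-1])

def code_distribution(z, temperature=1.0):
    """Soft-assign every position to the codes and average over positions,
    yielding one distribution over the discrete space per sample."""
    sq_dist = (z.pow(2).sum(-1, keepdim=True)
               - 2 * z @ codebook.weight.t()
               + codebook.weight.pow(2).sum(-1))      # (B, T, K) squared distances
    probs = (-sq_dist / temperature).softmax(dim=-1)
    return probs.mean(dim=1)                          # (B, K)

def cross_modal_code_matching(z_a, z_b, eps=1e-8):
    """Encourage two paired views (e.g. video frames and the matching audio)
    to induce similar code distributions. A symmetric KL divergence is one
    plausible instantiation of the matching objective."""
    p, q = code_distribution(z_a), code_distribution(z_b)
    kl = lambda a, b: (a * ((a + eps).log() - (b + eps).log())).sum(-1)
    return 0.5 * (kl(p, q) + kl(q, p)).mean()
```

In training, a loss like `cross_modal_code_matching` would be added on top of the usual VQ and cross-modal retrieval objectives; because both modalities index the same codebook, individual codes can come to represent one semantic concept across modalities, which is what enables the unsupervised object/action localization described above.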

Authors (6)
  1. Alexander H. Liu (32 papers)
  2. SouYoung Jin (11 papers)
  3. Cheng-I Jeff Lai (9 papers)
  4. Andrew Rouditchenko (21 papers)
  5. Aude Oliva (42 papers)
  6. James Glass (173 papers)
Citations (38)
