Lightweight Cross-Modal Representation Learning (2403.04650v3)

Published 7 Mar 2024 in cs.LG and cs.AI

Abstract: Low-cost cross-modal representation learning is crucial for deriving semantic representations across diverse modalities such as text, audio, images, and video. Traditional approaches typically depend on large specialized models trained from scratch, requiring extensive datasets and resulting in high resource and time costs. To overcome these challenges, we introduce a novel approach named Lightweight Cross-Modal Representation Learning (LightCRL). This method uses a single neural network, the Deep Fusion Encoder (DFE), which projects data from multiple modalities into a shared latent representation space. This reduces the overall parameter count while still delivering robust performance comparable to more complex systems.
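
A minimal PyTorch sketch of the idea described in the abstract: one shared encoder, standing in for the Deep Fusion Encoder (DFE), maps features from several modalities into a common latent space via thin per-modality adapters. The adapter and trunk layer sizes, the unit normalization, and the cross-modal similarity computation at the end are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepFusionEncoder(nn.Module):
    """Hypothetical sketch: a single shared encoder reused across modalities."""

    def __init__(self, modality_dims, latent_dim=256, hidden_dim=512):
        super().__init__()
        # Thin per-modality adapters bring raw features to a common width.
        self.adapters = nn.ModuleDict(
            {name: nn.Linear(dim, hidden_dim) for name, dim in modality_dims.items()}
        )
        # One shared trunk serves every modality, which is what keeps the
        # parameter count low compared to one full encoder per modality.
        self.shared = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, x, modality):
        z = self.shared(self.adapters[modality](x))
        return F.normalize(z, dim=-1)  # unit-norm vectors, so dot product = cosine


# Usage: embed pre-extracted text (e.g. 768-d) and image (e.g. 2048-d) features.
dfe = DeepFusionEncoder({"text": 768, "image": 2048})
text_z = dfe(torch.randn(8, 768), "text")
image_z = dfe(torch.randn(8, 2048), "image")
similarity = text_z @ image_z.T  # 8 x 8 cross-modal similarity matrix
```

Because the trunk is shared, adding a new modality only costs one additional adapter layer rather than a full encoder, which is consistent with the parameter savings the abstract claims.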

Authors (4)
  1. Bilal Faye (10 papers)
  2. Hanane Azzag (18 papers)
  3. Mustapha Lebbah (30 papers)
  4. Djamel Bouchaffra (6 papers)