- The paper presents a novel, lightweight framework that incrementally aligns modalities using fixed, pretrained encoders and efficient alignment layers.
- It introduces a Universal Projection module and Alignment Layer to project diverse modality features into a unified embedding space with minimal training.
- The framework achieves strong performance in zero-shot classification, cross-modal retrieval, and VQA while significantly reducing computational costs.
"OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities"
Introduction
"OneEncoder" proposes an efficient and scalable framework for cross-modal alignment, addressing critical limitations in current multimodal models that require extensive computational resources and large datasets for training. The paper introduces a lightweight alignment framework that incrementally aligns modalities using pretrained modality-specific encoders, a Universal Projection (UP) module, and an Alignment Layer (AL).
Figure 1: Comparison of modality alignment methods: standard cross-modal training vs. OneEncoder. The standard approach aligns modalities by jointly training modality-specific encoders. OneEncoder instead uses frozen, pretrained encoders with a lightweight Universal Projection (UP) module trained on two modalities. For new modalities, the UP stays frozen and only the Alignment Layer is trained. Modality tokens enable efficient switching between modalities. In the same way, video can be aligned with the other modalities (image, text, audio).
Model Architecture
The OneEncoder architecture consists of several key components: modality-specific encoders, a Universal Projection (UP) module, and an Alignment Layer (AL), each playing a vital role in the progressive alignment framework.
Modality-Specific Encoders
OneEncoder employs frozen pretrained models for each modality, such as ViT for images, BERT for text, Wav2Vec2 for audio, and VideoMAE for video. These models extract modality-specific features, maintaining computational efficiency by eliminating the need to train these large encoders from scratch.
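The snippet below is a minimal sketch (not the authors' code) of this setup using the Hugging Face transformers library; the checkpoint names are illustrative placeholders.

```python
# Sketch: frozen, pretrained modality-specific encoders.
# Checkpoint names are illustrative, not the paper's exact choices.
import torch
from transformers import ViTModel, BertModel, Wav2Vec2Model

image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Freeze every encoder: only the lightweight UP and AL modules are trained.
for enc in (image_encoder, text_encoder, audio_encoder):
    for p in enc.parameters():
        p.requires_grad = False
    enc.eval()
```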
Universal Projection Module
The UP module is designed to project features from multiple modalities into a unified embedding space. It utilizes transformer layers to capture inter-modal relationships and aligns image-text modalities initially. The UP incorporates modality tokens to facilitate seamless transitions and alignments across modalities, providing a coherent latent space for multimodal tasks.
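Below is one possible PyTorch sketch of such a UP module; the per-modality input projections, the form of the modality tokens, and the transformer depth are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class UniversalProjection(nn.Module):
    """Sketch of a UP module: per-modality input projections, learned
    modality tokens, and a shared transformer over the projected features."""

    def __init__(self, input_dims: dict, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        # One linear projection per modality (dimensions are assumptions).
        self.input_proj = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in input_dims.items()}
        )
        # Learned modality tokens used to switch between modalities.
        self.modality_tokens = nn.ParameterDict(
            {name: nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
             for name in input_dims}
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, features: torch.Tensor, modality: str) -> torch.Tensor:
        # features: (batch, seq_len, input_dims[modality]) from a frozen encoder
        x = self.input_proj[modality](features)
        token = self.modality_tokens[modality].expand(x.size(0), -1, -1)
        x = torch.cat([token, x], dim=1)   # prepend the modality token
        x = self.encoder(x)
        return x[:, 0]                     # pooled embedding in the shared space
```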

Figure 2: Step 1: Training the Lightweight UP and Aligning Image-Text Modalities
Alignment Layer
Once the initial modalities are aligned, subsequent modalities are integrated through the AL. This lightweight MLP keeps the alignment step simple, allowing new modalities to be added efficiently without retraining the rest of the framework. The AL projects features of a new modality into the shared alignment space established by the pretrained UP.
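A minimal sketch of such an AL, assuming a two-layer MLP whose hidden width and activation are chosen purely for illustration:

```python
import torch.nn as nn

class AlignmentLayer(nn.Module):
    """Sketch of an AL: a small MLP mapping a new modality's frozen-encoder
    features into the input space expected by the (frozen) UP module."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.mlp(x)
```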
Training Procedure
The training process in OneEncoder follows a two-step approach: initial alignment and progressive alignment.
Initial Alignment
During the first step, the UP is trained with a contrastive objective on abundant aligned data (e.g., image-text pairs), pulling paired samples together in the embedding space while pushing unpaired samples apart. This defines a common semantic space for the initial modalities and preserves the mutual information between paired inputs.
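The paper's exact objective is not reproduced here; the sketch below shows a standard CLIP-style symmetric InfoNCE loss as one plausible instantiation of this contrastive step.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings (e.g., image-text)."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_ab = F.cross_entropy(logits, targets)          # a -> b direction
    loss_ba = F.cross_entropy(logits.t(), targets)      # b -> a direction
    return (loss_ab + loss_ba) / 2
```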
Progressive Alignment
In subsequent steps, the pretrained UP remains fixed, and only the AL is updated. This setup allows OneEncoder to incorporate new modalities with minimal training cost by aligning them through already-aligned modalities, such as using text to align audio and video via text-audio or text-video datasets.
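A rough sketch of this progressive step for audio, reusing the components above. The paired text-audio loader and the choice to route the AL output through the UP's text pathway are assumptions of the sketch; the paper relies on modality tokens for this switching.

```python
# Assumes `up`, `alignment_layer`, `text_encoder`, `audio_encoder`,
# `contrastive_loss`, and a paired `text_audio_loader` as sketched above.
import torch

up.requires_grad_(False)                     # the pretrained UP stays frozen
optimizer = torch.optim.AdamW(alignment_layer.parameters(), lr=1e-4)

for audio_inputs, text_inputs in text_audio_loader:
    with torch.no_grad():
        audio_feats = audio_encoder(**audio_inputs).last_hidden_state
        text_feats = text_encoder(**text_inputs).last_hidden_state
        text_emb = up(text_feats, modality="text")       # anchor embedding

    # AL maps audio features into the feature space the frozen UP expects
    # (here: the text pathway, an assumption made for this sketch).
    audio_emb = up(alignment_layer(audio_feats), modality="text")

    loss = contrastive_loss(audio_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()                                      # only the AL is updated
    optimizer.step()
```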
Applications and Evaluation
Zero-Shot Classification
OneEncoder delivers strong zero-shot image classification results across various datasets, outperforming traditional models that retrain all modality-specific encoders.
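As an illustration, zero-shot classification could be run with the components sketched above by embedding one text prompt per class and picking the class whose prompt is closest to the image embedding; the prompt template, tokenizer, image processor, and input image are placeholders.

```python
# Zero-shot classification sketch; `tokenizer`, `image_processor`, and
# `image` are assumed to exist alongside the encoders and UP from above.
import torch
import torch.nn.functional as F

class_names = ["cat", "dog", "airplane"]
prompts = [f"a photo of a {c}" for c in class_names]

with torch.no_grad():
    txt = tokenizer(prompts, padding=True, return_tensors="pt")
    txt_emb = up(text_encoder(**txt).last_hidden_state, modality="text")

    pixels = image_processor(images=image, return_tensors="pt").pixel_values
    img_emb = up(image_encoder(pixel_values=pixels).last_hidden_state,
                 modality="image")

    sims = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()
    prediction = class_names[sims.argmax(dim=-1).item()]
```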
Cross-Modal Retrieval
OneEncoder also performs well on cross-modal retrieval between text, audio, and video, with alignment quality validated through metrics such as Top-1 precision and recall.
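As a small illustration, Top-1 precision for such retrieval can be computed from row-aligned query and gallery embeddings (assumed to come from the pipeline sketched above):

```python
import torch
import torch.nn.functional as F

def top1_precision(query_emb: torch.Tensor, gallery_emb: torch.Tensor) -> float:
    """Fraction of queries whose nearest gallery item is the true pair
    (row i of the queries is assumed to pair with row i of the gallery)."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    best = (q @ g.t()).argmax(dim=-1)                   # nearest gallery index
    targets = torch.arange(q.size(0), device=q.device)  # ground-truth pairing
    return (best == targets).float().mean().item()
```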
Figure 3: Image processing
Visual Question Answering
Beyond retrieval, OneEncoder supports Visual Question Answering (VQA): the same lightweight alignment strategy supplies the cross-modal reasoning signals required for accurate answers.
Conclusion
OneEncoder introduces an innovative approach to multimodal integration, characterized by its lightweight design and progressive alignment strategy. By leveraging pretrained modality-specific encoders and focusing training on the UP and AL, OneEncoder achieves strong performance across diverse multimodal tasks with significantly reduced computational costs. This framework is poised to expand into more complex domains and can be adapted for high-demand tasks like open-vocabulary object detection, highlighting its potential for efficient and scalable multimodal learning.