FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens (2506.03096v1)

Published 3 Jun 2025 in cs.CV and cs.LG

Abstract: Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.

Summary

  • The paper introduces FuseLIP, a novel multimodal embedding method that fuses discrete tokens early for enhanced image-text integration.
  • It employs a single transformer encoder with both contrastive and masked modeling losses to improve efficiency and capture modality interactions.
  • Evaluation shows FuseLIP outperforming late fusion baselines on multimodal tasks such as VQA and text-guided image transformation retrieval, while remaining comparable on unimodal tasks.

FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

The paper presents FuseLIP, a method for multimodal embeddings that offers an alternative to contrastive language-image pre-training (CLIP) models, yielding clear gains on tasks that require fusing visual and textual inputs. FuseLIP adopts an early fusion strategy that encodes image-text pairs as discrete tokens processed by a single transformer, departing from the traditional CLIP-style design of separate encoders for each modality.

Architectural Innovations

FuseLIP relies on discrete image tokenizers, mapping images and text into a shared token vocabulary. This early fusion lets the modalities interact at every layer of encoding, in contrast to late fusion methods that merge unimodal features only after they have been processed independently. The architecture is a single transformer that handles text and image tokens within the same sequence. Using one encoder also makes it straightforward to combine contrastive and masked modeling objectives, improving computational efficiency and simplifying training. A minimal sketch of this forward pass follows.
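
The sketch below illustrates the early fusion idea under assumed names and hyperparameters (EarlyFusionEncoder, the vocabulary sizes, mean pooling); it is not the authors' implementation. Discrete image token ids from a pretrained tokenizer are offset into an extended vocabulary shared with text, and one transformer encoder processes the concatenated sequence.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Single-encoder early fusion over an extended text+image token vocabulary (illustrative)."""

    def __init__(self, text_vocab=32000, image_vocab=8192, dim=512,
                 depth=12, heads=8, max_len=1024):
        super().__init__()
        # Image token ids are shifted past the text vocabulary, so one embedding
        # table covers both modalities.
        self.image_offset = text_vocab
        self.tok_emb = nn.Embedding(text_vocab + image_vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_ids, image_ids):
        # text_ids: (B, T_txt) text token ids; image_ids: (B, T_img) discrete codes
        # produced by an image tokenizer (e.g. a VQ model).
        tokens = torch.cat([text_ids, image_ids + self.image_offset], dim=1)
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        x = self.encoder(x)              # text and image tokens attend to each other at every layer
        return self.norm(x.mean(dim=1))  # pooled multimodal embedding
```

Because both modalities share a single embedding table and attention stack, image and text tokens can exchange information from the first layer onward, which is the core distinction from late fusion pipelines.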

Training Methodology

FuseLIP is trained with both a contrastive loss and a masked multimodal modeling (MMM) loss. The contrastive loss aligns image-text pairs in a shared latent space and follows the SigLIP formulation. The MMM loss predicts masked discrete tokens, benefiting from the increased cross-modal interaction that early fusion provides. This combination allows effective training on both unimodal and multimodal datasets, improving the model's robustness and versatility. A hedged sketch of the combined objective is given below.
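
The following sketch shows one plausible way to combine the two signals: a SigLIP-style sigmoid contrastive loss on pooled embeddings plus a cross-entropy loss on masked token predictions over the extended vocabulary. The function names, the loss weight, and the temperature/bias values are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn.functional as F

def siglip_style_loss(emb_a, emb_b, temperature, bias):
    # emb_a, emb_b: (B, D) L2-normalized embeddings of the two sides of each pair.
    logits = emb_a @ emb_b.t() * temperature + bias                  # (B, B) pairwise scores
    labels = 2 * torch.eye(emb_a.size(0), device=emb_a.device) - 1   # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

def masked_modeling_loss(token_logits, target_ids, mask):
    # token_logits: (B, L, V) predictions over the extended text+image vocabulary;
    # the loss is computed only at masked positions (mask is a boolean (B, L) tensor).
    return F.cross_entropy(token_logits[mask], target_ids[mask])

def total_loss(emb_a, emb_b, token_logits, target_ids, mask,
               temperature=10.0, bias=-10.0, mmm_weight=1.0):
    # Combined objective: contrastive alignment plus masked token prediction (assumed weighting).
    return (siglip_style_loss(emb_a, emb_b, temperature, bias)
            + mmm_weight * masked_modeling_loss(token_logits, target_ids, mask))
```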

Evaluation and Results

FuseLIP delivers strong results on multimodal tasks such as visual question answering (VQA) and text-guided image transformation retrieval, often outperforming models that rely on late fusion, while staying comparable to baselines on unimodal tasks such as image classification. In text-guided transformation retrieval in particular, FuseLIP shows large accuracy gains, indicating that it captures fine-grained relationships between the image and text modalities. The evaluation spans a mix of synthetic and real-world datasets, supporting FuseLIP's ability to handle diverse data sources. The authors attribute this success to modality interaction at every layer of encoding rather than only at the final stage, as is common in alternative methods.

Contributions and Future Directions

The paper offers several key contributions:

  • It presents a new early fusion approach for multimodal embeddings, enhancing vision-language alignment and maintaining zero-shot capabilities.
  • It bridges the gap between contrastive and masked modeling through an efficient unified architecture.
  • It highlights the critical role of early fusion and hard negative examples in capturing the nuances of multimodal tasks.

The insights from FuseLIP prompt further exploration of early fusion techniques and discrete token modeling in AI, particularly in developing large-scale language and vision models that require high adaptability and efficiency. Future research could investigate scalability aspects of FuseLIP, exploring how increased model size or training data might impact performance and inference speed.

Overall, FuseLIP represents a step forward in multimodal embedding strategies, suggesting that early modal interaction provides substantial benefits in capturing the complexity of image-text relationships.
