FuseLIP: Early Fusion for Vision-Language Tasks
- FuseLIP is a multimodal embedding architecture that fuses image and text tokens early in a unified transformer to obtain joint representations.
- It employs a frozen discrete image tokenizer and bidirectional transformer layers for effective cross-modal self-attention at every encoding step.
- Empirical results show FuseLIP outperforms traditional dual-encoder and late-fusion models in fine-grained vision-language tasks.
FuseLIP is a multimodal embedding architecture that implements early fusion of discrete text and image tokens within a unified transformer encoder. FuseLIP departs from conventional dual-encoder (two-tower) contrastive language-image pre-training (CLIP-style) frameworks, enabling direct processing of concatenated image and text token sequences to obtain a joint representation. By leveraging discrete image tokenization and a single bidirectional transformer, FuseLIP facilitates cross-modal attention at every encoding layer, yielding improved performance on fine-grained vision-language tasks compared to late-fusion strategies (Schlarmann et al., 3 Jun 2025).
1. Model Architecture
FuseLIP employs a frozen discrete image tokenizer (TiTok), mapping each input image to a sequence of 128 integer tokens drawn from an image-specific sub-vocabulary $V_{\text{img}}$. In parallel, text inputs are tokenized via byte-pair encoding (BPE), producing tokens from a disjoint text sub-vocabulary $V_{\text{txt}}$. The resulting token sequence consists of a special <bot> token, the image tokens, an <eot> token, the text tokens, and a final <eot>:

$$s = \big(\texttt{<bot>},\; t^{\text{img}}_1, \dots, t^{\text{img}}_{128},\; \texttt{<eot>},\; t^{\text{txt}}_1, \dots, t^{\text{txt}}_m,\; \texttt{<eot>}\big)$$
This sequence is passed through learned token and positional embeddings, which are summed to produce the layer-0 input of a shared bidirectional transformer encoder (no causal masking). At each layer, cross-modal self-attention operates over the joint token stream. The final multimodal embedding is the $\ell_2$-normalized output vector at the position of the last <eot> token after the final transformer block. This early-fusion topology contrasts with score-fusion and MagicLens-style late-fusion baselines, which restrict cross-modal integration to post-encoding modules (Schlarmann et al., 3 Jun 2025).
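As an illustration, the joint input sequence can be assembled as below. The special-token ids, vocabulary offsets, and codebook size are assumptions made for this sketch, not values given by the paper:

```python
# Sketch of FuseLIP's early-fusion input construction.
# Special-token ids, vocabulary offsets, and codebook size are assumed.
BOT, EOT = 0, 1                  # <bot>, <eot> (ids assumed)
IMG_OFFSET = 2                   # image sub-vocabulary starts here (assumed)
TXT_OFFSET = IMG_OFFSET + 4096   # disjoint text sub-vocabulary (size assumed)

def build_sequence(img_tokens, txt_tokens):
    """<bot> + 128 image tokens + <eot> + text tokens + <eot>."""
    assert len(img_tokens) == 128, "TiTok emits a fixed 128-token code"
    img = [IMG_OFFSET + t for t in img_tokens]
    txt = [TXT_OFFSET + t for t in txt_tokens]
    return [BOT] + img + [EOT] + txt + [EOT]

seq = build_sequence(list(range(128)), [5, 9, 13])
```

The disjoint offsets ensure image and text tokens index non-overlapping rows of the shared embedding matrix, so the encoder can distinguish the two modalities from the token ids alone.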
2. Input Representation and Embedding
Formally, given token indices $s_1, \dots, s_n$ with each $s_i$ an index into the joint vocabulary, the input representation at position $i$ is computed as:

$$h^{(0)}_i = E[s_i] + p_i,$$

where $E$ is the shared token embedding matrix and $p_i$ is the learned position embedding. At each transformer layer $\ell = 1, \dots, L$, the representation is updated as:

$$h^{(\ell)} = \mathrm{Block}_\ell\big(h^{(\ell-1)}\big).$$

The final output used as the multimodal embedding is the normalized vector at the last <eot> position:

$$z = \frac{h^{(L)}_{i^\ast}}{\big\lVert h^{(L)}_{i^\ast} \big\rVert_2},$$

where $i^\ast$ is the position of the last <eot> token.
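Under these definitions, the embedding computation can be sketched as follows. The dimensions, random weights, and the omitted transformer body are placeholders; only the embedding sum and the normalized readout mirror the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, n = 6000, 512, 134         # illustrative sizes, not the paper's
E = rng.standard_normal((vocab_size, d))  # shared token embedding matrix E
P = rng.standard_normal((n, d))           # learned positional embeddings p_i

seq = rng.integers(0, vocab_size, size=n)
h = E[seq] + P                            # layer-0 input h^(0)
# ...L bidirectional transformer blocks would update h here...

i_star = n - 1                            # position of the last <eot> (assumed)
z = h[i_star] / np.linalg.norm(h[i_star]) # l2-normalized multimodal embedding
```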
3. Training Objectives and Loss Functions
FuseLIP is trained end-to-end (excluding the frozen tokenizer) with a composite objective comprising a SigLIP-style contrastive loss and a masked multimodal modeling (MMM) loss:
- SigLIP Contrastive Loss: For a batch of $B$ pairs with embeddings $x_i$ (image side) and $y_j$ (text side), and labels $z_{ij} = 1$ if $x_i$ and $y_j$ correspond, $z_{ij} = -1$ otherwise, the loss is:

$$\mathcal{L}_{\text{SigLIP}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{B} \log \sigma\big(z_{ij}\,(t\, x_i \cdot y_j + b)\big),$$

with learnable temperature $t$ and bias $b$, and the encoder producing $\ell_2$-normalized vectors.
- Masked Multimodal Modeling (MMM) Loss: With token masking probability $p$, let $M$ denote the set of masked token positions and $s_i$ the corresponding true labels; using a shared classifier head $g$:

$$\mathcal{L}_{\text{MMM}} = \frac{1}{|M|} \sum_{i \in M} \mathrm{CE}\big(g(h^{(L)}_i),\, s_i\big),$$

where $\mathrm{CE}$ is the cross-entropy loss.
The final optimization objective is the sum of the two terms:

$$\mathcal{L} = \mathcal{L}_{\text{SigLIP}} + \mathcal{L}_{\text{MMM}}$$
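Both loss terms can be sketched compactly under the standard SigLIP formulation; the temperature $t$ and bias $b$ are learnable during training, and the batch values below are purely illustrative:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid contrastive loss over l2-normalized embeddings."""
    B = img_emb.shape[0]
    logits = t * img_emb @ txt_emb.T + b       # (B, B) similarity logits
    labels = 2.0 * np.eye(B) - 1.0             # +1 on matching pairs, -1 elsewhere
    nll = np.logaddexp(0.0, -labels * logits)  # -log sigma(labels * logits)
    return float(nll.sum(axis=1).mean())

def mmm_loss(logits, targets, masked_positions):
    """Cross-entropy over masked token positions (shared classifier assumed)."""
    losses = []
    for i in masked_positions:
        z = logits[i] - logits[i].max()        # numerically stable log-softmax
        log_probs = z - np.log(np.exp(z).sum())
        losses.append(-log_probs[targets[i]])
    return float(np.mean(losses))

# Toy batch: perfectly aligned unit embeddings and confident mask predictions.
emb = np.eye(4)
cls_logits = np.full((4, 10), -5.0)
targets = [3, 7, 1, 0]
for i, tgt in enumerate(targets):
    cls_logits[i, tgt] = 5.0

total = siglip_loss(emb, emb) + mmm_loss(cls_logits, targets, [0, 2])
```

With aligned embeddings the contrastive term approaches $\log 2$ per row (matching logits sit at $t + b = 0$), and confident mask predictions drive the MMM term toward zero.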
4. Datasets and Pre-training
FuseLIP is trained using both unimodal (image–text pairs) and generated multimodal datasets. The training sources and their respective data modalities are summarized as follows:
| Dataset | I→T (img→txt) | IT→I (prompted img) | IT→T (prompted txt) |
|---|---|---|---|
| CC3M | 2.6M | 0 | 0 |
| CC12M | 10.6M | 0 | 0 |
| CC3M-TGIT | 0.3M | 0.3M | 0 |
| CC3M-VQA | 0 | 0 | 2.4M |
| VG-VQA | 0 | 0 | 0.7M |
| VG-Crop | 0 | 5.4M | 0 |
| HQ-Edit | 0 | 0.3M | 0.3M |
- CC3M/CC12M: Standard captioned image-text datasets.
- CC3M/CC12M-TGIT: Text-guided image transformation data (e.g., cropping, rotation).
- CC3M-VQA, VG-VQA: Synthetic VQA using Llama-3 Instruct and Visual Genome.
- VG-Crop: Grounding tasks from Visual Genome region captions.
- HQ-Edit: Scripted image edits with known inverses.
Hard negatives (e.g., multiple transformed versions of the same image, alternative region captions, inverse edits) are included in each batch to make contrastive training more robust and to sharpen task alignment (Schlarmann et al., 3 Jun 2025).
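The hard-negative idea can be illustrated with a toy batch builder; the transform names and caption scheme are assumptions for this sketch, not the paper's exact pipeline:

```python
# Toy hard-negative batch: several transformed views of one image, each paired
# with a caption naming its transformation, so the other views act as
# in-batch hard negatives. Transform names and captions are assumed.
TRANSFORMS = ["identity", "crop", "rotate", "hflip"]

def batch_with_hard_negatives(image_id):
    return [(f"{image_id}|{t}", f"the image after {t}") for t in TRANSFORMS]

batch = batch_with_hard_negatives("img_0001")
```

Because every caption in the batch differs only in the named transformation, the contrastive loss can only be minimized by actually attending to the image content, not the shared caption template.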
5. Evaluation and Empirical Results
FuseLIP is evaluated on a suite of downstream tasks, including zero-shot image classification (e.g., ImageNet-1k, CIFAR, Places), visual question answering (OK-VQA, GQA, Visual7W, TextVQA), retrieval (MSCOCO-t2i/i2t, CIRR, FashionIQ), grounding (RefCOCO, Visual7W pointing), and specifically constructed tasks for multimodal evaluation such as OI-Crop, OI-Pos, VG-Crop, and TGIT.
| Model | Classification | VQA | Retrieval | Grounding | VG-Crop | OI-Crop | OI-Pos | TGIT |
|---|---|---|---|---|---|---|---|---|
| FuseLIP-B+MMM (CC3M+MM) | 23.3 | 17.5 | 15.0 | 82.4 | 55.8 | 68.1 | 70.8 | 94.3 |
| FuseLIP-B+MMM (CC12M+MM) | 31.2 | 19.8 | 26.2 | 82.3 | 32.7 | 61.5 | 71.3 | 94.2 |
Key findings include:
- FuseLIP-B (+MMM) achieves the top accuracy in the majority of multimodal tasks, surpassing score-fusion SigLIP-SF and transformer-fusion MLF baselines, even with fewer trainable parameters.
- The most significant improvements are observed on tasks (such as TGIT, crop/rotate/flip subtasks) requiring explicit, integrated vision-language reasoning, where early fusion is essential.
- Ablation studies indicate a critical dependence on hard negatives (removal degrades certain tasks by up to 80%) and the Masked Multimodal Modeling loss (removal results in 2ā4% accuracy drops, especially for multimodal tasks) (Schlarmann et al., 3 Jun 2025).
6. Comparison with Prior Approaches
Traditional CLIP-derived models utilize either score fusion (embedding summation) or a shallow fusion network (MLF) for late integration of separately encoded modalities, limiting cross-modal exchange to the topmost layers. FuseLIP, by contrast, achieves early fusion: cross-modal self-attention is enabled from the very first transformer layer. Empirical results demonstrate that this architecture produces richer, more structurally aligned embeddings for multimodal tasks requiring fine-grained feature fusion. The early-fusion approach also mitigates the information bottleneck present in late-fusion settings and allows gradient-based learning to jointly inform both image and text modalities (Schlarmann et al., 3 Jun 2025).
7. Implementation Details, Limitations, and Prospects
Model Sizes and Configurations:
- FuseLIP-S: TiTok-S (128 tokens), 12-layer ViT-S (d = 384, 6 heads), 42M trainable + 25M frozen parameters.
- FuseLIP-B: TiTok-BL (128), 12-layer ViT-B (d = 512, 8 heads), 65M trainable + 86M frozen parameters.
- Baselines employ OpenCLIP ViT-S/B, with additional 4-layer fusion modules where relevant.
Optimization and Infrastructure:
- Optimizer: AdamW
- Learning rate schedule: cosine decay with 12k warmup steps
- Batch size: 2048
- Training spans 8 epochs (CC3M+MM) or 16 epochs (CC12M+MM), image resolution 256, context length 180
- Memory consumption is ~11GB (FuseLIP-S) vs. ~19GB (baselines). Training is ~20% faster due to the frozen tokenizer.
- Implemented in PyTorch (using OpenCLIP), trained on NVIDIA A100 GPUs.
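For reference, the reported hyperparameters can be collected into a single configuration sketch (the dict layout is illustrative; the base learning rate and AdamW betas are not recoverable from this summary and are omitted):

```python
# Reported FuseLIP training configuration (layout illustrative).
config = {
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "warmup_steps": 12_000,
    "batch_size": 2048,
    "image_resolution": 256,
    "context_length": 180,
    "epochs": {"CC3M+MM": 8, "CC12M+MM": 16},
}
```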
Limitations and Future Directions:
- Current results are confined to academic-scale corpora (CC3M/CC12M); scalability to 100M+ captions and the impact on performance remain unexplored.
- The two-stage process (frozen image tokenizer → transformer) introduces higher inference latency compared to direct vision encoders, though this may diminish with larger model widths or future optimized tokenizers.
- Potential extensions include multimodal chains (interleaved images and text), video modeling, group-wise retrieval, and integration with multimodal LLMs (Schlarmann et al., 3 Jun 2025).
In summary, FuseLIP demonstrates that replacing late-fusion architectures with a frozen discrete image tokenizer coupled to a unified transformer encoder substantially enhances performance on multimodal reasoning tasks, providing a robust foundation for future vision-language representation learning.