FuseLIP: Early Fusion for Vision-Language Tasks
- FuseLIP is a multimodal embedding architecture that fuses image and text tokens early in a unified transformer to obtain joint representations.
- It employs a frozen discrete image tokenizer and bidirectional transformer layers for effective cross-modal self-attention at every encoding step.
- Empirical results show FuseLIP outperforms traditional dual-encoder and late-fusion models in fine-grained vision-language tasks.
FuseLIP is a multimodal embedding architecture that implements early fusion of discrete text and image tokens within a unified transformer encoder. FuseLIP departs from conventional dual-encoder (two-tower) contrastive language-image pre-training (CLIP-style) frameworks, enabling direct processing of concatenated image and text token sequences to obtain a joint representation. By leveraging discrete image tokenization and a single bidirectional transformer, FuseLIP facilitates cross-modal attention at every encoding layer, yielding improved performance on fine-grained vision-language tasks compared to late-fusion strategies (Schlarmann et al., 3 Jun 2025).
1. Model Architecture
FuseLIP employs a frozen discrete image tokenizer (TiTok), mapping each input image to a sequence of 128 integer tokens drawn from an image-specific sub-vocabulary $V_{\text{img}}$. In parallel, text inputs are tokenized via byte-pair encoding (BPE), producing tokens from a disjoint text sub-vocabulary $V_{\text{txt}}$. The resulting token sequence consists of a special <bot> token, the image tokens, an <eot> token, the text tokens, and a final <eot>:

$$s = \big(\texttt{<bot>},\; t^{\text{img}}_1, \dots, t^{\text{img}}_{128},\; \texttt{<eot>},\; t^{\text{txt}}_1, \dots, t^{\text{txt}}_m,\; \texttt{<eot>}\big)$$
This sequence is passed through learned token and positional embeddings, which are summed to produce the layer-0 input of a shared bidirectional transformer encoder (no causal masking). At each layer, cross-modal self-attention operates over the joint token stream. The final multimodal embedding is the $\ell_2$-normalized output vector at the position of the last <eot> token after the final transformer block. This early-fusion topology contrasts with score-fusion and MagicLens-style late-fusion baselines, which restrict cross-modal integration to post-encoding modules (Schlarmann et al., 3 Jun 2025).
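As an illustration, the joint input sequence can be assembled as below. The special-token ids, vocabulary offsets, and codebook size are assumptions made for this sketch, not values given by the paper:

```python
# Sketch of FuseLIP's early-fusion input construction.
# Special-token ids, vocabulary offsets, and codebook size are assumed.
BOT, EOT = 0, 1                  # <bot>, <eot> (ids assumed)
IMG_OFFSET = 2                   # image sub-vocabulary starts here (assumed)
TXT_OFFSET = IMG_OFFSET + 4096   # disjoint text sub-vocabulary (size assumed)

def build_sequence(img_tokens, txt_tokens):
    """<bot> + 128 image tokens + <eot> + text tokens + <eot>."""
    assert len(img_tokens) == 128, "TiTok emits a fixed 128-token code"
    img = [IMG_OFFSET + t for t in img_tokens]
    txt = [TXT_OFFSET + t for t in txt_tokens]
    return [BOT] + img + [EOT] + txt + [EOT]

seq = build_sequence(list(range(128)), [5, 9, 13])
```

The disjoint offsets ensure image and text tokens index non-overlapping rows of the shared embedding matrix, so the encoder can distinguish the two modalities from the token ids alone.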
2. Input Representation and Embedding
Formally, given token indices $s_1, \dots, s_n$ with each $s_i$ an index into the joint vocabulary, the input representation at position $i$ is computed as:

$$h^{(0)}_i = E[s_i] + p_i,$$

where $E$ is the shared token embedding matrix and $p_i$ is the learned position embedding. At each transformer layer $\ell = 1, \dots, L$, the representation is updated as:

$$h^{(\ell)} = \mathrm{Block}_\ell\big(h^{(\ell-1)}\big).$$

The final output used as the multimodal embedding is the normalized vector at the last <eot> position:

$$z = \frac{h^{(L)}_{i^\ast}}{\big\lVert h^{(L)}_{i^\ast} \big\rVert_2},$$

where $i^\ast$ is the position of the last <eot> token.
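Under these definitions, the embedding computation can be sketched as follows. The dimensions, random weights, and the omitted transformer body are placeholders; only the embedding sum and the normalized readout mirror the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, n = 6000, 512, 134         # illustrative sizes, not the paper's
E = rng.standard_normal((vocab_size, d))  # shared token embedding matrix E
P = rng.standard_normal((n, d))           # learned positional embeddings p_i

seq = rng.integers(0, vocab_size, size=n)
h = E[seq] + P                            # layer-0 input h^(0)
# ...L bidirectional transformer blocks would update h here...

i_star = n - 1                            # position of the last <eot> (assumed)
z = h[i_star] / np.linalg.norm(h[i_star]) # l2-normalized multimodal embedding
```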
3. Training Objectives and Loss Functions
FuseLIP is trained end-to-end (excluding the frozen tokenizer) with a composite objective comprising a SigLIP-style contrastive loss and a masked multimodal modeling (MMM) loss:
- SigLIP Contrastive Loss: For a batch of $B$ pairs with embeddings $x_i$ (image side) and $y_j$ (text side), and labels $z_{ij} = 1$ if $x_i$ and $y_j$ correspond, $z_{ij} = -1$ otherwise, the loss is:

$$\mathcal{L}_{\text{SigLIP}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{B} \log \sigma\big(z_{ij}\,(t\, x_i \cdot y_j + b)\big),$$

with learnable temperature $t$ and bias $b$, and the encoder producing $\ell_2$-normalized vectors.
- Masked Multimodal Modeling (MMM) Loss: With token masking probability $p$, let $M$ denote the set of masked token positions and $s_i$ the corresponding true labels; using a shared classifier head $g$:

$$\mathcal{L}_{\text{MMM}} = \frac{1}{|M|} \sum_{i \in M} \mathrm{CE}\big(g(h^{(L)}_i),\, s_i\big),$$

where $\mathrm{CE}$ is the cross-entropy loss.
The final optimization objective is the sum of the two terms:

$$\mathcal{L} = \mathcal{L}_{\text{SigLIP}} + \mathcal{L}_{\text{MMM}}$$
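Both loss terms can be sketched compactly under the standard SigLIP formulation; the temperature $t$ and bias $b$ are learnable during training, and the batch values below are purely illustrative:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid contrastive loss over l2-normalized embeddings."""
    B = img_emb.shape[0]
    logits = t * img_emb @ txt_emb.T + b       # (B, B) similarity logits
    labels = 2.0 * np.eye(B) - 1.0             # +1 on matching pairs, -1 elsewhere
    nll = np.logaddexp(0.0, -labels * logits)  # -log sigma(labels * logits)
    return float(nll.sum(axis=1).mean())

def mmm_loss(logits, targets, masked_positions):
    """Cross-entropy over masked token positions (shared classifier assumed)."""
    losses = []
    for i in masked_positions:
        z = logits[i] - logits[i].max()        # numerically stable log-softmax
        log_probs = z - np.log(np.exp(z).sum())
        losses.append(-log_probs[targets[i]])
    return float(np.mean(losses))

# Toy batch: perfectly aligned unit embeddings and confident mask predictions.
emb = np.eye(4)
cls_logits = np.full((4, 10), -5.0)
targets = [3, 7, 1, 0]
for i, tgt in enumerate(targets):
    cls_logits[i, tgt] = 5.0

total = siglip_loss(emb, emb) + mmm_loss(cls_logits, targets, [0, 2])
```

With aligned embeddings the contrastive term approaches $\log 2$ per row (matching logits sit at $t + b = 0$), and confident mask predictions drive the MMM term toward zero.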
4. Datasets and Pre-training
FuseLIP is trained using both unimodal (image–text pairs) and generated multimodal datasets. The training sources and their respective data modalities are summarized as follows:
| Dataset | I→T (img→txt) | IT→I (prompted img) | IT→T (prompted txt) |
|---|---|---|---|
| CC3M | 2.6M | 0 | 0 |
| CC12M | 10.6M | 0 | 0 |
| CC3M-TGIT | 0.3M | 0.3M | 0 |
| CC3M-VQA | 0 | 0 | 2.4M |
| VG-VQA | 0 | 0 | 0.7M |
| VG-Crop | 0 | 5.4M | 0 |
| HQ-Edit | 0 | 0.3M | 0.3M |
- CC3M/CC12M: Standard captioned image-text datasets.
- CC3M/CC12M-TGIT: Text-guided image transformation data (e.g., cropping, rotation).
- CC3M-VQA, VG-VQA: Synthetic VQA using Llama-3 Instruct and Visual Genome.
- VG-Crop: Grounding tasks from Visual Genome region captions.
- HQ-Edit: Scripted image edits with known inverses.
Hard negatives (e.g., multiple transformed versions of the same image, alternative region captions, inverse edits) are included in each batch to make contrastive training more robust and to sharpen task alignment (Schlarmann et al., 3 Jun 2025).
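The hard-negative idea can be illustrated with a toy batch builder; the transform names and caption scheme are assumptions for this sketch, not the paper's exact pipeline:

```python
# Toy hard-negative batch: several transformed views of one image, each paired
# with a caption naming its transformation, so the other views act as
# in-batch hard negatives. Transform names and captions are assumed.
TRANSFORMS = ["identity", "crop", "rotate", "hflip"]

def batch_with_hard_negatives(image_id):
    return [(f"{image_id}|{t}", f"the image after {t}") for t in TRANSFORMS]

batch = batch_with_hard_negatives("img_0001")
```

Because every caption in the batch differs only in the named transformation, the contrastive loss can only be minimized by actually attending to the image content, not the shared caption template.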
5. Evaluation and Empirical Results
FuseLIP is evaluated on a suite of downstream tasks, including zero-shot image classification (e.g., ImageNet-1k, CIFAR, Places), visual question answering (OK-VQA, GQA, Visual7W, TextVQA), retrieval (MSCOCO-t2i/i2t, CIRR, FashionIQ), grounding (RefCOCO, Visual7W pointing), and specifically constructed tasks for multimodal evaluation such as OI-Crop, OI-Pos, VG-Crop, and TGIT.
| Model | Classification | VQA | Retrieval | Grounding | VG-Crop | OI-Crop | OI-Pos | TGIT |
|---|---|---|---|---|---|---|---|---|
| FuseLIP-B+MMM (CC3M+MM) | 23.3 | 17.5 | 15.0 | 82.4 | 55.8 | 68.1 | 70.8 | 94.3 |
| FuseLIP-B+MMM (CC12M+MM) | 31.2 | 19.8 | 26.2 | 82.3 | 32.7 | 61.5 | 71.3 | 94.2 |
Key findings include:
- FuseLIP-B (+MMM) achieves the top accuracy in the majority of multimodal tasks, surpassing score-fusion SigLIP-SF and transformer-fusion MLF baselines, even with fewer trainable parameters.
- The most significant improvements are observed on tasks (such as TGIT, crop/rotate/flip subtasks) requiring explicit, integrated vision-language reasoning, where early fusion is essential.
- Ablation studies indicate a critical dependence on hard negatives (removal degrades certain tasks by up to 80%) and the Masked Multimodal Modeling loss (removal results in 2ā4% accuracy drops, especially for multimodal tasks) (Schlarmann et al., 3 Jun 2025).
6. Comparison with Prior Approaches
Traditional CLIP-derived models utilize either score fusion (embedding summation) or a shallow fusion network (MLF) for late integration of separately encoded modalities, limiting cross-modal exchange to the topmost layers. FuseLIP, by contrast, achieves early fusion: cross-modal self-attention is enabled from the very first transformer layer. Empirical results demonstrate that this architecture produces richer, more structurally aligned embeddings for multimodal tasks requiring fine-grained feature fusion. The early-fusion approach also mitigates the information bottleneck present in late-fusion settings and allows gradient-based learning to jointly inform both image and text modalities (Schlarmann et al., 3 Jun 2025).
7. Implementation Details, Limitations, and Prospects
Model Sizes and Configurations:
- FuseLIP-S: TiTok-S (128 tokens), 12-layer ViT-S (d = 384, 6 heads), 42M trainable + 25M frozen parameters.
- FuseLIP-B: TiTok-BL (128), 12-layer ViT-B (d = 512, 8 heads), 65M trainable + 86M frozen parameters.
- Baselines employ OpenCLIP ViT-S/B, with additional 4-layer fusion modules where relevant.
Optimization and Infrastructure:
- Optimizer: AdamW
- Learning rate schedule: cosine decay with 12k warmup steps
- Batch size: 2048
- Training spans 8 epochs (CC3M+MM) or 16 epochs (CC12M+MM), image resolution 256, context length 180
- Memory consumption is ~11GB (FuseLIP-S) vs. ~19GB (baselines). Training is ~20% faster due to the frozen tokenizer.
- Implemented in PyTorch (using OpenCLIP), trained on NVIDIA A100 GPUs.
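For reference, the reported hyperparameters can be collected into a single configuration sketch (the dict layout is illustrative; the base learning rate and AdamW betas are not recoverable from this summary and are omitted):

```python
# Reported FuseLIP training configuration (layout illustrative).
config = {
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "warmup_steps": 12_000,
    "batch_size": 2048,
    "image_resolution": 256,
    "context_length": 180,
    "epochs": {"CC3M+MM": 8, "CC12M+MM": 16},
}
```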
Limitations and Future Directions:
- Current results are confined to academic-scale corpora (CC3M/CC12M); scalability to 100M+ captions and the impact on performance remain unexplored.
- The two-stage process (frozen image tokenizer → transformer) introduces higher inference latency compared to direct vision encoders, though this may diminish with larger model widths or future optimized tokenizers.
- Potential extensions include multimodal chains (interleaved images and text), video modeling, group-wise retrieval, and integration with multimodal LLMs (Schlarmann et al., 3 Jun 2025).
In summary, FuseLIP demonstrates that replacing late-fusion architectures with a frozen discrete image tokenizer coupled to a unified transformer encoder substantially enhances performance on multimodal reasoning tasks, providing a robust foundation for future vision-language representation learning.