
Early Fusion: One-Tower Encoders

Updated 16 March 2026
  • Early fusion is a data-level approach that merges raw or minimally processed signals from multiple modalities into a shared one-tower encoder, enabling early cross-modal interactions.
  • It employs operations such as concatenation, learned projections, element-wise multiplication, and gating to create joint representations before deep feature extraction.
  • This method enhances fine-grained alignment and reduces training complexity, though it remains sensitive to noise and misalignment in the input data.

Early fusion, also known as data-level fusion or the one-tower encoder paradigm, is a structural strategy in multimodal machine learning in which the raw or minimally processed signals from multiple modalities are combined at the input stage and fed into a single, shared encoder. This approach enables cross-modal interaction from the earliest representational layers, allowing complex dependencies and joint features to emerge throughout the model. It contrasts with feature-level (intermediate) and output-level (late) fusion, which defer cross-modal merging to deeper layers or after unimodal processing and are commonly realized as multi-tower or dual-branch architectures (Li et al., 2024).

1. Conceptual Foundations and Typology

Early fusion is situated at the "data-level" in the commonly adopted structural taxonomy of multimodal alignment and fusion. The defining characteristic is that all available modalities—such as images, audio, text, and depth maps—are merged into a unified representation prior to any deep feature extraction and processed together through a single stack of shared parameters (e.g., CNN, transformer, MLP). Classic fusion operations include channel- or sequence-wise concatenation, learned linear projections, element-wise (Hadamard) product, and gating mechanisms:

Operation | Mathematical formulation | Characteristics
Concatenation | z = [x_1; x_2; …; x_M] | Preserves all input dimensions
Learned linear projection | z = W[x_1; x_2] + b | Projects to an arbitrary fused space
Element-wise (Hadamard) product | z = x_1 ∘ x_2 | Captures multiplicative cross-modal interactions
Gated fusion | z = σ(W_g[x_1; x_2] + b_g) ⊙ f(W_f[x_1; x_2] + b_f) | Dynamically weights each modality's contribution
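As a concrete illustration, the four operators can be sketched in NumPy (toy dimensions; random weights stand in for learned parameters, and the activation f in the gated form is taken to be tanh):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature vectors (e.g., image and text), both of dimension d.
d = 4
x1, x2 = rng.normal(size=d), rng.normal(size=d)

def concat(xs):
    # z = [x_1; x_2; ...; x_M] -- preserves every input dimension.
    return np.concatenate(xs)

def linear_projection(xs, W, b):
    # z = W[x_1; x_2] + b -- projects to an arbitrary fused space.
    return W @ np.concatenate(xs) + b

def hadamard(x1, x2):
    # z = x_1 ∘ x_2 -- multiplicative cross-modal interactions.
    return x1 * x2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(x1, x2, Wg, bg, Wf, bf):
    # z = sigma(W_g[x1;x2] + b_g) ⊙ tanh(W_f[x1;x2] + b_f)
    h = np.concatenate([x1, x2])
    return sigmoid(Wg @ h + bg) * np.tanh(Wf @ h + bf)

W = rng.normal(size=(d, 2 * d)); b = np.zeros(d)
Wg = rng.normal(size=(d, 2 * d)); bg = np.zeros(d)
Wf = rng.normal(size=(d, 2 * d)); bf = np.zeros(d)

print(concat([x1, x2]).shape)                       # concatenation doubles the dimension
print(linear_projection([x1, x2], W, b).shape)      # projection maps back to d
print(gated_fusion(x1, x2, Wg, bg, Wf, bf).shape)   # gated output also has dimension d
```

Note that concatenation grows the fused dimension linearly with the number of modalities, while the projection and gated forms keep it fixed, which matters when many modalities are stacked.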

This contrasts with intermediate fusion (separate encoders, fusion at feature level) and late fusion (independent branches merged at the output or decision stage). (Li et al., 2024)

2. Architectural Implementations

Transformer-Based Early Fusion

In mixed-modal transformer settings, early fusion builds token sequences by interleaving or concatenating modalities, which are then processed through standard multi-head self-attention layers. The Chameleon model forms a single stream of interleaved text and (VQ-VAE-quantized) image tokens, embedded via a shared lookup table with positional encoding and fed directly into decoder-only transformer blocks with mixed-modal attention (Team, 2024). In retrieval, the Joint Fusion Encoder concatenates visual and language tokens, pools a special [Emb] token as the joint embedding, and realizes cross-modal attention at every layer with a single multimodal transformer encoder (Huang et al., 27 Feb 2025). For audio-visual tasks, densely factorized attention enables fine-grained local fusion while controlling quadratic complexity (Mo et al., 2023).
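A minimal NumPy sketch of the interleaved, shared-embedding setup (the vocabulary split and sizes are hypothetical; a single unprojected attention layer stands in for the full decoder stack, and causal masking and output projections are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)

# Unified vocabulary: ids [0, 100) are text tokens, [100, 200) are image
# tokens (hypothetical sizes; in Chameleon, image ids come from a VQ-VAE codebook).
vocab_size, d = 200, 8
embed = rng.normal(size=(vocab_size, d))   # shared lookup table for both modalities
pos = rng.normal(size=(16, d))             # positional encodings

# One interleaved mixed-modal sequence: text, image, then text tokens.
ids = np.array([5, 17, 101, 150, 130, 42, 9])
h = embed[ids] + pos[: len(ids)]           # (seq, d)

# A single self-attention layer: every token attends to every other token,
# so cross-modal interaction happens from the very first layer.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = h @ Wq, h @ Wk, h @ Wv
scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v                             # fused representation, shape (seq, d)
print(out.shape)
```

The key property the sketch demonstrates is that text and image tokens share one embedding table and one attention computation; nothing in the architecture distinguishes the modalities beyond their token ids.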

CNN-Based and Hybrid Encoders

In medical imaging, fusion is performed by concatenating registered MRI channel volumes at the input to a 3D U-Net or nnU-Net, with the earliest convolutional layer operating on the stacked signal tensor (Remedios et al., 2024). For multimodal sequential data, a ConvLSTM may merge vectorized audio spectrograms and visual patches in its first recurrent layer, combining at the lowest possible representational level (Barnum et al., 2020).
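The channel-stacking step can be sketched as follows (toy volumes; a 1×1×1 convolution, written as an einsum over channels, stands in for the first layer of a hypothetical 3D U-Net):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two registered MRI contrasts (e.g., T1 and FLAIR) on the same voxel grid.
D = H = W = 8
t1 = rng.normal(size=(D, H, W))
flair = rng.normal(size=(D, H, W))

# Early fusion: stack along a new channel axis before any feature extraction.
x = np.stack([t1, flair], axis=0)                  # (2, D, H, W)

# First convolutional layer operating on the stacked tensor: a 1x1x1 conv
# already mixes the two modalities at every voxel.
out_ch = 4
kernel = rng.normal(size=(out_ch, 2))              # (out_channels, in_channels)
features = np.einsum("oc,cdhw->odhw", kernel, x)   # (4, D, H, W)
print(features.shape)
```

Because the modalities are mixed voxel-by-voxel from the first layer, any registration error between the contrasts propagates directly into the fused features, which is exactly the sensitivity reported above.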

Discrete Token Unification

Tokenization-based early fusion is typified by FuseLIP and Ichigo, where inputs from multiple modalities (e.g., image patches via VQ-VAE/qVAE, raw speech via vector quantization) are both mapped to the same discrete token space and modeled with a single embedding table. A unified vocabulary enables interleaved processing and a shared transformer stack for all modalities (Schlarmann et al., 3 Jun 2025, Dao et al., 2024).
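The unified vocabulary can be sketched with modality-specific id offsets (codebook sizes and the `to_unified` helper are illustrative; real systems also insert modality-boundary special tokens):

```python
import numpy as np

# Hypothetical codebook sizes: text tokens, VQ image codes, speech codes.
TEXT_V, IMAGE_V, SPEECH_V = 32_000, 8_192, 1_024

# Offsets carve out disjoint id ranges inside one unified vocabulary,
# so a single embedding table (and transformer) handles all modalities.
IMAGE_OFFSET = TEXT_V
SPEECH_OFFSET = TEXT_V + IMAGE_V
UNIFIED_V = TEXT_V + IMAGE_V + SPEECH_V

def to_unified(text_ids, image_codes, speech_codes):
    """Map modality streams into the shared token space (ordering is illustrative)."""
    return np.concatenate([
        np.asarray(text_ids),
        np.asarray(image_codes) + IMAGE_OFFSET,
        np.asarray(speech_codes) + SPEECH_OFFSET,
    ])

seq = to_unified([5, 99], [12, 500], [7])
print(seq)   # image and speech ids are shifted into their own ranges
```

A single embedding table of size `UNIFIED_V` then serves every modality, which is what makes interleaved processing through one shared transformer stack possible.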

Fusion Operators for Non-Modal Features

In strictly unimodal transformer encoders, early fusion concerns content/position fusion—e.g., summing or gating token embeddings with positional encodings. In long-sequence language modeling, the choice of fusion operator (addition, learned projection, or gating) has been shown to be a significant modeling decision (Hallam et al., 9 Jan 2026).
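The three content/position operators can be sketched side by side (random weights stand in for learned parameters; the gating scheme shown is one plausible form, not the specific one studied in the cited work):

```python
import numpy as np

rng = np.random.default_rng(3)
seq_len, d = 6, 8
tok = rng.normal(size=(seq_len, d))   # content (token) embeddings
pos = rng.normal(size=(seq_len, d))   # positional embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Additive fusion: the standard transformer default.
add = tok + pos

# Learned projection: concatenate content and position, project back to d.
Wp = rng.normal(size=(d, 2 * d))
proj = np.concatenate([tok, pos], axis=-1) @ Wp.T

# Gated fusion: a sigmoid gate decides, per dimension, how much positional
# signal to mix into the content stream.
Wg = rng.normal(size=(d, 2 * d))
g = sigmoid(np.concatenate([tok, pos], axis=-1) @ Wg.T)
gated = g * tok + (1.0 - g) * pos
print(add.shape, proj.shape, gated.shape)
```

All three produce a fused sequence of the same shape; they differ in whether the mixing weights are fixed (addition), globally learned (projection), or input-dependent (gating).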

3. Quantitative Performance and Empirical Findings

Early fusion consistently enables modeling of complex cross-modal dependencies, often leading to substantial improvements over late fusion in settings that demand fine-grained cross-modal alignment or reasoning. In the Joint Fusion Encoder for retrieval (Huang et al., 27 Feb 2025), Recall@5 improves by 8 points over two-tower baselines on fusion-dependent tasks. In audio-visual transformers, early fusion achieves state-of-the-art results in audio-event classification, sound segmentation, and visually guided sound separation, outperforming late fusion and unimodal approaches (Mo et al., 2023).

In segmentation tasks with imperfectly registered multimodal medical images, early fusion (image concatenation before the encoder) in nnU-Net yields a statistically significant mean Dice gain of +0.0021, but the magnitude is modest and sensitive to registration quality (Remedios et al., 2024). In RGB-D 3D object categorization, late fusion outperforms early fusion by up to 5.3% top-1 accuracy, attributed to better utilization of pretrained image representations and reduced overfitting in low-data regimes (Tziafas et al., 2022).

Mixed-modal models such as Chameleon (Team, 2024) and FuseLIP (Schlarmann et al., 3 Jun 2025) demonstrate that a one-tower encoder can achieve state-of-the-art or competitive multimodal and unimodal task performance at reduced training cost and model complexity relative to multi-tower or late-fusion systems. Empirical evaluation of FuseLIP indicates that early fusion improves grounding and fine-grained multimodal retrieval (e.g., +7–9% on grounding and composition subtasks) while matching zero-shot unimodal performance, and reduces training memory by 30–60%.

4. Advantages, Limitations, and Best Practices

Advantages

  • Full Cross-Modal Interaction: Joint feature learning from the lowest layer enables discovery of subtle modality synergies, supporting tasks such as referring expression segmentation and grounding (Li et al., 2024, Xiao et al., 2024, Zhang et al., 2024).
  • Architectural Simplicity: A single encoder stack eliminates the need for specialized fusion modules, cross-attention bridges, or separate per-modality architectures (Li et al., 2024, Team, 2024).
  • Latency and Compute: Early fusion can reduce inference passes and enable more efficient, truly joint embedding of multimodal queries for retrieval and generation.

Limitations

  • Sensitivity to Misalignment and Noise: Immediate mixing before alignment or denoising makes early fusion vulnerable to registration errors or noise in any modality; this is particularly marked in medical or physical sensor fusion (Remedios et al., 2024, Li et al., 2024).
  • Scaling: Simple concatenation can drastically increase sequence length or channel count, making compute prohibitive for high-resolution multimodal inputs. Factorized fusion and token bottlenecking may mitigate this (Mo et al., 2023, Li et al., 2024).
  • Catastrophic Representation Entanglement: Early fusion may obscure unimodal sub-spaces and complicate fine-tuning for tasks that rely on mode-specific information (Li et al., 2024).
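The bottleneck-token mitigation mentioned in the scaling limitation can be sketched as follows (toy dimensions, a single fusion round, and no learned query/key/value projections; a simplification of attention-bottleneck designs, not any cited model's exact method):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_audio, n_video, n_bottleneck = 8, 32, 64, 4

audio = rng.normal(size=(n_audio, d))
video = rng.normal(size=(n_video, d))
fusion = rng.normal(size=(n_bottleneck, d))     # small set of learned bottleneck tokens

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(queries, keys_values):
    # Unprojected cross-attention: queries read from keys_values.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

# All cross-modal traffic is routed through the bottleneck tokens:
# cost is O(B * (Na + Nv)) rather than O((Na + Nv)^2) for full joint attention.
joint = np.concatenate([audio, video], axis=0)
fused = cross_attend(fusion, joint)             # (4, d) summary of both modalities
audio_out = audio + cross_attend(audio, fused)  # each modality reads back the summary
video_out = video + cross_attend(video, fused)
print(fused.shape, audio_out.shape, video_out.shape)
```

With 4 bottleneck tokens and 96 modality tokens, attention cost drops by roughly an order of magnitude versus full joint self-attention, at the price of squeezing all cross-modal information through the small summary.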

Practical Recommendations

Challenge | Best practice
Compute/memory cost | Bottleneck tokens or factorized fusion (Mo et al., 2023)
Misalignment/noise | Alignment preprocessing or gated fusion (Remedios et al., 2024, Li et al., 2024)
Mixed-modality capacity | Modality dropout and balanced sampling during training (Li et al., 2024, Team, 2024)
Token inflation | Compression via VQ/token bottlenecks (Schlarmann et al., 3 Jun 2025)
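The modality-dropout recommendation can be sketched as a training-time transform (the `modality_dropout` helper is illustrative, not taken from any cited work):

```python
import numpy as np

rng = np.random.default_rng(5)

def modality_dropout(feats, p_drop=0.3, rng=rng):
    """Zero out entire modalities at random during training so the shared
    encoder cannot over-rely on any single input stream.
    `feats` maps modality name -> feature array; at least one is always kept."""
    names = list(feats)
    keep = [n for n in names if rng.random() >= p_drop]
    if not keep:                      # never drop every modality at once
        keep = [rng.choice(names)]
    return {n: (feats[n] if n in keep else np.zeros_like(feats[n]))
            for n in names}

batch = {"image": np.ones((4, 8)), "text": np.ones((6, 8))}
out = modality_dropout(batch, p_drop=0.5)
print(sorted(out))   # same modality keys; some streams may be zeroed
```

Keeping the zeroed stream in place (rather than removing it) preserves the input shape the one-tower encoder expects, while forcing it to learn representations robust to missing modalities.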

5. Case Studies and Application Domains

Early fusion one-tower architectures are utilized in a diverse array of settings:

  • Medical Imaging: Multi-channel data-level fusion improves organ segmentation, but requires careful handling of registration and contrast heterogeneity (Remedios et al., 2024).
  • Multimodal Retrieval and Generation: The one-tower fusion approach enables unified fine-grained context modeling in open-vocabulary vision-language retrieval (Huang et al., 27 Feb 2025), mixed-modal generative LLMs (Team, 2024), and speech–text question answering (Dao et al., 2024).
  • Grounding and Referring Tasks: In referring segmentation and grounding, early fusion with a shared transformer achieves state-of-the-art accuracy with more efficient computation (e.g., OneRef and EVF-SAM reduce parameters and FLOPs by up to 4× over two-tower counterparts) (Xiao et al., 2024, Zhang et al., 2024).
  • Contrastive and Masked Multimodal Embedding: Unified token spaces in FuseLIP support efficient pretraining and compositionally challenging VQA and retrieval queries (Schlarmann et al., 3 Jun 2025).
  • Long-Distance Feature Fusion in Transformers: Explicitly learned or gated fusion of content and positional (or modality) embeddings improves long-sequence modeling (Hallam et al., 9 Jan 2026).
  • Dense Audio-Visual Perception: Dense early fusion and factorized attention enable efficient and effective multimodal segmentation, classification, and localization (Mo et al., 2023).
  • Realtime Voice Assistance: Quantized token-based early fusion in Ichigo enables low-latency speech–text reasoning and generation with minimal parameter expansion and almost no loss of unimodal LLM performance (Dao et al., 2024).

Three dominant empirical trends emerge across recent literature:

  1. Task/Domain Dependence: Gains from early fusion are pronounced for tasks involving fine-grained cross-modal alignment (grounding, composition, retrieval, dense segmentation), but can be less so for simple multimodal classification, and may underperform late fusion in scenarios with strong unimodal pretraining and few-shot learning (Tziafas et al., 2022, Remedios et al., 2024, Schlarmann et al., 3 Jun 2025).
  2. Training Stability and Scaling: Mixed-modal tokenization and shared attention architectures require specialized normalization, tailored curricula, or architectural modifications (e.g., QK-Norm, norm reordering, z-loss) to ensure stable convergence at scale (Team, 2024).
  3. Unified Foundation Models: Large, early-fusion one-tower encoders (Chameleon, Ichigo, FuseLIP) demonstrate that truly integrated, multimodal modeling across language, vision, and audio is feasible and effective with discrete, bottlenecked token spaces—enabling flexible generation and understanding in arbitrary sequence order (Schlarmann et al., 3 Jun 2025, Team, 2024, Dao et al., 2024).

Current research prioritizes interpretable gating, fused token compression, and curriculum design to further enhance the reliability and efficiency of early fusion at scale, as well as best practices for robust multimodal alignment in high-noise regimes (Li et al., 2024, Mo et al., 2023).
