DeepStack Multi-Level Fusion
- DeepStack Multi-Level Fusion is a multimodal integration strategy that progressively injects grouped visual tokens into transformer layers to reduce computational cost and increase representational power.
- It employs a residual-style, layerwise token injection mechanism that partitions high-resolution visual data to substantially reduce attention complexity.
- Empirical benchmarks show that DeepStack consistently outperforms traditional early fusion models in tasks like VQA, TextVQA, and DocVQA while using a fraction of the computational resources.
DeepStack Multi-Level Fusion is a strategy for multimodal integration within transformer-based architectures, particularly Large Multimodal Models (LMMs), designed to improve both computational efficiency and representational power in visual-language reasoning tasks. Rather than concatenating all visual tokens with text at the initial input stage—a common approach that incurs significant quadratic costs in memory and computation—DeepStack injects groups of visual tokens progressively at different depths of the transformer stack. This multi-level fusion paradigm leads to substantial reductions in sequence length per layer, increased expressivity across modalities, and empirically validated improvements in benchmarks over traditional single-stage fusion. The framework is conceptually related to previous sensor fusion strategies such as CentralNet, but specialized for high-resolution visual token streaming in vision-LLMs (Meng et al., 2024, Vielzeuf et al., 2018).
1. Formal Framework and Multi-Level Fusion Mechanism
Let $V = \{v_1, \ldots, v_N\}$ denote the set of $N$ visual tokens, extracted for instance via a CLIP-style vision encoder, and let $L$ be the number of transformer layers in the LLM or vision transformer. DeepStack partitions $V$ into $S$ contiguous groups (chunks) of roughly equal size $N/S$, so that each $V_s \subseteq V$ with $|V_s| \approx N/S$. At transformer layer $l$, the text token sequence $T$ is concatenated with the visual group $V_{s(l)}$ for that specific layer. The resulting hidden state is then given by

$$H^{(l)} = \mathrm{Layer}_l\big([\,T^{(l-1)};\, V_{s(l)}\,]\big),$$
where “[ ; ]” denotes concatenation along the token dimension and $T^{(l-1)}$ is the text hidden state after layer $l-1$. In practice, DeepStack employs a residual-style injection at designated layers: incoming visual tokens are added elementwise into reserved prefix positions of the hidden state tensor $H^{(l)}$:

$$H^{(l)}[:, \text{vis\_pos}, :] \mathrel{+}= V_{s(l)}.$$
```python
for idx, layer in enumerate(self.transformer_layers):
    if idx >= l_start and (idx - l_start) % interval == 0:
        block_id = (idx - l_start) // interval
        H[:, vis_pos, :] += V_stack[block_id]
    H = layer(H)
```
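The grouping of visual tokens into the stack consumed by such a loop can be sketched as follows (a minimal NumPy illustration; the function name and shapes are assumptions, not the reference implementation):

```python
import numpy as np

def partition_visual_tokens(V, num_groups):
    """Split N visual tokens into num_groups contiguous, equal-sized chunks."""
    N = V.shape[1]
    assert N % num_groups == 0, "assume N is divisible by the number of groups"
    # Result has shape (num_groups, batch, N // num_groups, dim).
    return np.stack(np.split(V, num_groups, axis=1))

V = np.random.rand(1, 12, 4)          # (batch, N = 12 visual tokens, dim)
V_stack = partition_visual_tokens(V, num_groups=3)
print(V_stack.shape)                  # (3, 1, 4, 4)
```

Each entry of `V_stack` is then injected at one designated layer, so no single layer ever attends over all $N$ visual tokens at once.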
2. Comparisons to Single-Stage (Early) Fusion Architectures
Standard early fusion (baseline) approaches, exemplified by models such as LLaVA-1.5, prepend the entire set of $N$ visual tokens to the $T$ text tokens and pass this sequence through every transformer layer. This results in a constant sequence length of $T + N$ at every layer and an overall attention complexity of $O\big(L\,(T+N)^2\big)$. By contrast, DeepStack restricts the number of injected visual tokens to $N/S$ per designated layer, so for those layers

$$\text{sequence length} = T + N/S.$$
The full model thus attains attention costs of approximately $O\big(L\,(T + N/S)^2\big)$, with peak attention matrix size reduced by up to a factor of $\big(\tfrac{T+N}{T+N/S}\big)^2$. For scenarios where $N \gg T$, this translates to a reduction approaching $S^2$ in both total and peak quadratic costs. Empirically, DeepStack achieves these efficiency gains while matching or exceeding the accuracy of full-length single-stage fusion models (Meng et al., 2024).
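The arithmetic can be made concrete with a back-of-the-envelope calculation; the values of `T`, `N`, `S`, and `L` below are illustrative choices, not figures from the paper:

```python
# Dominant quadratic term of self-attention cost, summed over layers,
# for early fusion vs. DeepStack-style layerwise injection.
def attention_cost(seq_len: int, num_layers: int) -> int:
    return num_layers * seq_len ** 2

T, N, S, L = 64, 2880, 5, 32   # text tokens, visual tokens, groups, layers

early_fusion = attention_cost(T + N, L)        # all N visual tokens, every layer
deepstack = attention_cost(T + N // S, L)      # only N/S visual tokens per layer

print(f"early fusion: {early_fusion:,}")
print(f"deepstack:    {deepstack:,}")
print(f"reduction:    {early_fusion / deepstack:.1f}x")
```

With `N` much larger than `T`, the ratio approaches the $S^2 = 25$ upper bound; here the text tokens keep it somewhat lower.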
3. Algorithmic Implementation and Pseudocode
The DeepStack fusion pipeline comprises the following key algorithmic stages:
- Compute initial embeddings for text tokens (and visual placeholder positions) to yield $H^{(0)}$.
- For $l$ in $1$ to $L$: a. If layer $l$ is a fusion layer, add $V_{s(l)}$ into the visual positions of $H^{(l-1)}$. b. Compute $H^{(l)} = \mathrm{Layer}_l(H^{(l-1)})$.
- Decode model output from $H^{(L)}$.
In code, high-level logic is captured as follows:
```python
def forward(self, H, V_stack, l_start, interval, vis_pos):
    for idx, layer in enumerate(self.transformer_layers):
        if idx >= l_start and (idx - l_start) % interval == 0:
            block_id = (idx - l_start) // interval
            H[:, vis_pos, :] += V_stack[block_id]
        H = layer(H)
    return H
```
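A self-contained toy version of this forward pass, with identity functions standing in for transformer blocks so the injection schedule can be inspected in isolation (all names and dimensions here are illustrative, not the reference implementation):

```python
import numpy as np

class DeepStackToy:
    def __init__(self, num_layers):
        # Each "layer" is the identity so only the injections affect H.
        self.transformer_layers = [lambda h: h for _ in range(num_layers)]

    def forward(self, H, V_stack, l_start, interval, vis_pos):
        for idx, layer in enumerate(self.transformer_layers):
            if idx >= l_start and (idx - l_start) % interval == 0:
                block_id = (idx - l_start) // interval
                H[:, vis_pos, :] += V_stack[block_id]
            H = layer(H)
        return H

batch, seq_len, dim = 1, 8, 4
vis_pos = slice(0, 2)                    # reserved prefix positions
H = np.zeros((batch, seq_len, dim))
V_stack = np.ones((3, batch, 2, dim))    # 3 visual groups of 2 tokens each

model = DeepStackToy(num_layers=6)
out = model.forward(H, V_stack, l_start=0, interval=2, vis_pos=vis_pos)
# Layers 0, 2, and 4 each inject one group, so prefix positions accumulate 3.
print(out[0, 0, 0])   # 3.0
```

With identity layers, the prefix positions end up holding the sum of all injected groups while every other position stays untouched, which makes the schedule easy to verify before swapping in real transformer blocks.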
4. Empirical Performance and Benchmarks
DeepStack architectures, when evaluated on nine vision-language datasets (VQA-v2, GQA, TextVQA, DocVQA, InfoVQA, SEED, POPE, MMMU, MM-Vet), yield consistent improvements in accuracy. With a 7B-parameter LLM and fixed context length (576 tokens), DeepStack-L achieves:
| Task | Baseline (LLaVA-1.5-7B) | DeepStack-L | Δ Improvement |
|---|---|---|---|
| TextVQA | 58.2 | 62.4 | +4.2 |
| DocVQA | 28.1 | 39.1 | +11.0 |
| InfoVQA | 25.8 | 29.8 | +4.0 |
Average gain across all nine tasks is approximately +2.7 points. In high-resolution settings (DeepStack-L-HD, with 4× visual tokens and context length 2,880), further improvements are observed: e.g., TextVQA at 66.7 (+8.5 over baseline), DocVQA at 78.8 (+50.7). Notably, reducing context length fivefold (2,880 to 576 tokens) with DeepStack incurs negligible accuracy loss, still matching or surpassing full-length baselines—demonstrating substantial gains in both efficiency and performance (Meng et al., 2024).
5. Benefits, Scalability, and Design Limitations
DeepStack’s progressive stacking imparts multiple benefits:
- Efficiency: Quadratic attention and memory costs scale with $(T + N/S)^2$ per layer rather than $(T + N)^2$, providing an up-to-$S^2$-fold reduction in the dominant terms.
- Expressivity: Layerwise injection of visual tokens enables the transformer to first “encode” high-resolution image details, then perform inter-modality sequence modeling as the stack proceeds.
- Scalability: Without altering the transformer’s global context window, DeepStack accommodates substantially more visual tokens (e.g., 4×) at modest additional cost.
- Empirical Effectiveness: Consistent improvements of +2–4 points on general VQA and +4–11 points on text/document VQA tasks relative to early fusion baselines.
Several limitations are recognized:
- Heuristic Scheduling: The choice of injection slot (start layer, interval, stack count) is empirical rather than learned.
- Residual-Only Fusion: The present mechanism lacks dynamic gating or learned cross-attention for token integration.
- Uniform Partitioning: Static, equal-sized token blocks may fall short of optimal partitions for certain tasks or images.
- Implementation Overhead: Requires moderate code modification to enable mid-stack token injection, but introduces no extra trainable parameters (Meng et al., 2024).
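The heuristic schedule noted above can be made concrete with a small helper (a hypothetical name, sketching how a (start layer, interval, stack count) choice maps to injection layers):

```python
def injection_layers(l_start, interval, num_stacks, num_layers):
    """Layer indices at which the visual groups are residually injected."""
    layers = [l_start + i * interval for i in range(num_stacks)]
    assert layers[-1] < num_layers, "schedule must fit within the layer stack"
    return layers

# Example: inject 4 visual groups every 2nd layer starting at layer 0,
# inside a 32-layer transformer.
print(injection_layers(l_start=0, interval=2, num_stacks=4, num_layers=32))
# [0, 2, 4, 6]
```

Because these three values are fixed hyperparameters rather than learned quantities, finding a good schedule currently requires empirical sweeps.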
6. Contextualization with Related Multi-Level Fusion Strategies
Multi-level fusion in DeepStack is conceptually linked to hierarchical fusion architectures in multi-sensor integration, notably CentralNet (Vielzeuf et al., 2018). In CentralNet, fusion is performed at each abstraction layer by interleaving a “central” network with multiple unimodal backbones. The fused state at each depth is given by:

$$h_C^{(l+1)} = \alpha_C^{(l)}\, h_C^{(l)} + \sum_{m} \alpha_m^{(l)}\, h_m^{(l)},$$

where $\alpha_C^{(l)}$ and $\alpha_m^{(l)}$ are trainable scalars, and $h_C^{(l)}$, $h_m^{(l)}$ are central and modality-specific hidden states, respectively. CentralNet employs a multi-objective loss, enforcing each modality’s expressiveness and the central net’s discrimination. The fusion weights adaptively enable mixtures of early and late fusion in an end-to-end differentiable manner.
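A minimal sketch of this weighted layerwise fusion in plain NumPy (the scalar weights are fixed here for illustration, whereas CentralNet learns them end-to-end):

```python
import numpy as np

def centralnet_fuse(h_central, h_modalities, w_central, w_modalities):
    """Weighted sum of the central state and each modality's hidden state."""
    fused = w_central * h_central
    for w, h in zip(w_modalities, h_modalities):
        fused = fused + w * h
    return fused

h_c = np.ones((2, 8))                       # central hidden state
h_audio = np.full((2, 8), 2.0)              # unimodal backbone states
h_video = np.full((2, 8), 3.0)

out = centralnet_fuse(h_c, [h_audio, h_video], 0.5, [0.25, 0.25])
print(out[0, 0])   # 0.5*1 + 0.25*2 + 0.25*3 = 1.75
```

Setting a modality's weight near zero recovers late fusion for that stream, while large early-layer weights approximate early fusion, which is the sense in which the learned scalars interpolate between the two regimes.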
While DeepStack specializes the multi-level fusion paradigm to sequential transformer models and high-resolution visual streaming, both methods demonstrate that distributing fusion points throughout the network depth can outperform both early and late single-point fusion. DeepStack’s empirical findings in vision-language settings parallel CentralNet’s observations across multimodal audio-visual and text-image domains, reinforcing the generality and effectiveness of progressive, layered fusion (Meng et al., 2024, Vielzeuf et al., 2018).