
DeepStack-L: Efficient LMM Design

Updated 6 December 2025
  • DeepStack-L is a variant of the DeepStack architecture that partitions visual tokens into groups and injects each group at different transformer layers via residual addition.
  • It replaces full visual-token prefix injection with a layer-wise scheme, reducing attention cost from roughly $L\,(C+V)^2$ to $N(C+g)^2 + (L-N)\,C^2$, saving 70–80% of FLOPs and memory.
  • Empirical evaluations on VQA benchmarks show DeepStack-L closing over 90% of the performance gap to full-length models while using a much shorter context length.

DeepStack-L is a variant of the DeepStack architecture for large multimodal models (LMMs), designed to address the computation and memory efficiency bottlenecks inherent in conventional LMM designs that process extensive visual token sequences. Unlike standard approaches, which inject the full set of visual tokens as a prefix to the first transformer layer of an LLM, DeepStack-L partitions visual tokens into multiple groups and injects each group at a different transformer layer via residual addition. This layer-wise stacking scheme enables efficient utilization of the LLM’s capacity without modifications to the underlying model architecture, resulting in significant savings in floating-point operations (FLOPs), memory consumption, and context length required for high-resolution vision-language tasks (Meng et al., 2024).

1. Architecture and Layer-wise Injection Mechanism

For DeepStack-L, the visual encoder produces $V$ visual tokens $X \in \mathbb{R}^{V \times d}$, where $d$ is the hidden size. These tokens are divided into $N$ equally sized groups $X^{(i)}$ of size $g = V/N$. Each group is aligned and injected into one of the first $N$ decoder layers of the LLM. Specifically, for layers $\ell_1, \ldots, \ell_N$ (typically contiguous), token group $X^{(i)}$ is added via residual connection to the hidden states at the designated visual prefix token indices.

The pseudocode for the core injection loop is as follows:

def forward(H0, X_stack, l0, step, vis_pos):
    # H0: input token embeddings (text prompt plus visual prefix positions)
    # X_stack: list of N groups of visual tokens
    # vis_pos: indices of the visual prefix tokens within the sequence
    H = H0
    for idx, layer in enumerate(LLM_layers):
        if idx >= l0 and (idx - l0) % step == 0:
            group_idx = (idx - l0) // step          # integer group index
            if group_idx < len(X_stack):            # only the first N eligible layers receive a group
                H[vis_pos] = H[vis_pos] + X_stack[group_idx]  # residual infusion
        H = layer(H)
    return H

Typically, $l_0 = 0$ and $step = 1$, so each of the first $N$ layers receives exactly one group. The remaining $L - N$ layers of the LLM operate solely on text or processed features.
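To make the injection loop concrete, the following usage sketch (PyTorch; all sizes, the identity stand-in layers, and the prefix layout are illustrative assumptions rather than values from the paper) splits the encoder output into $N$ groups with torch.chunk and calls the forward function above:

import torch
import torch.nn as nn

# Illustrative sizes only, not the paper's configuration.
d, T, V, N = 64, 16, 32, 4                # hidden size, text tokens, visual tokens, groups
g = V // N

# Stand-in decoder "layers"; a real LLM would supply its transformer blocks here.
# LLM_layers is the module-level name that forward() reads.
LLM_layers = [nn.Identity() for _ in range(8)]

X = torch.randn(V, d)                     # visual tokens from the encoder
X_stack = list(torch.chunk(X, N, dim=0))  # N groups of g tokens each

H0 = torch.randn(g + T, d)                # g visual prefix slots followed by T text tokens (assumed layout)
vis_pos = torch.arange(g)                 # indices of the visual prefix positions

H_out = forward(H0, X_stack, l0=0, step=1, vis_pos=vis_pos)
print(H_out.shape)                        # torch.Size([24, 64])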

A block-diagram overview:

  • Visual encoder produces $X$ ($V$ tokens)
    • $X$ is split into $X^{(1)}, \ldots, X^{(N)}$
  • [Text-only Layer 0]
    • $+X^{(1)}$ ➔ Layer 1 ➔ $+X^{(2)}$ ➔ Layer 2 … $+X^{(N)}$ ➔ Layer $N$ ➔ Layers $N+1$–$L$ ➔ Output

2. Mathematical and Computational Formalism

Let $V$ be the total number of visual tokens, $N$ the number of groups/layers for injection, and $C$ the context length for textual tokens. Index sets $I_i$ partition $X$ as:

$$I_i = \{ (i-1)\cdot g + 1, \ldots, i\cdot g \}, \quad X^{(i)} = X[I_i, :]$$

For transformer layers $P_\ell$ and hidden states $H^{(\ell)}$, injection is formalized as:

  • $H^{(0)} = [X; \text{TextPrompt}]$
  • For $\ell = 1, \ldots, L$:
    • If $\ell \in \{\ell_1, \ldots, \ell_N\}$, say $\ell = \ell_i$, then $H^{(\ell-1)} \leftarrow H^{(\ell-1)} + [0; X^{(i)}]$ (applied only at the visual-token positions)
    • Else, $H^{(\ell-1)}$ is passed unchanged

Each $P_\ell$ applies standard self-attention without new cross-attention modules or trainable parameters. Thus, the only architectural overhead is the bookkeeping of injection indices.

From a complexity perspective, standard LMMs incur a per-layer computational cost of approximately $(C+V)^2 \cdot d$ over $L$ layers; DeepStack-L reduces this to $N \cdot (C+g)^2 \cdot d + (L - N) \cdot C^2 \cdot d$. For practical parameters (e.g., $C=576$, $V=2880$, $N=4$, $g=720$), compute savings are approximately 70–80% compared to baseline LMMs, and memory requirements decrease proportionally, since only a subset of visual tokens is present in each injection window.
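A back-of-the-envelope evaluation of these cost expressions is sketched below. It tallies only the quadratic attention term and the linear token-count term (per hidden dimension), so it brackets rather than reproduces the quoted end-to-end figure; $L=32$ is taken from the 32-layer baseline described in Section 3.

# Back-of-the-envelope evaluation of the cost expressions above.
# C, V, N, g follow the example parameters in the text; L = 32 matches the
# 32-layer Vicuna-7B / LLaVA-1.5 baseline referenced in Section 3.
C, V, N, g, L = 576, 2880, 4, 720, 32

baseline_attn  = L * (C + V) ** 2                      # quadratic attention term, full prefix every layer
deepstack_attn = N * (C + g) ** 2 + (L - N) * C ** 2   # quadratic term with layer-wise injection

baseline_tok  = L * (C + V)                            # linear per-token term (e.g., FFN cost per layer)
deepstack_tok = N * (C + g) + (L - N) * C

print(f"attention-term reduction: {1 - deepstack_attn / baseline_attn:.1%}")  # ~95.8%
print(f"token-count reduction:    {1 - deepstack_tok / baseline_tok:.1%}")    # ~80.7%
# At 7B scale the per-token projection/FFN cost usually dominates total FLOPs,
# so the overall saving tracks the token-count reduction, in line with the
# 70-80% figure quoted in the text.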

3. Implementation Hyperparameters and Details

For Vicuna-7B integration with DeepStack-L:

  • $N=4$ groups of $g=720$ tokens (total $V=2880$)
  • $l_0=0$, $step=1$, so injections occur at layers 1, 2, 3, 4 (32-layer LLaVA-1.5 baseline)
  • $C=576$ context length (only 576 tokens appear as the “prefix” in the input; all 2880 tokens are seen via stacking)
  • In pretraining (PT), only the 1-layer projection head is optimized; the LLM is kept frozen.
  • In supervised fine-tuning (SFT), the LLM is unfrozen, new token embeddings are initialized to zero, the warmup ratio is 3%, and the learning rate is $2\times 10^{-5}$.

No additional parameters or model modifications are required—DeepStack-L uses the projection MLP from LLaVA-1.5 and introduces zero net increase in parameter count. The sole “module” addition is a management layer for residual fusion.
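For reference, these settings can be gathered into a single configuration object. The sketch below is hypothetical; the field names are illustrative rather than taken from the released implementation, and the values simply record the hyperparameters listed above.

from dataclasses import dataclass

# Hypothetical configuration container; field names are illustrative,
# values follow the hyperparameters listed in this section.
@dataclass
class DeepStackLConfig:
    num_groups: int = 4             # N
    group_size: int = 720           # g, so V = N * g = 2880
    context_length: int = 576       # C
    inject_start_layer: int = 0     # l0
    inject_step: int = 1            # step -> injections at the first N decoder layers
    freeze_llm_in_pretraining: bool = True
    sft_learning_rate: float = 2e-5
    sft_warmup_ratio: float = 0.03
    init_new_token_embeddings_to_zero: bool = True

cfg = DeepStackLConfig()
print(cfg.num_groups * cfg.group_size)  # 2880 visual tokens handled per image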

4. Empirical Evaluation

DeepStack-L exhibits pronounced gains over conventional LMMs, with performance on text-oriented visual question answering (VQA) tasks and general large multimodal benchmarks summarized below for 7B-parameter models with $C=576$:

Task        LLaVA-1.5   DeepStack-L   Δ
TextVQA     58.2        62.4          +4.2
DocVQA      28.1        39.1          +11.0
InfoVQA     25.8        29.8          +4.0

Dataset     LLaVA-1.5   DeepStack-L   Δ
VQAv2       78.5        79.5          +1.0
GQA         62.0        63.1          +1.1
SEED        58.6        60.6          +2.0
POPE        85.9        86.7          +0.8

With only one-fifth the context length ($C=576$ versus $C=2880$), DeepStack-L closes more than 90% of the performance gap relative to full-length LLaVA-Next on text-oriented VQA, while requiring roughly 20% of the sequence input size. This effect is amplified on high-resolution tasks (TextVQA, DocVQA, InfoVQA), where DeepStack-L yields absolute improvements of 4.2, 11.0, and 4.0 percentage points, respectively, over the LLaVA-1.5-7B baseline.

5. Practical Recommendations

Key guidelines for integrating DeepStack-L with projector-based LMMs:

  • For high-volume visual tokens (e.g., multi-crop, video frames), maintain a small LLM context window $C$ while setting $N \approx 4$–$8$ to partition $V$ into manageable groups.
  • Prefer a group size $g = V/N$ close to $C$ (to keep the window size modest).
  • Use spatially consistent or dilated token grouping so that each $X^{(i)}$ covers unique, contiguous image regions (see the sketch after this list).
  • For 4K or higher resolutions, increase $N$ or reduce $g$ to satisfy $(C+g) \lesssim 2C$.
  • Default to simple residual addition for fusion (no gating required), initializing injected token embeddings to zero for training stability.
  • Optionally, fine-tune the vision encoder with a low learning rate (e.g., $1\times 10^{-6}$) to adapt features for layer-wise stacking.
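One possible realization of the dilated grouping recommendation is sketched below. The stride pattern, grid size, and hidden size are illustrative assumptions; the original implementation may group tokens differently.

import torch

def dilated_groups(tokens_2d, n_side):
    """Split an (Hp, Wp, d) grid of visual tokens into n_side*n_side dilated groups.

    Each group takes every n_side-th token along both axes with a different offset,
    so together the groups tile the full grid without overlap. This is one possible
    grouping strategy; contiguous block grouping is an equally valid alternative.
    """
    Hp, Wp, d = tokens_2d.shape
    groups = []
    for dy in range(n_side):
        for dx in range(n_side):
            g = tokens_2d[dy::n_side, dx::n_side, :].reshape(-1, d)
            groups.append(g)
    return groups  # list of N = n_side**2 groups, each of size (Hp*Wp)/N tokens

# Example: a 48x60 grid of high-resolution tokens (2880 total), split into N = 4 groups of 720.
tokens = torch.randn(48, 60, 64)           # hidden size 64 is illustrative
X_stack = dilated_groups(tokens, n_side=2)
print(len(X_stack), X_stack[0].shape)      # 4 torch.Size([720, 64])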

This methodology is compatible with LMMs such as LLaVA, Vicuna, and Phi-3, providing substantial effective visual token capacity expansion with minimal compute or parameter overhead (Meng et al., 2024).
