
DeepStack-L: Efficient LMM Design

Updated 6 December 2025
  • DeepStack-L is a variant of the DeepStack architecture that partitions visual tokens into groups and injects each group at different transformer layers via residual addition.
  • It replaces full visual-token prefix injection with a layer-wise scheme, reducing attention cost from roughly $L\,(C+V)^2$ to $N(C+g)^2 + (L-N)\,C^2$, saving 70–80% of FLOPs and memory.
  • Empirical evaluations on VQA benchmarks show DeepStack-L closing over 90% of the performance gap to full-length models while using a much shorter context length.

DeepStack-L is a variant of the DeepStack architecture for large multimodal models (LMMs), designed to address the computation and memory efficiency bottlenecks inherent in conventional LMM designs that process extensive visual token sequences. Unlike standard approaches, which inject the full set of visual tokens as a prefix to the first transformer layer of an LLM, DeepStack-L partitions visual tokens into multiple groups and injects each group at a different transformer layer via residual addition. This layer-wise stacking scheme enables efficient utilization of the LLM’s capacity without modifications to the underlying model architecture, resulting in significant savings in floating-point operations (FLOPs), memory consumption, and context length required for high-resolution vision-language tasks (Meng et al., 2024).

1. Architecture and Layer-wise Injection Mechanism

For DeepStack-L, the visual encoder produces $V$ visual tokens $X \in \mathbb{R}^{V \times d}$, where $d$ is the hidden size. These tokens are divided into $N$ equally sized groups $X^{(i)}$ of size $g = V/N$. Each group is aligned and injected into one of the first $N$ decoder layers of the LLM. Specifically, for layers $\ell_1, \ldots, \ell_N$ (typically contiguous), token group $X^{(i)}$ is added via residual connection to the hidden states at the designated visual prefix token indices.

The pseudocode for the core injection loop is as follows:

def forward(H0, X_stack, l0, step, vis_pos):
    # H0: input token embeddings (text prompt plus visual prefix positions)
    # X_stack: list of N groups of visual tokens
    # vis_pos: indices of the visual prefix tokens within the sequence
    H = H0
    for idx, layer in enumerate(LLM_layers):
        if idx >= l0 and (idx - l0) % step == 0:
            group_idx = (idx - l0) // step          # integer group index
            if group_idx < len(X_stack):            # only the first N eligible layers receive a group
                H[vis_pos] = H[vis_pos] + X_stack[group_idx]  # residual infusion
        H = layer(H)
    return H

Typically, $l_0 = 0$ and $step = 1$, so each of the first $N$ layers receives exactly one group. The remaining $L - N$ layers of the LLM operate solely on text or processed features.
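To make the injection loop concrete, the following usage sketch (PyTorch; all sizes, the identity stand-in layers, and the prefix layout are illustrative assumptions rather than values from the paper) splits the encoder output into $N$ groups with torch.chunk and calls the forward function above:

import torch
import torch.nn as nn

# Illustrative sizes only, not the paper's configuration.
d, T, V, N = 64, 16, 32, 4                # hidden size, text tokens, visual tokens, groups
g = V // N

# Stand-in decoder "layers"; a real LLM would supply its transformer blocks here.
# LLM_layers is the module-level name that forward() reads.
LLM_layers = [nn.Identity() for _ in range(8)]

X = torch.randn(V, d)                     # visual tokens from the encoder
X_stack = list(torch.chunk(X, N, dim=0))  # N groups of g tokens each

H0 = torch.randn(g + T, d)                # g visual prefix slots followed by T text tokens (assumed layout)
vis_pos = torch.arange(g)                 # indices of the visual prefix positions

H_out = forward(H0, X_stack, l0=0, step=1, vis_pos=vis_pos)
print(H_out.shape)                        # torch.Size([24, 64])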

A block-diagram overview:

  • Visual encoder produces $X$ ($V$ tokens)
    • $X$ is split into $X^{(1)}, \ldots, X^{(N)}$
  • [Text-only Layer 0]
    • $+X^{(1)}$ ➔ Layer 1 ➔ $+X^{(2)}$ ➔ Layer 2 … $+X^{(N)}$ ➔ Layer $N$ ➔ Layers $N+1$–$L$ ➔ Output

2. Mathematical and Computational Formalism

Let $V$ be the total number of visual tokens, $N$ the number of groups/layers for injection, and $C$ the context length for textual tokens. Index sets $I_i$ partition $X$ as:

$$I_i = \{ (i-1)\cdot g + 1, \ldots, i\cdot g \}, \quad X^{(i)} = X[I_i, :]$$

For transformer layers $P_\ell$ and hidden states $H^{(\ell)}$, injection is formalized as:

  • $H^{(0)} = [X; \text{TextPrompt}]$
  • For $\ell = 1, \ldots, L$:
    • If $\ell \in \{\ell_1, \ldots, \ell_N\}$, say $\ell = \ell_i$, then $H^{(\ell-1)} \leftarrow H^{(\ell-1)} + [0; X^{(i)}]$ (applied only at the visual-token positions)
    • Else, $H^{(\ell-1)}$ is passed unchanged

Each $P_\ell$ applies standard self-attention without new cross-attention modules or trainable parameters. Thus, the only architectural overhead is the bookkeeping of injection indices.

From a complexity perspective, standard LMMs incur a per-layer computational cost of approximately $(C+V)^2 \cdot d$ over $L$ layers; DeepStack-L reduces this to $N \cdot (C+g)^2 \cdot d + (L - N) \cdot C^2 \cdot d$. For practical parameters (e.g., $C=576$, $V=2880$, $N=4$, $g=720$), compute savings are approximately 70–80% compared to baseline LMMs, and memory requirements decrease proportionally, since only a subset of visual tokens is present in each injection window.
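A back-of-the-envelope evaluation of these cost expressions is sketched below. It tallies only the quadratic attention term and the linear token-count term (per hidden dimension), so it brackets rather than reproduces the quoted end-to-end figure; $L=32$ is taken from the 32-layer baseline described in Section 3.

# Back-of-the-envelope evaluation of the cost expressions above.
# C, V, N, g follow the example parameters in the text; L = 32 matches the
# 32-layer Vicuna-7B / LLaVA-1.5 baseline referenced in Section 3.
C, V, N, g, L = 576, 2880, 4, 720, 32

baseline_attn  = L * (C + V) ** 2                      # quadratic attention term, full prefix every layer
deepstack_attn = N * (C + g) ** 2 + (L - N) * C ** 2   # quadratic term with layer-wise injection

baseline_tok  = L * (C + V)                            # linear per-token term (e.g., FFN cost per layer)
deepstack_tok = N * (C + g) + (L - N) * C

print(f"attention-term reduction: {1 - deepstack_attn / baseline_attn:.1%}")  # ~95.8%
print(f"token-count reduction:    {1 - deepstack_tok / baseline_tok:.1%}")    # ~80.7%
# At 7B scale the per-token projection/FFN cost usually dominates total FLOPs,
# so the overall saving tracks the token-count reduction, in line with the
# 70-80% figure quoted in the text.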

3. Implementation Hyperparameters and Details

For Vicuna-7B integration with DeepStack-L:

  • $N=4$ groups of $g=720$ tokens (total $V=2880$)
  • $l_0=0$, $step=1$, so injections occur at layers 1, 2, 3, 4 (32-layer LLaVA-1.5 baseline)
  • $C=576$ context length (only 576 tokens appear as the “prefix” in the input; all 2880 tokens are seen via stacking)
  • In pretraining (PT), only the 1-layer projection head is optimized; the LLM is kept frozen.
  • In supervised fine-tuning (SFT), the LLM is unfrozen, new token embeddings are initialized to zero, the warmup ratio is 3%, and the learning rate is $2\times 10^{-5}$.

No additional parameters or model modifications are required—DeepStack-L uses the projection MLP from LLaVA-1.5 and introduces zero net increase in parameter count. The sole “module” addition is a management layer for residual fusion.
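For reference, these settings can be gathered into a single configuration object. The sketch below is hypothetical; the field names are illustrative rather than taken from the released implementation, and the values simply record the hyperparameters listed above.

from dataclasses import dataclass

# Hypothetical configuration container; field names are illustrative,
# values follow the hyperparameters listed in this section.
@dataclass
class DeepStackLConfig:
    num_groups: int = 4             # N
    group_size: int = 720           # g, so V = N * g = 2880
    context_length: int = 576       # C
    inject_start_layer: int = 0     # l0
    inject_step: int = 1            # step -> injections at the first N decoder layers
    freeze_llm_in_pretraining: bool = True
    sft_learning_rate: float = 2e-5
    sft_warmup_ratio: float = 0.03
    init_new_token_embeddings_to_zero: bool = True

cfg = DeepStackLConfig()
print(cfg.num_groups * cfg.group_size)  # 2880 visual tokens handled per image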

4. Empirical Evaluation

DeepStack-L exhibits pronounced gains over conventional LMMs, with performance on text-oriented visual question answering (VQA) tasks and general large multimodal benchmarks summarized below for 7B-parameter models with $C=576$:

Task        LLaVA-1.5   DeepStack-L   Δ
TextVQA     58.2        62.4          +4.2
DocVQA      28.1        39.1          +11.0
InfoVQA     25.8        29.8          +4.0

Dataset     LLaVA-1.5   DeepStack-L   Δ
VQAv2       78.5        79.5          +1.0
GQA         62.0        63.1          +1.1
SEED        58.6        60.6          +2.0
POPE        85.9        86.7          +0.8

With only one-fifth the context length ($C=576$ versus $C=2880$), DeepStack-L closes more than 90% of the performance gap relative to full-length LLaVA-Next on text-oriented VQA, while requiring roughly 20% of the sequence input size. This effect is amplified on high-resolution tasks (TextVQA, DocVQA, InfoVQA), where DeepStack-L yields absolute improvements of 4.2, 11.0, and 4.0 percentage points, respectively, over the LLaVA-1.5-7B baseline.

5. Practical Recommendations

Key guidelines for integrating DeepStack-L with projector-based LMMs:

  • For high-volume visual tokens (e.g., multi-crop, video frames), maintain a small LLM context window $C$ while setting $N \approx 4$–$8$ to partition $V$ into manageable groups.
  • Prefer a group size $g = V/N$ close to $C$ (to keep the window size modest).
  • Use spatially consistent or dilated token grouping so that each $X^{(i)}$ covers unique, contiguous image regions (see the sketch after this list).
  • For 4K or higher resolutions, increase $N$ or reduce $g$ to satisfy $(C+g) \lesssim 2C$.
  • Default to simple residual addition for fusion (no gating required), initializing injected token embeddings to zero for training stability.
  • Optionally, fine-tune the vision encoder with a low learning rate (e.g., $1\times 10^{-6}$) to adapt features for layer-wise stacking.
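One possible realization of the dilated grouping recommendation is sketched below. The stride pattern, grid size, and hidden size are illustrative assumptions; the original implementation may group tokens differently.

import torch

def dilated_groups(tokens_2d, n_side):
    """Split an (Hp, Wp, d) grid of visual tokens into n_side*n_side dilated groups.

    Each group takes every n_side-th token along both axes with a different offset,
    so together the groups tile the full grid without overlap. This is one possible
    grouping strategy; contiguous block grouping is an equally valid alternative.
    """
    Hp, Wp, d = tokens_2d.shape
    groups = []
    for dy in range(n_side):
        for dx in range(n_side):
            g = tokens_2d[dy::n_side, dx::n_side, :].reshape(-1, d)
            groups.append(g)
    return groups  # list of N = n_side**2 groups, each of size (Hp*Wp)/N tokens

# Example: a 48x60 grid of high-resolution tokens (2880 total), split into N = 4 groups of 720.
tokens = torch.randn(48, 60, 64)           # hidden size 64 is illustrative
X_stack = dilated_groups(tokens, n_side=2)
print(len(X_stack), X_stack[0].shape)      # 4 torch.Size([720, 64])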

This methodology is compatible with LMMs such as LLaVA, Vicuna, and Phi-3, providing substantial effective visual token capacity expansion with minimal compute or parameter overhead (Meng et al., 2024).
