DeepStack-L: Efficient LMM Design
- DeepStack-L is a variant of the DeepStack architecture that partitions visual tokens into groups and injects each group into a different transformer layer via residual addition.
- It replaces full visual-token injection at the input with a layer-wise scheme, reducing attention cost from $L\,(C+V)^2$ to $N\,(C+g)^2 + (L-N)\,C^2$ and saving roughly 70–80% of FLOPs and memory.
- Empirical evaluations on VQA benchmarks show that DeepStack-L closes over 90% of the performance gap to full-length models with much lower context-length requirements.
DeepStack-L is a variant of the DeepStack architecture for large multimodal models (LMMs), designed to address the computation and memory efficiency bottlenecks inherent in conventional LMM designs that process extensive visual token sequences. Unlike standard approaches, which inject the full set of visual tokens as a prefix to the first transformer layer of an LLM, DeepStack-L partitions visual tokens into multiple groups and injects each group at a different transformer layer via residual addition. This layer-wise stacking scheme enables efficient utilization of the LLM's capacity without modifications to the underlying model architecture, resulting in significant savings in floating-point operations (FLOPs), memory consumption, and context length required for high-resolution vision-language tasks (Meng et al., 2024).
1. Architecture and Layer-wise Injection Mechanism
For DeepStack-L, the visual encoder produces visual tokens $X \in \mathbb{R}^{V \times d}$, where $d$ is the hidden size. These tokens are divided into $N$ equally sized groups $X_1, \dots, X_N$ of size $g = V/N$. Each group is aligned and injected into one of the first $N$ decoder layers of the LLM. Specifically, for layers $l_0, l_0 + 1, \dots, l_0 + N - 1$ (typically contiguous), token group $X_{i+1}$ is added via residual connection to the hidden states of layer $l_0 + i$ at designated visual prefix token indices.
The pseudocode for the core injection loop is as follows:
```python
def forward(H0, X_stack, l0, step, vis_pos):
    # H0:      input embeddings (text plus the visual prefix slots)
    # X_stack: list of N groups of visual tokens, one per injection layer
    # l0:      first injection layer; step: layer stride between injections
    # vis_pos: sequence positions of the visual prefix tokens
    # LLM_layers: global list of decoder layers (as in the original pseudocode)
    H = H0
    for idx, layer in enumerate(LLM_layers):
        if idx >= l0 and (idx - l0) % step == 0:
            group_idx = (idx - l0) // step        # integer division, not /
            if group_idx < len(X_stack):          # stop after the N-th group
                H[vis_pos] = H[vis_pos] + X_stack[group_idx]  # residual infusion
        H = layer(H)
    return H
```
Typically, $l_0 = 1$ and $\text{step} = 1$, so each of the first $N$ injection layers receives exactly one group (a toy invocation appears after the diagram below). The remaining $L - N$ layers of the LLM operate solely on text or processed features.
A block-diagram overview:
- Visual encoder produces $X$ ($V$ tokens)
- $X$ is split into $X_1, \dots, X_N$
- [Text-only Layer 0]
- $+X_1$ ➔ Layer 1 ➔ $+X_2$ ➔ Layer 2 … ➔ $+X_N$ ➔ Layer N ➔ Layers $N{+}1, \dots, L$ ➔ Output
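To make the loop concrete, here is a toy invocation with stand-in linear layers and made-up shapes; everything below (layer count, dimensions, the zero-initialized prefix) is an illustrative assumption, not the published configuration:

```python
import torch

d, C_text, N, g, L = 64, 32, 4, 16, 8     # toy hidden size, text length, groups, group size, layers
LLM_layers = [torch.nn.Linear(d, d) for _ in range(L)]  # stand-ins for decoder layers

X = torch.randn(N * g, d)                 # V = N * g projected visual tokens
X_stack = list(X.chunk(N, dim=0))         # one group per injection layer

H0 = torch.cat([torch.zeros(g, d),        # zero-initialized visual prefix slots
                torch.randn(C_text, d)])  # text token embeddings
vis_pos = torch.arange(g)                 # positions of the visual prefix

out = forward(H0, X_stack, l0=1, step=1, vis_pos=vis_pos)
print(out.shape)                          # torch.Size([48, 64])
```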
2. Mathematical and Computational Formalism
Let $V$ be the total number of visual tokens, $N$ the number of groups/layers for injection, and $C$ the context length for textual tokens. Index sets $\mathcal{I}_1, \dots, \mathcal{I}_N$ partition $\{1, \dots, V\}$ as: $\mathcal{I}_i \cap \mathcal{I}_j = \emptyset$ for $i \neq j$ and $\bigcup_{i=1}^{N} \mathcal{I}_i = \{1, \dots, V\}$, with $X_i = X[\mathcal{I}_i]$.
For transformer layers $l = 0, \dots, L-1$ and hidden states $H^{(l)}$, injection is formalized as:
- For each layer $l$:
- If $l \in \{l_0, \dots, l_0 + N - 1\}$, say $l = l_0 + i$, then $H^{(l)}[\text{vis\_pos}] \leftarrow H^{(l)}[\text{vis\_pos}] + X_{i+1}$ (only on visual-token positions)
- Else, $H^{(l)}$ passes to the layer unchanged
Each layer applies standard self-attention, without new cross-attention modules or trainable parameters. Thus, the only architectural overhead is the bookkeeping of injection indices.
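The two cases can also be written as one masked residual update; the scatter operator $P$ below is notation introduced here for compactness, not part of the original formulation:

$$
H^{(l+1)} = \mathrm{Layer}_l\!\left( H^{(l)} + \mathbf{1}\big[\, l = l_0 + i,\; 0 \le i < N \,\big]\, P X_{i+1} \right),
$$

where $P$ places the tokens of group $X_{i+1}$ at the positions $\text{vis\_pos}$ and is zero elsewhere.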
From a complexity perspective, standard LMMs incur a per-layer attention cost of approximately $(C+V)^2$ over $L$ layers, for a total of $L\,(C+V)^2$; DeepStack-L reduces this to $N\,(C+g)^2 + (L-N)\,C^2$. For practical parameters (e.g., $L = 32$, $N = 4$, $V = 2880$, $g = V/N = 720$, $C = 576$), compute savings are approximately 70–80% compared to baseline LMMs, and memory requirements decrease proportionally, since only a subset of visual tokens is present in each injection window.
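A quick numeric check of these formulas, counting only the quadratic attention term (the values below mirror the practical parameters above; end-to-end FLOPs also include MLP cost, which is linear in sequence length, so measured savings land nearer the 70–80% figure):

```python
# Attention-only cost comparison between a full-prefix LMM and DeepStack-L.
L, N = 32, 4            # total layers, injection layers
C, V = 576, 2880        # text context length, total visual tokens
g = V // N              # visual tokens per injected group

baseline  = L * (C + V) ** 2                      # full-prefix LMM
deepstack = N * (C + g) ** 2 + (L - N) * C ** 2   # layer-wise injection
print(f"quadratic-term ratio: {deepstack / baseline:.1%}")  # ~4.2%
```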
3. Implementation Hyperparameters and Details
For Vicuna-7B integration with DeepStack-L:
- $N = 4$ groups of $g = 576$ tokens each, stacked on top of a 576-token input prefix (total $V = 2880$ visual tokens)
- $l_0 = 1$, $\text{step} = 1$, so injections occur at layers 1, 2, 3, and 4 (32-layer LLaVA-1.5 baseline)
- context length $C = 576$ (only 576 tokens appear as a "prefix" in the input; all 2880 tokens are seen via stacking)
- In pretraining (PT), only the 1-layer projection head is optimized; the LLM is kept frozen.
- In supervised fine-tuning (SFT), the LLM is unfrozen, new token embeddings are initialized to zero, and training uses a 3% warmup ratio with a low base learning rate.
No additional parameters or model modifications are required: DeepStack-L reuses the projection MLP from LLaVA-1.5 and introduces zero net increase in parameter count. The sole "module" addition is the bookkeeping logic for residual fusion.
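A minimal sketch of this two-stage schedule, assuming a LLaVA-style `mm_projector` attribute (the attribute name is an assumption, not verified against the DeepStack release):

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: str) -> None:
    # model.mm_projector: assumed LLaVA-style projection head attribute.
    if stage == "pretrain":
        model.requires_grad_(False)               # freeze LLM (and vision tower)
        model.mm_projector.requires_grad_(True)   # train only the projection head
    elif stage == "sft":
        model.requires_grad_(True)                # unfreeze the LLM for fine-tuning
```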
4. Empirical Evaluation
DeepStack-L exhibits pronounced gains over conventional LMMs; performance on text-oriented visual question answering (VQA) tasks and general multimodal benchmarks is summarized below for 7B-parameter models:
| Task | LLaVA-1.5 | DeepStack-L | Δ |
|---|---|---|---|
| TextVQA | 58.2 | 62.4 | +4.2 |
| DocVQA | 28.1 | 39.1 | +11.0 |
| InfoVQA | 25.8 | 29.8 | +4.0 |
| Dataset | LLaVA-1.5 | DeepStack-L | Δ |
|---|---|---|---|
| VQAv2 | 78.5 | 79.5 | +1.0 |
| GQA | 62.0 | 63.1 | +1.1 |
| SEED | 58.6 | 60.6 | +2.0 |
| POPE | 85.9 | 86.7 | +0.8 |
With only one-fifth the context length ($576$ versus $2880$), DeepStack-L closes more than 90% of the performance gap relative to full-length LLaVA-Next on text-oriented VQA, while requiring roughly 20% of the sequence input size. This effect is amplified on high-resolution tasks (TextVQA, DocVQA, InfoVQA), where DeepStack-L yields absolute improvements of 4.2, 11.0, and 4.0 percentage points, respectively, over the LLaVA-1.5-7B baseline.
5. Practical Recommendations
Key guidelines for integrating DeepStack-L with projector-based LMMs:
- For high volumes of visual tokens (e.g., multi-crop, video frames), maintain a small LLM context window while setting $N = 4$–$8$ to partition $X$ into manageable groups.
- Prefer the group size $g = V/N$ to be close to $C$ (to keep the attention window size modest).
- Use spatially consistent or dilated token grouping so that each group $X_i$ covers a unique set of image positions, either contiguous regions or interleaved dilated samples (a dilated-grouping sketch follows this list).
- For 4K or higher resolutions, increase $N$ or reduce $g$ to keep the per-layer injection cost within budget.
- Default to simple residual addition for fusion (no gating required), initializing injected token embeddings to zero for training stability.
- Optionally, fine-tune the vision encoder with a low learning rate to adapt features for layer-wise stacking.
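One way to realize the dilated option from the grouping bullet above, sketched for a 2×2 dilation on an $h \times w$ token grid (the helper name and shapes are illustrative assumptions):

```python
import torch

def dilated_groups(X: torch.Tensor, h: int, w: int) -> list:
    # X: (h * w, d) visual tokens in row-major (raster) order.
    # Returns N = 4 interleaved groups; each samples every other row and
    # column at a distinct offset, so every group spans the full image.
    grid = X.view(h, w, -1)
    return [grid[i::2, j::2].reshape(-1, X.shape[-1])
            for i in (0, 1) for j in (0, 1)]
```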
This methodology is compatible with LMMs such as LLaVA, Vicuna, and Phi-3, providing substantial effective visual token capacity expansion with minimal compute or parameter overhead (Meng et al., 2024).