FastV: Efficient Acceleration for Vision-Language Models

Updated 20 February 2026
  • FastV is a comprehensive approach that prunes redundant visual tokens in LVLMs to achieve up to 36% latency reduction without significant accuracy loss.
  • It employs a two-stage process with early full integration followed by importance-based token selection, resulting in up to a 45% reduction in computational FLOPs.
  • FastV also extends to other domains like video diffusion and flow volume estimation, showcasing versatile acceleration strategies across multiple vision applications.

FastV is a term used for multiple algorithmically distinct approaches across machine learning, ranging from inference acceleration in large vision-language models (LVLMs), fast processing in vision-oriented state-space models, and stream-based flow volume estimation, to rapid, high-compression video diffusion models and retrieval-augmented generation (RAG) frameworks for video understanding. This article surveys the principal “FastV” systems in contemporary literature, emphasizing methodologies, quantitative performance gains, empirical findings, and limitations. The most referenced context, and the canonical use of “FastV” in recent LVLM systems, is the plug-and-play, training-free, mid-inference token-pruning strategy described in “An Image is Worth 1/2 Tokens After Layer 2” (Chen et al., 2024). Where multiple approaches or meanings are relevant, clear distinctions are maintained.

1. Definition and Motivation

In vision-language transformer architectures, especially LVLMs like LLaVA-1.5, QwenVL-Chat, and Video-LLaVA, input sequences contain a high proportion of visual (image or video frame) tokens. Empirical analyses demonstrate that attention to these tokens becomes negligible in deep transformer layers; for example, after layer 2, over 85% of overall attention is assigned to text “anchor” tokens, with the mean per-image-token attention ratio ε_img falling below 0.21% of that for prompts. This redundancy motivates pruning: retaining only a small, most-informative subset of visual tokens after an initial integration phase can yield substantial computational savings without sacrificing task performance (Chen et al., 2024).
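The redundancy analysis above can be reproduced in miniature. The sketch below, a hedged illustration rather than the authors' actual measurement code, computes the fraction of post-softmax attention mass that lands on visual tokens in one layer's attention tensor; the toy data biases attention toward the text "anchor" tokens to mimic the reported deep-layer behavior (all tensor shapes and the bias value are assumptions for the demo):

```python
import numpy as np

def visual_attention_fraction(attn, visual_idx):
    """Fraction of total attention mass assigned to visual tokens.

    attn: array of shape (heads, queries, keys) with post-softmax
    attention weights from one transformer layer.
    visual_idx: key positions occupied by visual (image) tokens.
    """
    total = attn.sum()                      # = heads * queries (rows sum to 1)
    visual = attn[:, :, visual_idx].sum()   # mass landing on visual tokens
    return float(visual / total)

# Toy example: 2 heads, 4 query positions, 6 key tokens
# (tokens 0-3 visual, 4-5 text), attention skewed toward text.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 6))
logits[:, :, 4:] += 3.0                     # bias toward text "anchor" tokens
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
frac = visual_attention_fraction(attn, visual_idx=np.arange(4))
```

Running the same measurement per layer on a real LVLM is how one would verify the reported drop in visual-token attention after the early layers.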

The term "FastV" has also been adopted by other research lines addressing model acceleration in orthogonal settings; these uses are surveyed in the later sections of this article.

2. Core Algorithm: FastV in Vision-LLMs

FastV (Chen et al., 2024) operates on LVLMs that pair a vision encoder with an LLM decoder (e.g., LLaVA, QwenVL-Chat, Video-LLaVA). Its workflow comprises two stages:

A. Early-layer Full Integration (Layers 1…K):

  • Unaltered multihead self-attention, mixing full sequences of text and visual tokens.
  • Information from images is aggregated into anchor tokens (prompt, instruction, or outputs).

B. Importance-based Visual Token Pruning (Layers K+1…T):

  • At pruning layer K, compute an importance score for each image token:

    φ_attn(x_n) = (1/H) · Σ_{i=1}^{N} Σ_{h=1}^{H} α_{h,i→n}^{(K)},

where α_{h,i→n}^{(K)} is the attention weight from output position i (head h) to visual token n.

  • Retain the top (1−R)·n visual tokens with the highest φ_attn; discard the rest.
  • All subsequent transformer layers process text tokens and only the retained visual set; masked tokens are dropped from both self-attention and FFN computations.

Pseudocode outlining core FastV inference is given as follows:

def FastV_Infer(model, image_tokens, text_tokens, K, R):
    V, TX = image_tokens, text_tokens
    # Stage A: full attention over text + visual tokens for layers 1..K
    for j in range(1, K + 1):
        V, TX = TransformerLayer_j(V, TX)
    # Stage B: score each visual token by the attention it receives at layer K
    phi = zeros(len(V))
    for h in range(H):                  # attention heads
        for i in range(N_out):          # output (query) positions
            for n in range(len(V)):     # visual (key) positions
                phi[n] += AttentionWeight(h, i, n)
    phi /= H                            # normalization; does not change the ranking
    # Keep the top (1 - R) fraction of visual tokens, drop the rest
    keep = TopKIndices(phi, k=int((1 - R) * len(V)))
    V_reduced = [V[n] for n in keep]
    # Layers K+1..T attend only to text + retained visual tokens
    for j in range(K + 1, T + 1):
        V_reduced, TX = TransformerLayer_j(V_reduced, TX)
    return GenerateOutputs(TX)
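The pruning step itself (scoring plus top-k selection) is straightforward to make concrete. The following NumPy sketch, an illustration rather than the authors' implementation, implements only Stage B's token selection given an attention tensor from the pruning layer; all shapes in the toy check are assumptions:

```python
import numpy as np

def fastv_prune(attn, visual_idx, R):
    """Score visual tokens by the mean attention they receive at layer K
    and return the positions of the top (1 - R) fraction to keep.

    attn: (heads, queries, keys) post-softmax attention weights.
    visual_idx: key positions holding visual tokens.
    R: pruning ratio; R = 0.5 drops half the visual tokens.
    """
    H, N_out, _ = attn.shape
    # phi[n]: attention flowing into visual token n, averaged over heads
    # and query positions (the scale does not affect the ranking)
    phi = attn[:, :, visual_idx].sum(axis=(0, 1)) / (H * N_out)
    k = int((1 - R) * len(visual_idx))
    keep = np.argsort(phi)[::-1][:k]        # indices into visual_idx
    return np.sort(visual_idx[keep])        # kept positions, original order

# Toy check: 576 visual tokens among 600 keys; R = 0.5 keeps 288.
rng = np.random.default_rng(1)
attn = rng.random((8, 16, 600))
attn /= attn.sum(axis=-1, keepdims=True)
kept = fastv_prune(attn, visual_idx=np.arange(576), R=0.5)
```

Returning the kept positions in their original order preserves the positional structure of the remaining visual tokens for the later layers.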

3. Quantitative Acceleration and Empirical Results

FastV yields significant computational and inference-time reductions:

  • In LLaVA-1.5-13B (T=32, n≈576 visual tokens), with (K=2, R=0.5), per-layer visual token count drops from 576 → 288 for layers 3…32. Equation (6) in (Chen et al., 2024) predicts a 45% theoretical FLOPs cut.
  • Benchmark data on an A40 GPU: vanilla 13B latency is 0.539 s/example; FastV reduces this to 0.341 s (36% speed-up).
  • Average accuracy across image/video captioning and VQA tasks is stable (e.g. 99.7 → 100.9 CIDEr on Nocaps, 82.0 → 81.3 Acc on A-OKVQA) at 55% FLOPs.
  • No retraining or architecture change is required; FastV is entirely plug-and-play.
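The theoretical FLOPs cut can be sanity-checked with a back-of-the-envelope model. The sketch below approximates per-layer decoder FLOPs as the sum of the QKV/output projections, attention score/value products, and the FFN, using LLaMA-13B-like dimensions (d = 5120, m = 13824) as assumptions; it also ignores text tokens, counting only the 576 visual tokens, so the result is an approximation rather than the paper's Equation (6):

```python
def layer_flops(n, d=5120, m=13824):
    """Approximate FLOPs for one decoder layer on n tokens:
    QKV/output projections (4*n*d^2), attention scores and
    values (2*n^2*d), and the FFN (2*n*d*m)."""
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m

def fastv_flops_ratio(n, R, K=2, T=32):
    """Theoretical FLOPs with FastV relative to the unpruned model:
    layers 1..K process all n tokens, layers K+1..T the pruned count."""
    n_pruned = int((1 - R) * n)
    full = T * layer_flops(n)
    pruned = K * layer_flops(n) + (T - K) * layer_flops(n_pruned)
    return pruned / full

# Visual tokens only (text tokens ignored), K = 2, R = 0.5
ratio = fastv_flops_ratio(n=576, R=0.5)
```

Under these simplifications the ratio comes out near 0.53, broadly consistent with the reported 55% FLOPs ratio (a 45% cut) once the unpruned text tokens are accounted for.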

Table: Key results from (Chen et al., 2024) (image/video tasks, 13B model, K=2, R=0.5):

| Task/Metric    | Baseline (full tokens) | FastV (K=2, R=0.5)  |
|----------------|------------------------|---------------------|
| FLOPs ratio    | 100%                   | 55%                 |
| Latency (s/ex) | 0.539                  | 0.341               |
| Main accuracy  | see text               | no significant drop |

FastV is especially effective for long visual sequences (high-res images, videos) due to their high redundancy.

One-shot Pruning: Strategies such as the “FastV (one-shot)” baseline implemented in DyCoke (Tao et al., 2024), which prune tokens at a single step, yield substantial speed-up (e.g., 1.32× at a 43% FLOPs ratio) but incur slightly more performance loss than DyCoke's dynamic pruning (1.54× at the same FLOPs ratio while fully recovering baseline accuracy; Table 1).

Dynamic Pruning (as in DyCoke): DyCoke (Tao et al., 2024) extends the FastV paradigm by introducing dynamic key-value cache pruning, re-inserting potentially relevant tokens during autoregressive decoding, and temporally merging redundant frame tokens prior to LLM input. This yields further memory and speed gains (up to 1.5×, 1.4× memory reduction) versus one-shot FastV, especially in video settings.

FSVideo and Vision-Mamba (FastVim): The “FastV” designation is also used to refer to Fast Vision Mamba, which accelerates parallel recurrent state-space models by spatial pooling, shrinking the effective context and halving parallel scan depth per block (Kapse et al., 1 Feb 2025). FSVideo (“Fast Speed Video Diffusion”) exploits highly compressed latent spaces and memory-efficient transformer blocks for rapid video synthesis, achieving up to 42× speed-up over standard 14B diffusion baselines (Team et al., 2 Feb 2026).

FastV-RAG: In RAG-based video QA, "FastV" refers to speculative decoding pipelines that accelerate retrieval-grounded answer synthesis by splitting expensive inference into multiple parallel lightweight drafts and a single heavyweight verification pass, leading to 2× speed-up with unchanged accuracy (Li et al., 4 Jan 2026).

5. Applications, Deployment, and Limitations

Scenarios:

  • LVLM edge deployment, where resource budgets (memory, compute, latency) are highly constrained.
  • Video-level reasoning tasks, where the number of input tokens is extremely large.

Practical Considerations:

  • FastV requires only the addition of a runtime token importance computation module; no training or fine-tuning is required.
  • Filtering layer K and pruning ratio R provide a tunable Pareto frontier; K=2, R=0.5 yields a strong trade-off point.
  • For high-redundancy inputs (video, long or high-resolution image sequences), FastV provides the greatest relative benefit.

Limitations:

  • Fine-grained tasks (OCR, small-object reasoning) may require a higher K (i.e., later pruning) and a lower R, as aggressive pruning can degrade accuracy on subtle cues.
  • Specialized attention kernels (e.g. FlashAttention, vLLM) may reduce observed gains, since non-token-compute overheads can dominate total latency.
  • Actual speedup varies with batch size, hardware, and model implementation.

6. Extensions: Other “FastV” Algorithms

Beyond transformer vision-language acceleration:

  • Network Flow Volume Estimation: FastV (FAST/WFAST) provides constant-time, one-sided error, and memory-efficient algorithms for stream/slide-based weighted flow analytics. Empirical evaluation demonstrates up to 2.4×–7× speed-up versus prior heavy-hitter solutions on real traces (Basat et al., 2017).
  • Speculative RAG Pipelines: FastV-RAG achieves 1.85×–2.43× speed-ups for VideoSimpleQA and Encyclopedic VQA tasks, preserving or improving answer accuracy via fast draft + single verify steps and CLIP-based entity alignment (Li et al., 4 Jan 2026).

7. Discussion and Future Directions

The FastV approach, whether interpreted as attention-based token pruning, dynamic visual context reduction, or fast draft-verify generative pipelines, has demonstrated wide empirical merit across multiple machine learning subfields:

  • In vision-LLMs, it exposes and exploits redundancies in the attention mechanisms, translating directly to practical speed and resource gains.
  • In state-space and diffusion-based models, analogous collapses of spatial/temporal redundancy (by spatial pooling, high-ratio latent compression, or similar) drive similar acceleration trends.
  • In knowledge-augmented video QA, speculative decoding frameworks (“FastV-RAG”) exemplify how architectural decompositions can balance throughput and accuracy.

A plausible implication is that as multimodal and generative model input contexts scale, runtime-efficient, training-free context reduction policies akin to FastV will become increasingly central for deployment at scale and for edge intelligence applications. Open research questions remain around optimal token importance heuristics, fully adaptive multi-stage pruning, and integrating FastV-style approaches with highly optimized inference libraries. Further, “FastV”’s meaning will continue to drift as the community re-applies its central principles of aggressive, data-driven context compression to broader model classes.
