Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM (2505.15816v1)

Published 21 May 2025 in cs.CV

Abstract: Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive computation on visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further. The code will be made public at https://github.com/penghao-wu/ProxyV.

Summary

  • The paper identifies and reduces computation redundancy in vision tokens within Large Multimodal Models using a novel method called ProxyV.
  • ProxyV uses a small set of proxy tokens for heavy computations, efficiently updating the full set of vision tokens via a lightweight mechanism to preserve fine-grained details.
  • Experiments demonstrate that ProxyV achieves substantial computational savings (a 36-46% reduction in prefill FLOPs) while maintaining or improving performance, particularly on tasks requiring fine-grained visual information.

This paper, "Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM" (2505.15816), addresses the significant computational burden imposed by vision tokens in Large Multimodal Models (LMMs), particularly those with a decoder-only architecture like LLaVA. Unlike traditional approaches that focus on reducing the number of vision tokens (token-level redundancy), this work identifies and exploits computation-level redundancy on these tokens. The core idea is that vision tokens, already processed by a vision encoder, might not require all the extensive computations (self-attention, FFNs) within every layer of the LLM decoder.

Current decoder-only LMMs concatenate vision tokens (after a projection layer) with text tokens and process them uniformly through multiple layers of self-attention and FFNs. This is computationally expensive, especially with high-resolution images or multiple images/videos, which can yield thousands of vision tokens. The quadratic complexity of self-attention exacerbates the problem. While token reduction methods (pruning or merging tokens) are popular, they risk losing fine-grained visual details crucial for tasks like OCR, document parsing, or detailed visual grounding. Furthermore, some token reduction methods based on text-to-image attention are incompatible with efficient attention implementations like FlashAttention.
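To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch (not from the paper) of per-layer prefill FLOPs; the token counts, hidden size, and FFN expansion factor are assumed values chosen only for illustration.

```python
# Back-of-the-envelope per-layer prefill cost for a decoder layer processing
# n_vis vision tokens plus n_txt text tokens of hidden size d with FFN
# expansion factor f (each multiply-add counted as 2 FLOPs).
def decoder_layer_flops(n_vis: int, n_txt: int, d: int = 4096, f: int = 4) -> float:
    n = n_vis + n_txt
    proj = 4 * (2 * n * d * d)        # Q/K/V/output projections
    attn = 2 * (2 * n * n * d)        # QK^T and attention-weighted V: quadratic in n
    ffn = 2 * (2 * n * d * f * d)     # FFN up- and down-projections
    return proj + attn + ffn

# Example: ~2880 vision tokens (a hypothetical high-resolution AnyRes image) vs. 100 text tokens.
full = decoder_layer_flops(2880, 100)
text_only = decoder_layer_flops(0, 100)
print(f"vision tokens account for ~{100 * (1 - text_only / full):.0f}% of per-layer FLOPs")
```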

The authors first empirically investigate the existence of computation-level redundancy. By masking the vision-to-vision attention computation in different layers of various LLM backbones (Vicuna, Llama3, Qwen2, Phi3, InternLM2.5), they find that masking attention in later layers has minimal impact on performance, even on fine-grained visual tasks. This confirms that redundancy exists, particularly in the middle and later layers of the LLM, though the exact layers vary across models. Finetuning with vision-to-all attention skipped partially mitigates the performance drop compared to training-free masking, but it yields only limited FLOPs reduction because the FFN cost remains. A sketch of such a masking probe appears below.
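A minimal sketch of how a vision-to-vision masking probe could be implemented with an additive attention bias; the function name and the choice to keep each token's own position unmasked are assumptions, not the authors' exact setup.

```python
import torch

def mask_vision_to_vision(attn_bias: torch.Tensor, vis_idx: torch.Tensor) -> torch.Tensor:
    """Block vision-to-vision attention in one layer via an additive bias.

    attn_bias: additive attention bias of shape (..., seq_len, seq_len),
               where 0 means "attend" and -inf means "blocked".
    vis_idx:   1-D tensor of vision-token positions in the sequence.
    """
    bias = attn_bias.clone()
    rows = vis_idx.view(-1, 1)
    cols = vis_idx.view(1, -1)
    bias[..., rows, cols] = float("-inf")   # vision queries cannot see vision keys
    bias[..., vis_idx, vis_idx] = 0.0       # but each vision token keeps its own position
    return bias
```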

Exploring further, the paper investigates skipping both attention and FFNs on vision tokens and replacing these operations with lightweight MLPs that update the vision tokens. Directly skipping without any replacement severely degrades performance, whereas the lightweight MLP replacement significantly reduces FLOPs and time. Interestingly, this replacement can sometimes even improve performance, which the authors attribute to the lightweight MLPs acting as decoupled vision-specific modules that process vision-specific information without interfering with the LLM's core capabilities. However, even with the MLP replacement, a performance drop remains when it is applied in early layers.
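A minimal sketch of what such a lightweight replacement module might look like; the class name, bottleneck ratio, and normalization choice are assumptions rather than the paper's exact design.

```python
import torch.nn as nn

class LightweightVisionMLP(nn.Module):
    """Small bottleneck MLP that stands in for a layer's attention + FFN on vision
    tokens. hidden_ratio is an assumed hyperparameter, not the paper's value."""
    def __init__(self, d_model: int, hidden_ratio: float = 0.25):
        super().__init__()
        hidden = int(d_model * hidden_ratio)
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )

    def forward(self, vis_tokens):
        # Residual update: original vision features are preserved and lightly refined.
        return vis_tokens + self.mlp(self.norm(vis_tokens))
```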

To overcome this, the authors propose ProxyV, a novel method designed to reduce computation while preserving performance. The key idea is to use a small set of "proxy" vision tokens to participate in the heavy computations (attention and FFNs) within the LLM layers, acting as intermediaries. These proxy tokens, after being updated by the heavy operations, then guide the update of the original, full set of vision tokens through a lightweight guided-update module.

In the spatial version of ProxyV, the proxy tokens are obtained by spatially downsampling the original vision tokens. For example, a 24×24 grid of tokens might be downsampled to a 6×6 grid of proxy tokens (downsampling factor r = 4). Within an LLM layer, proxy tokens and text tokens form the queries, while the keys and values include proxy, full vision, and text tokens. Only the proxy and text tokens pass through the FFNs. The guided-update module uses a lightweight MLP to update each original vision token based on its spatially aligned proxy token. This design ensures that essential information from the heavy computations is transferred to all original tokens via the proxies, mitigating performance loss while achieving substantial efficiency gains. Experiments show that applying ProxyV from middle or later layers maintains or even improves performance on fine-grained benchmarks (e.g., 101-102.4% relative score) with significant reductions in prefilling FLOPs (36-46%) and time (31-41%) across various LLM backbones. A preliminary analysis suggests that the vision-specific MLPs improve text-vision alignment (lower MIR scores).
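The following PyTorch sketch illustrates the mechanics of one spatial ProxyV-style layer under simplifying assumptions (causal masking omitted, module names and sizes hypothetical); it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialProxyVLayer(nn.Module):
    """Illustrative ProxyV-style decoder layer: proxies are an average-pooled grid
    of the vision tokens; only proxies and text take the heavy attention/FFN path,
    and a small guided-update MLP propagates the result back to every vision token."""
    def __init__(self, d_model: int, n_heads: int, grid: int = 24, r: int = 4):
        super().__init__()
        self.grid, self.r = grid, r
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.guide = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.GELU(),
                                   nn.Linear(d_model, d_model))

    def forward(self, vis, txt):
        B, N, D = vis.shape                       # N = grid * grid vision tokens
        g, r = self.grid, self.r
        # 1) Proxy tokens: spatial r x r average pooling (24x24 -> 6x6 for r = 4).
        grid_feat = vis.transpose(1, 2).reshape(B, D, g, g)
        proxy = F.avg_pool2d(grid_feat, r).flatten(2).transpose(1, 2)
        # 2) Heavy path: queries are proxy + text; keys/values also include full vision tokens.
        q = torch.cat([proxy, txt], dim=1)
        kv = torch.cat([proxy, vis, txt], dim=1)
        q = q + self.attn(q, kv, kv, need_weights=False)[0]
        q = q + self.ffn(q)                       # only proxy and text tokens hit the FFN
        proxy_new, txt_new = q[:, :proxy.size(1)], q[:, proxy.size(1):]
        # 3) Guided update: each vision token is refined by its spatially aligned proxy.
        proxy_map = proxy_new.transpose(1, 2).reshape(B, D, g // r, g // r)
        proxy_up = F.interpolate(proxy_map, scale_factor=r, mode="nearest")
        proxy_up = proxy_up.flatten(2).transpose(1, 2)          # (B, N, D)
        vis_new = vis + self.guide(torch.cat([vis, proxy_up], dim=-1))
        return vis_new, txt_new
```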

The paper compares ProxyV with state-of-the-art token reduction methods like VisionZip and PyramidDrop. While token reduction methods achieve similar performance on general benchmarks, they exhibit significant performance degradation on tasks requiring dense or fine-grained visual information, such as document parsing and visual grounding (RefCOCO), highlighting their inherent information loss problem. ProxyV, which preserves all vision tokens, performs much better in these sensitive scenarios.

Recognizing that computation reduction and token reduction are orthogonal, the paper also proposes a non-spatial variant of ProxyV to enable combination with token reduction methods. This variant uses learnable queries and attention to generate proxy tokens as weighted combinations of full tokens, without relying on spatial structure. It reuses the attention weights to guide the update of full tokens based on proxy tokens. This non-spatial ProxyV variant achieves comparable performance to the spatial one and can be combined with VisionZip to achieve further efficiency boosts (62% FLOPs reduction, 65% time reduction) while maintaining performance.
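A sketch of how proxy generation and weight reuse could work in this non-spatial setting; the names, proxy count, and the direct reuse of the softmax weights for scattering updates back to full tokens are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class NonSpatialProxyGen(nn.Module):
    """Hypothetical non-spatial proxy generation: learnable queries attend over the
    full vision tokens to form proxies, and the same attention weights are reused
    to distribute the updated proxies back onto every full token."""
    def __init__(self, d_model: int, n_proxy: int = 36):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_proxy, d_model) * 0.02)
        self.scale = d_model ** -0.5

    def make_proxies(self, vis):
        # vis: (B, N, D) full vision tokens -> proxies as weighted combinations of them.
        logits = torch.einsum("pd,bnd->bpn", self.queries, vis) * self.scale
        w = logits.softmax(dim=-1)                  # (B, P, N) mixing weights
        proxies = torch.einsum("bpn,bnd->bpd", w, vis)
        return proxies, w

    def guided_update(self, vis, proxies_updated, w, update_mlp):
        # Reuse the attention weights: each full token receives a weighted mix of the
        # updated proxies, then a lightweight MLP (caller-provided, maps 2*D -> D)
        # produces its residual update.
        per_token = torch.einsum("bpn,bpd->bnd", w, proxies_updated)
        return vis + update_mlp(torch.cat([vis, per_token], dim=-1))
```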

The experiments were conducted using a two-stage training pipeline (pretraining the projector and new modules, then instruction finetuning with the LLM unfrozen) on various LLaVA-Next based models with AnyRes image encoding. Detailed configurations for ProxyV (e.g., downsampling factor, number of proxy tokens, MLP size) and for the baselines are provided. Efficiency metrics (FLOPs, time) are measured during the prefilling stage.

In conclusion, the paper successfully identifies and addresses computation-level redundancy in LMMs through the proposed ProxyV method. ProxyV achieves significant computational savings without sacrificing performance and can even enhance it, particularly when applied from middle or later layers. Its ability to preserve all vision tokens makes it robust for tasks requiring fine-grained details, unlike token reduction methods. The flexible non-spatial variant further allows for potential combinations with token reduction techniques for maximum efficiency.