- The paper reveals that less than 5% of attention heads are crucial for visual processing in MLLMs, challenging the assumption of uniform contribution.
- The paper introduces a training-free KV-Cache optimization framework that quantifies head-level visual relevance to efficiently allocate computational resources.
- The study achieves a 1.38× real-time acceleration and a 52% memory reduction during generation, demonstrating substantial efficiency gains for scalable multimodal applications.
Overview of SparseMM: Head Sparsity in Multimodal LLMs
The paper "SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs" presents an incisive exploration into the sparsity phenomenon observed in Multimodal LLMs (MLLMs) and introduces SparseMM, a computationally efficient KV-Cache optimization framework. Authored by researchers from Tsinghua University and Tencent Hunyuan Research, the paper examines the attention mechanisms of MLLMs which integrate visual capabilities alongside traditional text processing.
Key Discoveries and Methodologies
The authors find that only a small subset of attention heads in an MLLM, fewer than 5%, actively contribute to visual understanding; they refer to these as visual heads. This finding challenges the assumption that attention heads contribute uniformly to multimodal comprehension. The paper proposes a training-free framework that identifies these visual heads by quantifying head-level visual relevance, a step that is pivotal for allocating computational resources efficiently.
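To make the head-identification step concrete, the sketch below scores each head by the attention mass it places on visual-token positions and keeps the top few percent as visual heads. This is a minimal illustration rather than the paper's exact scoring protocol: the function name `score_visual_heads`, the tensor shapes, and the 5% cutoff are assumptions made for the example.

```python
import torch

def score_visual_heads(attn_maps, visual_mask, top_frac=0.05):
    """Rank attention heads by the attention mass they place on visual tokens.

    attn_maps:   (layers, heads, seq_len, seq_len) attention weights, ideally
                 averaged over a probing set of image-question pairs.
    visual_mask: (seq_len,) bool tensor marking visual-token positions.
    Returns per-head scores and the (layer, head) indices of the top heads.
    """
    # Attention each head directs at visual tokens, averaged over query positions.
    mass_on_visual = attn_maps[..., visual_mask].sum(dim=-1)  # (layers, heads, seq_len)
    scores = mass_on_visual.mean(dim=-1)                      # (layers, heads)

    # Keep only the top `top_frac` fraction of all heads as "visual heads".
    flat = scores.flatten()
    k = max(1, int(top_frac * flat.numel()))
    top = torch.topk(flat, k).indices
    heads = [(int(i) // scores.shape[1], int(i) % scores.shape[1]) for i in top]
    return scores, heads

# Toy usage: 8 layers x 8 heads, 256-token sequence with 196 visual tokens.
attn = torch.rand(8, 8, 256, 256).softmax(dim=-1)
visual = torch.zeros(256, dtype=torch.bool)
visual[5:201] = True
scores, visual_heads = score_visual_heads(attn, visual)
print(f"{len(visual_heads)} of {scores.numel()} heads selected as visual heads")
```

In practice the attention statistics would be gathered from targeted probing examples rather than random tensors, but the ranking-and-thresholding structure stays the same.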
SparseMM exploits this sparsity through a KV-Cache optimization strategy that allocates computational budgets asymmetrically across heads according to their visual relevance scores, preserving visual semantics without sacrificing efficiency across multimodal tasks. Comparative evaluations against existing KV-Cache methods show that SparseMM achieves superior accuracy-efficiency trade-offs, delivering a 1.38× real-time acceleration and a 52% reduction in memory usage during generation.
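The asymmetric budgeting idea can be sketched as follows, assuming a score-proportional split of a per-layer cache budget with a uniform floor for non-visual heads and a simple attention-based eviction rule within each head. The paper's actual allocation and eviction policies may differ; all function names and constants here are illustrative.

```python
import torch

def allocate_head_budgets(scores, total_budget, min_budget=32):
    """Split a per-layer KV-Cache token budget across heads by visual relevance.

    scores:       (heads,) visual-relevance scores for one layer.
    total_budget: total number of cached tokens allowed for the layer.
    min_budget:   uniform floor so non-visual heads keep some local context.
    """
    num_heads = scores.numel()
    floor = min_budget * num_heads
    assert total_budget > floor, "total budget too small for the uniform floor"

    weights = scores / scores.sum()
    extra = ((total_budget - floor) * weights).floor().long()
    return extra + min_budget                  # (heads,) per-head token budgets

def compress_head_cache(keys, values, attn_to_past, budget):
    """Keep only the `budget` past positions this head attended to most.

    A simple attention-score eviction rule standing in for the paper's policy.
    keys/values: (past_len, head_dim); attn_to_past: (past_len,) recent attention.
    """
    k = min(int(budget), keys.shape[0])
    keep = torch.topk(attn_to_past, k).indices.sort().values  # preserve token order
    return keys[keep], values[keep]

# Toy usage: 8 heads, one strongly visual head, budget of 512 cached tokens.
scores = torch.tensor([0.90, 0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02])
budgets = allocate_head_budgets(scores, total_budget=512)
print(budgets.tolist())  # the visual head retains far more KV entries than the rest
```

The design choice illustrated here is that visual heads keep most of the image-derived cache entries while the remaining heads fall back to a small, uniform budget, which is what yields the memory savings without discarding visual semantics.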
Implications and Future Directions
From a practical standpoint, the research informs the design of more computationally efficient MLLMs and improves their scalability in resource-constrained environments. This is particularly relevant for deploying MLLMs in real-time applications with heavy visual workloads, such as high-resolution image and video understanding.
Theoretically, the findings stimulate discourse on the nuanced roles of attention heads in multimodal learning, inviting further inquiry into architecture designs that better harness the capabilities of visual heads. Future research can build on SparseMM to explore modality combinations beyond vision-language integration and to adapt the approach for emerging models with more complex input modalities.
In summary, the paper makes a compelling case for recognizing that only a sparse set of attention heads handles visual content, paving the way for more targeted approaches to optimizing MLLM performance. As AI systems increasingly engage with diverse data types and modalities, the insights provided by SparseMM can guide the development of next-generation models that meet the demands of both computational efficiency and multimodal comprehension.