Representation Shift: Unifying Token Compression with FlashAttention (2508.00367v1)

Published 1 Aug 2025 in cs.CV

Abstract: Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes it incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token's representation. This seamlessly integrates token compression with FlashAttention, without attention maps or retraining. Our method further generalizes beyond Transformers to CNNs and state space models. Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding significant speedups of up to 5.5× and 4.4× in video-text retrieval and video QA, respectively. Code is available at https://github.com/mlvlab/Representation-Shift.

Summary

The paper introduces Representation Shift to quantify token importance based on the change in token representation, enabling training-free and model-agnostic token compression.
Integrating with FlashAttention and architectures like CNNs and SSMs, the method achieves up to a 5.5× speedup on video-text retrieval and improved vision performance.
Robust experimental and qualitative analyses confirm that Representation Shift effectively identifies salient tokens, enhancing computational efficiency in vision tasks.

Representation Shift for Efficient Token Compression

The paper "Representation Shift: Unifying Token Compression with FlashAttention" (2508.00367) introduces Representation Shift, a novel metric for token importance in neural networks. This metric quantifies the change in a token's representation as it passes through a layer, enabling training-free and model-agnostic token compression. The approach is compatible with FlashAttention and generalizes to CNNs and SSMs, offering significant speedups in various vision tasks.

Background and Motivation

Transformers have become prevalent in vision tasks, but their quadratic complexity limits scalability. Token compression techniques reduce computational cost by pruning or merging less informative tokens. FlashAttention optimizes GPU memory access, but is incompatible with attention-map-dependent token compression methods. This paper addresses these limitations by introducing a token importance criterion based on representation shift, applicable across different architectures and compatible with FlashAttention.
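
As a rough illustration of why dropping tokens helps, the sketch below counts multiply-adds for the two matrix products in one self-attention head ($QK^\top$ and the attention-weighted sum over $V$), each of which scales as $N^2 d$ for $N$ tokens of dimension $d$. The function name, token counts, and head dimension are hypothetical values chosen for the arithmetic. Projections, softmax, and memory traffic are ignored, so this is not a prediction of the end-to-end speedups reported in the paper.

```python
def attention_matmul_macs(num_tokens, head_dim):
    """Rough multiply-add count for QK^T plus the weighted sum over V.

    Each product costs about num_tokens^2 * head_dim multiply-adds.
    Illustrative estimate only; it ignores projections and softmax.
    """
    return 2 * num_tokens * num_tokens * head_dim


# Hypothetical sizes: 1000 tokens before pruning, 500 after.
full_cost = attention_matmul_macs(1000, 64)
pruned_cost = attention_matmul_macs(500, 64)

# Halving the token count quarters this part of the cost.
ratio = pruned_cost / full_cost
print(full_cost, pruned_cost, ratio)
```

Because the cost is quadratic in the token count, keeping half the tokens leaves a quarter of these multiply-adds.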

Method: Representation Shift

Representation shift, denoted as $\Delta \mathbf{x}$, is defined as the distance between the input and output representations of a token after transformation by a layer:

$\Delta \mathbf{x} = \mathcal{D} (F(\mathbf{x}), \mathbf{x})$

where $F(\cdot)$ is the layer's transformation (e.g., Attention or MLP), and $\mathcal{D}$ is a distance metric like the L2 norm. The hypothesis is that critical tokens exhibit larger representation shifts. (Figure 1) illustrates the concept.
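
The definition can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: `representation_shift`, the three distance helpers, and `toy_layer` are hypothetical names, and the toy layer stands in for a real Attention or MLP layer only to make the arithmetic easy to follow.

```python
import math


def l1_distance(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def cosine_distance(a, b):
    # Undefined for zero vectors; the toy inputs below avoid that case.
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)


def representation_shift(tokens, layer, distance=l2_distance):
    """Return one score per token: D(F(x), x)."""
    outputs = layer(tokens)  # F is applied to the whole token sequence
    return [distance(y, x) for x, y in zip(tokens, outputs)]


def toy_layer(tokens):
    # Hypothetical stand-in for F: doubles tokens whose first
    # coordinate is positive and leaves the others unchanged.
    return [[2 * v for v in x] if x[0] > 0 else list(x) for x in tokens]


tokens = [[3.0, 4.0], [-1.0, 2.0], [0.5, 0.0]]
print(representation_shift(tokens, toy_layer))               # [5.0, 0.0, 0.5]
print(representation_shift(tokens, toy_layer, l1_distance))  # [7.0, 0.0, 0.5]
```

Under the stated hypothesis, the first token (shift 5.0) would be ranked as most important and the unchanged second token (shift 0.0) as least important. Note that `cosine_distance` scores the first token near zero in this toy case, because doubling a vector changes its length but not its direction.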

Figure 2: Comparison of importance metrics for token pruning (average over 7 video-text retrieval benchmarks reported in the paper's retrieval table). Pruning with a conventional attention-based score (Attn) yields poor speed-accuracy trade-offs on UMT-L and is incompatible with FlashAttention (FA). In contrast, our proposed representation shift accelerates both vanilla UMT-L and UMT-L with FlashAttention, achieving superior trade-offs compared to downscaling to UMT-B and attention-based scores.

Implementation Details and Ablations

The authors explored different operation choices for computing representation shift, including attention layers, MLPs, and entire attention blocks. They found that using the MLP layer yielded the best results. Various distance metrics, such as L1 norm, L2 norm, and cosine distance, were also evaluated, with the L2 norm performing most robustly. For Vision Transformers, the attention blocks are computed as:

$\mathbf{x}^\prime = \text{SA}(\text{LN}(\mathbf{x})) + \mathbf{x}$

$\hat{\mathbf{x}} = \text{MLP}(\text{LN}(\mathbf{x}^\prime)) + \mathbf{x}^\prime$

where LN is Layer Normalization. (Figure 3) visualizes the effects of these different choices.

Figure 1: Illustration of representation shift for token importance. We compute the L2 distance between token representations before and after the MLP layer to quantify how much each token is emphasized by the transformation.

Experimental Results

Extensive experiments were conducted on video and image understanding tasks. On video-text retrieval with UMT, representation shift achieved a $5.5\times$ speedup. (Figure 4) visualizes representation shift in image tokens. The method also demonstrated strong performance on video question-answering tasks. For image classification, representation shift consistently outperformed attention-based scores with DeiT models. Furthermore, the approach was successfully extended to ResNet CNNs and Vision Mamba SSMs, demonstrating its model-agnostic nature.
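
To make the pruning step behind these results concrete, the sketch below runs one pre-norm block ($\mathbf{x}^\prime = \text{SA}(\text{LN}(\mathbf{x})) + \mathbf{x}$, then $\hat{\mathbf{x}} = \text{MLP}(\text{LN}(\mathbf{x}^\prime)) + \mathbf{x}^\prime$), scores each token by the L2 shift across the MLP sub-layer, and keeps the highest-scoring tokens. Several things here are assumptions rather than details from the paper: "before" and "after" the MLP are taken to be $\mathbf{x}^\prime$ and $\hat{\mathbf{x}}$, pruning is applied to the block output, and `block_with_pruning`, `toy_sa`, `toy_mlp`, and `identity_ln` are hypothetical stand-ins for real layers. No attention map is read at any point, which is the property that keeps this kind of scoring usable with fused attention kernels.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def block_with_pruning(tokens, sa, mlp, ln, keep):
    """Run one pre-norm block, then keep the `keep` tokens with largest shift."""
    # x' = SA(LN(x)) + x  (SA mixes information across the whole sequence)
    mixed = sa([ln(x) for x in tokens])
    x_prime = [add(m, x) for m, x in zip(mixed, tokens)]
    # x_hat = MLP(LN(x')) + x'  (MLP acts on each token independently)
    x_hat = [add(mlp(ln(x)), x) for x in x_prime]
    # Assumption: the shift is measured between x' and x_hat.
    scores = [l2_distance(after, before) for before, after in zip(x_prime, x_hat)]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original token order
    return [x_hat[i] for i in kept], kept, scores


# Hypothetical toy sub-layers, chosen so the numbers are easy to check.
def toy_sa(tokens):
    n = len(tokens)
    mean = [sum(col) / n for col in zip(*tokens)]
    return [list(mean) for _ in tokens]


def toy_mlp(x):
    return [max(v, 0.0) for v in x]


def identity_ln(x):
    return list(x)


tokens = [[1.0, 0.0], [0.0, 2.0], [-3.0, 1.0], [4.0, 4.0]]
pruned, kept, scores = block_with_pruning(tokens, toy_sa, toy_mlp, identity_ln, keep=2)
print(kept)    # [1, 3]
print(pruned)  # [[1.0, 7.5], [9.0, 11.5]]
```

Only the sub-layer's input and output are needed to compute the scores, so the same pattern could in principle wrap other layer types, in line with the model-agnostic claim above.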

Figure 5: DeiT-S

Qualitative Analysis and Reliability

Qualitative analysis revealed that representation shift effectively captures foreground objects and salient regions, aligning with saliency detection concepts. Reliability analysis through extreme pruning experiments confirmed the robustness of representation shift as an importance metric. (Figure 6) provides a qualitative comparison between attention scores and representation shift.

Figure 4: Visualization of representation shift. Given the image (left), we visualize (right) the representation shift ($\Delta \mathbf{x}$) of each token before and after the attention layer.

Conclusion

Representation Shift offers a training-free, model-agnostic approach to token importance estimation. Its compatibility with FlashAttention and generalizability to various architectures make it a versatile solution for enhancing the efficiency of vision models. Qualitative and quantitative results highlight its potential as an improved token importance criterion for efficient token compression.

Figure 3: Operation choice

Figure 6: Qualitative comparison between attention scores (Attn) and representation shift (Ours). Given each sample, we visualize (a) the attention scores with respect to the class token and (b) representation shift in the [1,5,9] layers of DeiT-B (Touvron et al., 2021).

Figure 7: Visualization of representation shift in ResNet-50 (He et al., 2016).

Authors (6)