Express Language Modeling

Published 9 Jun 2026 in cs.LG, cs.DS, math.ST, stat.ME, and stat.ML | (2606.10944v1)

Abstract: We introduce a new tool, Express, for converting a non-causal attention approximation into a causal approximation with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering $\log^{3/2}(n)/s$ approximation error with only $O(s)$ memory and $O(s^2 \log^2(n))$ compression overhead for a sequence of length $n$. We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.


Authors (4)

Summary

The paper presents Express, a meta-procedure that converts unmasked attention approximations into causal approximations with provable error guarantees, and instantiates it with Thinformer.
It introduces a custom I/O-aware Triton GPU kernel that accelerates long-context inference, achieving up to 82× speedup and reduced memory usage.
Empirical evaluations show significant improvements in prefill, KV cache compression, and long-form decoding compared to prior state-of-the-art methods.

Express Language Modeling: Efficient Causal Attention via Unmasked Approximation Conversion

Overview and Motivation

The paper "Express Language Modeling" (2606.10944) presents Express, a meta-procedure for converting non-causal (unmasked) attention approximations into causal (masked) approximations with strong accuracy and resource guarantees. Express is instantiated with the Thinformer approximation, yielding Thinformer Express: an approach to causal attention with provable error rates and substantially reduced memory and computational requirements. The significance of this work lies in enabling efficient and accurate long-context inference for LLMs, addressing bottlenecks in four canonical resource-constrained scenarios: long-context prefill, KV cache compression, memory-constrained long-form decoding, and compute-constrained long-form decoding.

Theoretical Foundations

Sub-Quadratic Thinning and Attention Approximation

Recent advances provide sub-quadratic approximations for unmasked attention, often by retaining a concise, weighted subset (coreset) of key-value pairs using thinning algorithms. Such methods, notably Thinformer, employ sub-Gaussian thinning tailored to preserve attention outputs for all queries with $O(s)$-sized coresets, where $s$ is the desired cache size. However, causal attention, the backbone of language modeling inference, requires masking and streaming updates, which complicates coreset maintenance and approximation guarantees.
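To make the coreset idea concrete, here is a minimal sketch of single-query softmax attention computed over a weighted subset of key-value pairs. The function name `attention` and the toy data are hypothetical, and the coreset here is chosen by hand (duplicate keys are merged into one weighted entry); it does not show how Thinformer selects or weights its coreset.

```python
import math

def attention(query, keys, values, weights=None):
    """Softmax attention for one query, with optional per-key coreset weights."""
    if weights is None:
        weights = [1.0] * len(keys)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max score for numerical stability
    mass = [w * math.exp(s - top) for w, s in zip(weights, scores)]
    total = sum(mass)
    dim = len(values[0])
    return [sum(m * v[j] for m, v in zip(mass, values)) / total for j in range(dim)]

# Toy data (assumed for illustration): key k1 appears three times, k2 once.
k1, k2 = [1.0, 0.0], [0.0, 1.0]
v1, v2 = [1.0, 2.0], [3.0, -1.0]
query = [0.5, -0.25]

exact = attention(query, [k1, k1, k1, k2], [v1, v1, v1, v2])
# A two-element coreset with weights 3 and 1 stands in for the four pairs.
weighted = attention(query, [k1, k2], [v1, v2], weights=[3.0, 1.0])
# Dropping the weights changes the answer, which is why the subset is weighted.
unweighted = attention(query, [k1, k2], [v1, v2])
```

In this contrived case the weighted coreset reproduces the exact output up to floating-point rounding; in general a coreset only approximates it.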

The Express Meta-Procedure

Express generalizes thinning by recursively applying a "halving" algorithm, tailored for offline coresets, and introduces a streaming variant that enables cache updates in causal order. Express maintains a dynamically thinned cache with bounded size and incremental updatability while controlling error propagation in the streaming setting. Notably, the sub-Gaussian concentration of error grows only logarithmically with the input sequence length $n$, cache memory scales with the coreset size $s$ rather than with $n$, and the compression overhead depends on $n$ only through logarithmic factors.
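The recursive structure can be sketched as follows. This is an illustration only: `halve` here is a random stand-in that keeps one element from each adjacent pair, not the kernel halving algorithm used by Thinformer, and both function names are hypothetical.

```python
import random

def halve(items, rng):
    """Stand-in halving: keep one randomly chosen element per adjacent pair."""
    return [rng.choice(items[i:i + 2]) for i in range(0, len(items), 2)]

def thin(items, target, rng):
    """Apply halving repeatedly until at most `target` items remain."""
    if target < 1:
        raise ValueError("target must be at least 1")
    while len(items) > target:
        items = halve(items, rng)
    return items

rng = random.Random(0)
coreset = thin(list(range(64)), 8, rng)  # 64 -> 32 -> 16 -> 8 items
```

Because each round keeps one element per adjacent pair, the surviving items stay in their original order, which is convenient when the items are tokens in causal order.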

For a sequence of length $n$, Express paired with Thinformer achieves an approximation error of $O(\log^{3/2}(n)/s)$, $O(s)$ cache memory, and $O(s^2 \log^2(n))$ compression overhead. These resource profiles markedly improve the scalability of causal attention relative to prior state-of-the-art methods.
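For a rough feel of how these terms scale, the sketch below evaluates the stated expressions with all hidden constants dropped and natural logarithms assumed. The function `express_profile` is hypothetical, and the returned numbers are growth terms, not measured costs or actual error values. The count of query-key pairs in exact causal attention, $n(n+1)/2$, is included only as a scale reference.

```python
import math

def express_profile(n, s):
    """Constant-free scaling terms taken from the stated big-O bounds."""
    log_n = math.log(n)  # natural log assumed; the base only changes constants
    return {
        "error": log_n ** 1.5 / s,            # log^{3/2}(n) / s
        "memory": s,                          # O(s) cache entries
        "compression": s ** 2 * log_n ** 2,   # s^2 log^2(n)
        "exact_causal_pairs": n * (n + 1) // 2,
    }

small = express_profile(n=2 ** 19, s=1024)
large = express_profile(n=2 ** 19, s=2048)
# Doubling s halves the error term and quadruples the compression term.
```

The trade-off visible here is the one the bounds describe: a larger coreset buys a smaller error term at a quadratic cost in compression overhead, while $n$ enters only through logarithms.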

Methodological Implementation

Thinformer Express is implemented with a custom I/O-aware Triton GPU kernel, significantly accelerating runtime relative to both the previous PyTorch Thinformer and FlashAttention 2. The system exploits tiling for optimal memory access, batch-level and row-parallelism, and avoids explicit kernel matrix materialization via index-based dereferencing.
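The tiling and index-based access described above can be illustrated in plain Python. This sketch uses the standard running-softmax (online softmax) technique to process selected keys one tile at a time, and it reads keys and values through an index list rather than copying them. It is not the paper's Triton kernel; `tiled_attention`, the tile size, and the requirement that `index` be non-empty are all assumptions of the example.

```python
import math

def tiled_attention(query, keys, values, index, tile=4):
    """Attention over keys[i], values[i] for i in index, one tile at a time."""
    run_max = -math.inf            # largest score seen so far
    denom = 0.0                    # running softmax denominator
    acc = [0.0] * len(values[0])   # running weighted sum of values
    for start in range(0, len(index), tile):
        block = index[start:start + tile]
        scores = [sum(q * k for q, k in zip(query, keys[i])) for i in block]
        new_max = max(run_max, max(scores))
        scale = math.exp(run_max - new_max)  # rescale earlier partial sums
        denom *= scale
        acc = [a * scale for a in acc]
        for i, score in zip(block, scores):
            p = math.exp(score - new_max)
            denom += p
            acc = [a + p * v for a, v in zip(acc, values[i])]
        run_max = new_max
    return [a / denom for a in acc]
```

Only one tile of scores exists at any time, so the full score matrix is never stored, and the index list means a thinned cache can refer to rows of the original key and value arrays in place.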

Express's key phases include:

Exact Phase: Initially, all input tokens are retained exactly up to cache size $s$.
Thin Phase: Incoming tokens are batched and recursively thinned to $s$ elements.
Halve Phase: Once the cache grows past a size threshold tied to $s$, halving is applied twice to shrink the cache, ensuring bounded memory.

This design guarantees that the attention cache holds at most $O(s)$ entries, so GPU memory usage is bounded independently of context size.
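The three phases can be sketched as a toy streaming cache. Several details are assumptions made for illustration rather than values from the paper: the class name `StreamingCache`, the cap of `4 * s` entries, the batch size, and the random pairwise halving used in place of the actual halving algorithm.

```python
import random

class StreamingCache:
    """Toy causal cache with exact, thin, and halve phases (illustrative only)."""

    def __init__(self, s, batch_size, cap_factor=4, seed=0):
        self.s = s
        self.batch_size = batch_size
        self.cap = cap_factor * s   # hypothetical threshold, not from the paper
        self.cache = []             # retained tokens, kept in causal order
        self.buffer = []            # incoming tokens waiting to be thinned
        self.seen = 0
        self.rng = random.Random(seed)

    def _halve(self, items):
        # Stand-in halving: keep one random element from each adjacent pair.
        return [self.rng.choice(items[i:i + 2]) for i in range(0, len(items), 2)]

    def _thin(self, items, target):
        while len(items) > target:
            items = self._halve(items)
        return items

    def append(self, token):
        self.seen += 1
        if self.seen <= self.s:                      # exact phase
            self.cache.append(token)
            return
        self.buffer.append(token)
        if len(self.buffer) == self.batch_size:      # thin phase
            self.cache.extend(self._thin(self.buffer, self.s))
            self.buffer = []
            if len(self.cache) > self.cap:           # halve phase
                self.cache = self._halve(self._halve(self.cache))

cache = StreamingCache(s=16, batch_size=64)
peak = 0
for token in range(10_000):
    cache.append(token)
    peak = max(peak, len(cache.cache))
```

With these settings the cache length observed after each `append` stays at or below the assumed cap of 64 entries regardless of how many tokens are streamed, which is the bounded-memory behavior the phases are designed to provide.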

Empirical Evaluation

Long-Context Prefill Acceleration

Express delivers up to $82\times$ speedup over FlashAttention 2 at 512K tokens in the prefill phase with masked attention, outperforming both FlashAttention and HyperAttention in runtime and in the runtime-perplexity tradeoff on LongBench-E tasks.

Figure 1: Express achieves substantial speedups in long-context prefill, especially on masked attention tasks, with resource usage and accuracy superior to HyperAttention.

KV Cache Compression

Express seamlessly integrates as a drop-in replacement for conventional key-value computation in leading cache compression pipelines (SnapKV, StreamingLLM, PyramidKV). This reduces overall runtime for long-context language understanding benchmarks, with no degradation in downstream accuracy.

Figure 2: Across a spectrum of cache compression strategies, Express consistently accelerates attention without compromising accuracy across LongBench-E tasks.

Memory- and Compute-Efficient Long-Form Decoding

On multi-step MATH-500 mathematical reasoning tasks, Express allows models to operate with only 61% of the conventional cache memory while retaining exact-attention-level accuracy. It also reduces computational cost per token, matching accuracy with just 56% of the compute time, thereby outperforming contemporary alternatives (StreamingLLM, SnapKV, ExpectedAttention, KeyDiff, Knorm) in both memory-constrained and compute-constrained regimes.

Figure 3: Express dominates state-of-the-art methods in both memory and compute efficiency on challenging long-form generation tasks, matching the accuracy of exact attention with reduced resources.

Comparison to Prior Work

Express's theoretical guarantees surpass those of BalanceKV and HyperAttention. Its error bounds are tighter in both magnitude and dependence on context size, value magnitudes, and the error inflation factor. For the same computational budget, the error bound for Express decays faster than that of HyperAttention. Moreover, Express achieves a better value-matrix dependence and uses less peak memory and compute for similar or better accuracy guarantees.

Implications and Future Developments

Practical Impact

Express presents a robust framework for integrating high-quality unmasked attention approximations into the causal inference pipeline of LLMs without loss of theoretical guarantees or practical efficiency. This unblocks the use of advanced thinning methods in production-scale LLM deployments, especially for extremely long contexts on resource-constrained devices. Applications include efficient chat assistants, document processing, mathematical reasoning systems, and stream-processing LLM scenarios.

Theoretical and Methodological Extensions

Express is agnostic to the base halving algorithm: any new, higher-quality thinning method can potentially be plugged in to realize further efficiency or accuracy improvements. Further, the analytical results open avenues for hybrid compression strategies and compressed memory architectures leveraging causal streaming coresets in transformer inference.

Future developments may involve:

Exploring alternative base thinning algorithms and kernels tailored for specific value distributions or data modalities.
Extending the Triton implementation to support emerging GPU features (e.g., FP8, persistent kernels, TMAs), which were highlighted as promising by the authors.
Broadening empirical evaluation to non-English languages, multimodal LLMs, and diverse domain tasks.

Conclusion

Express enables practical, theoretically principled, and resource-efficient causal attention approximation using sub-quadratic unmasked methods. Through the Thinformer Express instantiation, the approach demonstrates strictly improved performance and guarantees over existing work in multiple critical LLM inference regimes. The modular design, strong error control, and efficient engineering provide a foundation for scalable, deployable attention at unprecedented context lengths.
