
Attention-Based Transform Coding Model

Updated 7 December 2025
  • Attention-based transform coding models replace or augment traditional convolutional transforms with neural attention to capture both local and global dependencies.
  • They employ advanced entropy models and attention mechanisms, achieving significant bitrate reductions and improved perceptual quality over conventional codecs.
  • Key architectures achieve linear scaling and efficient compression by integrating variants such as Bi-RWKV, cross-attention, and DCT-based modules.

An attention-based transform coding model is a class of learned compression architecture in which the image-to-latent and latent-to-image transforms, as well as contextual entropy models, employ neural attention mechanisms—typically variants of self-attention or cross-attention—in place of or alongside convolutional or autoregressive operators. These architectures exploit attention’s ability to capture long-range or structured dependencies, yielding improved coding efficiency, better semantic fidelity, and, in some designs, linear (rather than quadratic) scaling with input size. Recent research demonstrates that attention-based transform coding models enable substantial bitrate reductions over classical and prior learned codecs, and comprise a key frontier in practical, high-performance image and multimodal compression.

1. Core Principles of Attention-Based Transform Coding

The defining characteristic of attention-based transform coding models is the replacement or augmentation of classical convolutional transforms with neural attention blocks in the analysis, synthesis, and entropy modeling stages. In vanilla (non-attention) learned compression, the core structure is a nonlinear transform autoencoder: input images are encoded by a learnable analysis transform ($g_a$) into a latent representation, quantized, entropy-coded, and then reconstructed by a synthesis transform ($g_s$). Attention-based coding models modify this pipeline as follows:

  • Analysis and Synthesis via Attention: The transforms $g_a$ and $g_s$, as well as the hyperprior branches ($h_a$, $h_s$), may use stacks of attention modules (multi-head self-attention, cross-attention, or windowed/channelized attention) instead of, or in addition to, convolutional layers (Feng et al., 9 Feb 2025, Luka et al., 2023, Mortaheb et al., 2 Dec 2024, Xu et al., 21 Sep 2024, Mudgal et al., 28 Oct 2024).
  • Contextual Entropy Modeling: The probability models used for entropy coding (e.g., Gaussian or mixture prior over quantized latents) are made context- or autoregressively dependent using attention for improved rate estimation (Feng et al., 9 Feb 2025, Xu et al., 21 Sep 2024).
  • Transform Coding with Classical Bases: Some designs exploit classical transforms such as DCT within attention modules, integrating frequency-domain decorrelation into neural attention layers for parameter-efficient and complexity-reduced compression (Pan et al., 22 May 2024).

The shift to attention-based coding is motivated by the need to model both global (long-range, semantic) and local (texture, detail) dependencies—overcoming the limited receptive field or inductive bias of convolution alone.
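A minimal PyTorch-style sketch of this pipeline is given below. The module layout, channel counts, and the toy `AttnBlock` are illustrative assumptions, not the architecture of any cited model; quantization is approximated by the common additive-noise proxy during training.

```python
# Minimal sketch of an attention-augmented transform coding pipeline.
# All module names, channel counts, and the AttnBlock design are assumptions
# for illustration, not the architecture of any specific paper cited above.
import torch
import torch.nn as nn

class AttnBlock(nn.Module):
    """Self-attention over the spatial positions of a feature map (toy example)."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)       # (B, H*W, C) token sequence
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.transpose(1, 2).view(b, c, h, w)

class ToyAttentionCodec(nn.Module):
    def __init__(self, c=128):
        super().__init__()
        # Analysis transform g_a: strided convs interleaved with attention.
        self.g_a = nn.Sequential(
            nn.Conv2d(3, c, 5, stride=2, padding=2), nn.GELU(),
            nn.Conv2d(c, c, 5, stride=2, padding=2), AttnBlock(c),
        )
        # Synthesis transform g_s mirrors g_a with transposed convolutions.
        self.g_s = nn.Sequential(
            AttnBlock(c), nn.ConvTranspose2d(c, c, 5, 2, 2, output_padding=1),
            nn.GELU(), nn.ConvTranspose2d(c, 3, 5, 2, 2, output_padding=1),
        )

    def forward(self, x):
        y = self.g_a(x)
        # Additive uniform noise as a differentiable quantization proxy in training,
        # hard rounding at inference (a common convention in learned compression).
        y_hat = y + torch.empty_like(y).uniform_(-0.5, 0.5) if self.training \
            else torch.round(y)
        return self.g_s(y_hat), y_hat
```

In a full codec the quantized latent `y_hat` would also be passed to a hyperprior and entropy model (Section 4) to estimate its bitrate.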

2. Architectural Realizations

Several major architectures exemplify state-of-the-art attention-based transform coding:

| Model | Transform Mechanism | Entropy Model | Key Attention Mechanism |
|---|---|---|---|
| LALIC (Feng et al., 9 Feb 2025) | Bi-RWKV blocks + Conv/Shift | RWKV-SCCTX (spatial-channel context) | Linear WKV, bi-directional |
| QPressFormer (Luka et al., 2023) | Transformer-only (cross-attention) | MLP factorized, LPIPS loss | Full multi-head cross-attention |
| SCH (Xu et al., 21 Sep 2024) | Wavelet DWT + SCH blocks | Channel-autoregressive Gaussian | Windowed channel/spatial attention |
| CWAM (Mudgal et al., 28 Oct 2024) | Conv + GDN + cross-window attention | Hyperprior, autoregressive model | Cross-scale window attention |
| DCT-Attn (Pan et al., 22 May 2024) | ViT/Swin attention with DCT transform | N/A | Channel-wise DCT in MSA |

Bi-RWKV Linear Attention (LALIC):

LALIC introduces Bi-RWKV blocks throughout all transforms, leveraging parallel spatial-mix and channel-mix branches. Each block linearly projects the normalized input, then applies either a bi-directional weighted key-value (WKV) attention (in space) or a gated squared-ReLU module (in channel). The Bi-RWKV mechanism enables $\mathcal{O}(LD)$ time and memory, compared to $\mathcal{O}(L^2 D)$ for standard full self-attention, where $L$ is the sequence length and $D$ the feature dimension. A convolutional "Omni-Shift" layer serves as a lightweight 2D mixing operator (Feng et al., 9 Feb 2025).

QPressFormer:

QPressFormer dispenses with convolution entirely, implementing both encoder and decoder as transformers with pure attention—specifically cross-attention between learned queries and patch embeddings. The latent codes themselves are attention-aggregated representations (learned queries) with fully factorized entropy coding. This demonstrates that fully-attentional architectures can match or surpass convolutional codecs on perceptual metrics (Luka et al., 2023).

Space-Channel Hybrid (SCH) and Cross-Window Modules:

SCH leverages Haar-wavelet frequency decomposition, then interleaves window-based spatial and channel attention. The SCH block alternates between local spatial and global channel attention in windows, utilizing non-overlapping partitioning for efficiency. Window-based channel attention dramatically increases receptive field, enabling SOTA BD-rate improvements over VVC and prior learned codecs (Xu et al., 21 Sep 2024). Similarly, Cross-Window Attention Modules (CWAM) correlate fine-scale and downsampled feature windows, exploiting both global context and local redundancy (Mudgal et al., 28 Oct 2024).

DCT-Based Attention Compression:

Transform-coding principles are embedded directly into the attention operation by projecting input features (or weight matrices) onto a truncated DCT basis, discarding high-frequency (typically noise-dominated) coefficients. This reduces parameter count and FLOPs with minimal or no loss in accuracy for vision transformers (Pan et al., 22 May 2024). This technique is orthogonal to global/structured attention, offering complexity reduction.

3. Mathematical Formulation and Module Design

Bi-RWKV Block (LALIC)

For input $f \in \mathbb{R}^{H \times W \times C}$ reshaped as $X \in \mathbb{R}^{T \times C}$:

  • Spatial-Mix computes

$$\mathrm{wkv}_t = \frac{\sum_{i=1}^{T} \exp\!\bigl(-\tfrac{|t-i|-1}{T}\,w + k_i\bigr)\,v_i + \exp(u + k_t)\,v_t}{\sum_{i=1}^{T} \exp\!\bigl(-\tfrac{|t-i|-1}{T}\,w + k_i\bigr) + \exp(u + k_t)}$$

followed by sigmoid gating and addition to the input.

  • Channel-Mix performs a squared-ReLU on projected keys and a similar gated fusion.
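The spatial-mix aggregation above can be rendered as a naive $\mathcal{O}(T^2 D)$ NumPy reference that makes the distance-decayed weighting explicit. Actual Bi-RWKV implementations use a linear-time recurrence or fused kernel; this sketch simply follows the formula as written, with all tensor shapes assumed.

```python
# Naive O(T^2 * D) reference for the bidirectional WKV aggregation shown above.
# Real Bi-RWKV implementations use a linear-time recurrence / fused kernel;
# this sketch only makes the weighting explicit, following the formula as written.
import numpy as np

def bi_wkv_naive(k, v, w, u):
    """k, v: (T, D) keys and values; w, u: (D,) learned decay and bonus terms."""
    T, D = k.shape
    out = np.zeros((T, D))
    for t in range(T):
        # Distance-based decay weight for every position i relative to t.
        dist = np.abs(t - np.arange(T))[:, None]            # (T, 1)
        weights = np.exp(-(dist - 1) / T * w + k)           # (T, D)
        num = (weights * v).sum(axis=0) + np.exp(u + k[t]) * v[t]
        den = weights.sum(axis=0) + np.exp(u + k[t])
        out[t] = num / den
    return out

# Example: 16 tokens, 8 channels.
rng = np.random.default_rng(0)
T, D = 16, 8
wkv = bi_wkv_naive(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                   np.abs(rng.normal(size=D)), rng.normal(size=D))
print(wkv.shape)  # (16, 8)
```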

Omni-Shift Layer:

Implements a 2D depthwise $5 \times 5$ convolution to translate sequence-based mixing to 2D feature maps, merged into a single kernel at inference for efficiency.
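As a rough illustration, the layer can be read as per-channel $5 \times 5$ spatial mixing; the multi-branch training-time form and its kernel merge at inference are omitted here, and all sizes are assumptions.

```python
# Sketch of an "Omni-Shift"-style layer as a depthwise 5x5 convolution over the
# 2D feature map. The multi-branch training form and its merge into one kernel
# at inference are not reproduced; channel count and kernel size are assumed.
import torch
import torch.nn as nn

class DepthwiseShift(nn.Module):
    def __init__(self, channels, k=5):
        super().__init__()
        # groups=channels makes the 5x5 convolution depthwise (per-channel mixing).
        self.dw = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)

    def forward(self, x):          # x: (B, C, H, W)
        return self.dw(x)

print(DepthwiseShift(64)(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```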

Attention Coding in QPressFormer

Encoder and decoder learn sets of queries ($Q_E^{in}$, $Q_D^{in}$), which interact with patch embeddings via layers of self- and cross-attention:

$$Q^{n+\frac{1}{2}} = \mathrm{CA}\bigl(\mathrm{SA}[Q^n], \phi(I)\bigr), \qquad Q^{n+1} = Q^{n+\frac{1}{2}} + \mathrm{FFN}^n\bigl(Q^{n+\frac{1}{2}}\bigr)$$

The outputs are quantized and coded with a factorized entropy model. Training optimizes an LPIPS-weighted rate–distortion loss.
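A hedged sketch of one such query-update layer using standard PyTorch attention modules follows; the dimensions, normalization, and residual placement are assumptions and may differ from the published QPressFormer.

```python
# Sketch of one query-update layer in the spirit of the rule above:
# self-attention over learned queries, cross-attention to patch embeddings
# phi(I), then a feed-forward residual. All dimensions are assumptions.
import torch
import torch.nn as nn

class QueryCodingLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, q, patch_emb):
        # SA[Q^n] (with an assumed residual connection).
        q = q + self.self_attn(q, q, q)[0]
        # Q^{n+1/2} = CA(SA[Q^n], phi(I))
        q_half = self.cross_attn(q, patch_emb, patch_emb)[0]
        # Q^{n+1} = Q^{n+1/2} + FFN(Q^{n+1/2})
        return q_half + self.ffn(q_half)

queries = torch.randn(1, 64, 256)    # 64 learned latent queries
patches = torch.randn(1, 196, 256)   # 14x14 patch embeddings phi(I)
print(QueryCodingLayer()(queries, patches).shape)  # torch.Size([1, 64, 256])
```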

Windowed and Cross-Scale Attention

Space-Channel Hybrid blocks interleave within the pipeline:

  • Stage I: Splits input via $1 \times 1$ conv into channel branches, applies windowed spatial attention to one, CNN residual to the other, then merges.
  • Stage II: As above, but with channel attention replacing spatial attention.

Cross-Window modules in (Mudgal et al., 28 Oct 2024) partition feature maps and corresponding downsampled features, applying multi-head self-attention per window, with keys and values computed from the coarse-scale features.
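The cross-scale windowing can be sketched as follows, with queries taken from fine-scale windows and keys/values from the aligned coarse-scale windows. The window size, the nearest-neighbor upsampling used to align scales, and the absence of relative positional bias are simplifying assumptions.

```python
# Illustrative cross-scale window attention: queries from fine-scale windows,
# keys/values from the corresponding coarse-scale windows. Window size, dims,
# and the scale-alignment strategy are assumptions for this sketch.
import torch
import torch.nn as nn

def window_partition(x, win):
    """(B, H, W, C) -> (num_windows*B, win*win, C); assumes H, W divisible by win."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

class CrossWindowAttention(nn.Module):
    def __init__(self, dim=96, heads=4, win=8):
        super().__init__()
        self.win = win
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, fine, coarse):
        # fine: (B, H, W, C); coarse: (B, H//2, W//2, C), upsampled to align windows.
        coarse_up = coarse.permute(0, 3, 1, 2)
        coarse_up = nn.functional.interpolate(coarse_up, scale_factor=2, mode="nearest")
        coarse_up = coarse_up.permute(0, 2, 3, 1)
        q = window_partition(fine, self.win)
        kv = window_partition(coarse_up, self.win)
        return self.attn(q, kv, kv)[0]        # per-window cross attention

fine = torch.randn(1, 32, 32, 96)
coarse = torch.randn(1, 16, 16, 96)
print(CrossWindowAttention()(fine, coarse).shape)  # torch.Size([16, 64, 96])
```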

DCT-Decorrelated Attention

Vision transformer attention blocks include a projection onto a truncated DCT basis:

$$\widetilde{X} = X\,\bar{\mathcal{D}}^T$$

The reduced $\tau C$ channels are linearly projected, and after attention, an inverse DCT (with zero-padding) reconstructs the output. This yields strict parameter and FLOP reductions (e.g., 13% fewer parameters at $\tau = 0.75$) (Pan et al., 22 May 2024).
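As a standalone illustration of channel-wise DCT truncation (outside of any attention block), a minimal NumPy sketch with an assumed keep ratio:

```python
# Channel-wise DCT truncation as a standalone operation: project features onto
# an orthonormal DCT basis, keep the lowest tau*C frequencies, and reconstruct
# with zero-padding. The keep ratio and shapes are illustrative assumptions.
import numpy as np
from scipy.fft import dct, idct

def dct_truncate_channels(x, tau=0.75):
    """x: (N, C) tokens; returns (reconstruction, truncated coefficients)."""
    C = x.shape[1]
    keep = int(round(tau * C))
    coeff = dct(x, axis=1, norm="ortho")      # decorrelate along channels
    coeff_trunc = coeff[:, :keep]             # drop high-frequency channels
    padded = np.zeros_like(coeff)
    padded[:, :keep] = coeff_trunc            # zero-pad before the inverse DCT
    return idct(padded, axis=1, norm="ortho"), coeff_trunc

x = np.random.default_rng(0).normal(size=(196, 64))
x_rec, z = dct_truncate_channels(x, tau=0.75)
print(z.shape, np.abs(x - x_rec).mean())      # (196, 48) and the reconstruction error
```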

4. Entropy Models and Rate–Distortion Training

Attention-based transform coding models employ advanced entropy models that leverage spatial and/or channel context:

  • RWKV-SCCTX (LALIC):

For each spatial position and channel slice, context is aggregated from previously decoded neighbors (via checkerboard masking and Bi-RWKV blocks), plus side information from a hyperprior, producing location-specific Gaussian parameters for arithmetic coding (Feng et al., 9 Feb 2025); a toy sketch of this checkerboard context appears after this list.

  • SCH and CWAM:

Auto-regressive channel slicing and window- or scale-aware attention drive conditional Gaussian priors, exploiting inter- and intra-channel dependencies (Xu et al., 21 Sep 2024, Mudgal et al., 28 Oct 2024).

  • QPressFormer:

Employs a factorized prior ($p_\theta$ is independent per latent) but uses a learned MLP for entropy modeling (Luka et al., 2023).
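Returning to the checkerboard-masked context described for RWKV-SCCTX above, the two-pass idea can be sketched with a toy context network; the mask convention, shapes, and fusion layer are assumptions, not the RWKV-SCCTX design.

```python
# Toy checkerboard context model: anchor positions are coded from the hyperprior
# alone; non-anchor positions additionally see the already-decoded anchors.
# The two-pass split and the tiny context net are assumptions for illustration.
import torch
import torch.nn as nn

def checkerboard_mask(h, w):
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return ((yy + xx) % 2 == 0).float()       # 1 = anchor, 0 = non-anchor

class ToyCheckerboardEntropy(nn.Module):
    def __init__(self, c=192):
        super().__init__()
        self.ctx = nn.Conv2d(c, 2 * c, 5, padding=2)   # context from anchors
        self.fuse = nn.Conv2d(4 * c, 2 * c, 1)         # fuse with hyperprior features

    def forward(self, y_hat, hyper):
        mask = checkerboard_mask(*y_hat.shape[-2:]).to(y_hat)
        ctx = self.ctx(y_hat * mask)                   # only decoded anchors visible
        mean, scale = self.fuse(torch.cat([ctx, hyper], dim=1)).chunk(2, dim=1)
        return mean, nn.functional.softplus(scale)     # Gaussian parameters

y_hat = torch.randn(1, 192, 16, 16)
hyper = torch.randn(1, 384, 16, 16)
mean, scale = ToyCheckerboardEntropy()(y_hat, hyper)
print(mean.shape, scale.shape)
```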

The optimization targets a Lagrangian rate–distortion objective, with perceptual metrics (LPIPS, MS-SSIM) frequently replacing MSE for higher-quality reconstructions at low bitrates.
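A sketch of such a Lagrangian objective with the usual Gaussian-conditional rate estimate is given below; the $\lambda$ value and the optional perceptual term are assumptions.

```python
# Rate-distortion Lagrangian with a Gaussian conditional rate estimate.
# The lambda value and the MSE/perceptual mix are illustrative assumptions.
import torch

def gaussian_rate_bits(y_hat, mean, scale, eps=1e-9):
    """Estimated bits for rounded latents under N(mean, scale^2), via the CDF
    difference over the quantization bin [y_hat - 0.5, y_hat + 0.5]."""
    dist = torch.distributions.Normal(mean, scale)
    p = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
    return (-torch.log2(p.clamp_min(eps))).sum()

def rd_loss(x, x_hat, y_hat, mean, scale, lam=0.01, perceptual=None):
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    bpp = gaussian_rate_bits(y_hat, mean, scale) / num_pixels
    distortion = torch.mean((x - x_hat) ** 2)
    if perceptual is not None:                # e.g. an LPIPS module, if available
        distortion = distortion + perceptual(x, x_hat).mean()
    return bpp + lam * distortion

x, x_hat = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
y_hat = torch.randn(1, 192, 16, 16)
print(rd_loss(x, x_hat, y_hat, torch.zeros_like(y_hat), torch.ones_like(y_hat)))
```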

5. Computational Complexity and Scaling

A central concern is the scaling of attention mechanisms:

  • Quadratic vs. Linear Complexity:

Standard self-attention has $\mathcal{O}(L^2 D)$ cost, prohibitive for high-resolution images (large $L$). Bi-RWKV attention yields $\mathcal{O}(LD)$ complexity, enabling a global receptive field at modest cost: Bi-RWKV+Shift requires $79LD$ FLOPs per block, while comparable windowed self-attention requires $128LD$ (Feng et al., 9 Feb 2025).

  • DCT-Based Compression:

DCT-truncated attention further reduces the $C^3$ terms in both parameters and compute (for $C$ channels, with keep ratio $\tau$):

$$(2\tau + 3\tau^2)\, N M^2 C^3 + 2\tau^2\, N M^4 C^2$$

This yields direct resource savings as a function of $\tau$ (Pan et al., 22 May 2024).
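Plugging a keep ratio into the expression above gives the relative block cost directly; a small helper, assuming $\tau = 1$ as the untruncated baseline and illustrative values of $N$, $M$, $C$:

```python
# Relative attention-block cost as a function of the DCT keep ratio tau, using
# the (2*tau + 3*tau^2)*N*M^2*C^3 + 2*tau^2*N*M^4*C^2 expression above.
# Treating tau = 1 as the untruncated baseline is an assumption of this sketch.
def attn_cost(tau, N, M, C):
    return (2 * tau + 3 * tau ** 2) * N * M ** 2 * C ** 3 + 2 * tau ** 2 * N * M ** 4 * C ** 2

N, M, C = 12, 14, 64          # illustrative block count, token grid size, channels
for tau in (1.0, 0.75, 0.5):
    print(f"tau={tau}: {attn_cost(tau, N, M, C) / attn_cost(1.0, N, M, C):.2f}x baseline")
```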

Empirical results validate that these techniques maintain state-of-the-art (SOTA) or superior performance for PSNR, LPIPS, and BD-rate metrics while exhibiting favorable latency and memory footprints.

6. Empirical Performance and Benchmark Comparisons

Attention-based transform coding architectures have established new benchmarks in learned image compression:

  • LALIC achieves BD-rate improvements vs. VTM-9.1 of –14.84% (Kodak), –15.20% (CLIC), –17.32% (Tecnick), with encoding/decoding latencies under 0.3s and GPU memory of 0.84 GB (Feng et al., 9 Feb 2025).
  • QPressFormer achieves a 20–30% relative LPIPS reduction over strong convolutional baselines and >2× FID improvement at 0.3bpp, albeit with lower PSNR, reflecting a perceptual coding tradeoff (Luka et al., 2023).
  • SCH outperforms VVC (VTM-23.1) by up to –24.71% in BD-rate on CLIC-Test and –18.54% on Kodak, and surpasses TCM and other hybrid models (Xu et al., 21 Sep 2024).
  • Cross-Window Attention (CWAM) yields about 5–15% bitrate savings over strong baselines, with ablations confirming the independent contributions of feature encoding and cross-scale attention (Mudgal et al., 28 Oct 2024).
  • DCT-based attention compression enables up to 18% parameter and 8% FLOP reduction on ImageNet/ViT, maintaining or even marginally improving classification accuracy (Pan et al., 22 May 2024).

7. Extensions, Open Problems, and Future Directions

While attention-based transform coding models now surpass classical codecs and most convolutional learned methods for still images, several directions remain for further research:

  • Scalability to Ultra-High Resolutions:

Linear or windowed attention and hybrid transform techniques are critical for manageable scaling to 8K resolutions and beyond; linear-complexity or sparsity-aware mechanisms will likely be required as datasets and application domains expand (Feng et al., 9 Feb 2025, Xu et al., 21 Sep 2024).

  • Frequency-Domain and Adaptivity:

Combining learned attention with DCT, wavelet, or other analytic transforms offers interpretable complexity gains and the potential for runtime adaptivity and hardware optimization (Pan et al., 22 May 2024, Xu et al., 21 Sep 2024).

  • Generalization to Video and Multimodal:

Extending attention-based coding to temporal and multimodal (audio/video/text) regimes will be driven by hierarchical attention, causal masking, and efficient entropy models, paralleling recent progress in video codecs and representation learning.

  • Semantic and Distributed/Embedded Coding:

Exploiting transformer attention for semantic communication over variable bandwidth (e.g., adaptive patch-wise coding driven by semantic attention masks) demonstrates promise for adaptive multimedia systems and edge-device deployments (Mortaheb et al., 2 Dec 2024).

The field continues to develop rapidly, with interdisciplinary advances at the intersection of neural compression, efficient attention, signal processing, and information theory.
