Multi-Path Entropy Modules (MEM++)
- Multi-Path Entropy Modules (MEM) are probabilistic models that leverage channel-wise, local spatial, and global spatial contexts to effectively compress image latents.
- MEM++ partitions the latent space into slices and applies specialized modules (channel, checkerboard, and global attention) to achieve linear computational and memory scaling.
- Empirical studies show that MEM++ yields a BD-rate reduction of up to 13.39% on the Kodak dataset, demonstrating state-of-the-art performance in high-resolution image coding.
Multi-Path Entropy Modules (MEM), together with the linear-complexity extension MEM++, represent a class of entropy models for learned image compression that jointly exploit channel-wise, local spatial, and global spatial correlations in the latent representations produced by neural image compression architectures. MEM++ provides a structured multi-context probabilistic framework, partitioning the latent space and incorporating specialized modules for each context, while ensuring linear computational and memory complexity with respect to image resolution. These characteristics make MEM++ well suited to high-resolution image coding with state-of-the-art compression performance (Jiang et al., 2023).
1. Probabilistic Structure and Training Objective
The MEM++ model factorizes the joint likelihood of the quantized latent $\hat{y}$ and quantized hyper-latent $\hat{z}$ as follows:

$$p(\hat{y}, \hat{z}) = p(\hat{z}) \prod_{i=1}^{S} p\left(\hat{y}^{i} \mid \hat{z}, \hat{y}^{<i}\right)$$

Here, $S$ denotes the number of channel-wise “slices” (typically 10), each comprising $M/S$ channels of the latent $\hat{y}$, where $M$ is the total channel count. The contexts are:
- $\Psi_{hyp}$: hyper-prior side information derived from $\hat{z}$
- $\Psi_{ch}^{i}$: channel-wise context from previously decoded slices $\hat{y}^{<i}$
- $\Psi_{lc}^{i}$: local spatial context from checkerboard attention
- $\Psi_{gl\text{-}intra}^{i}$, $\Psi_{gl\text{-}inter}^{i}$: intra- and inter-slice global contexts
The training objective is the expected rate-distortion loss:

$$\mathcal{L} = \mathbb{E}_{x}\left[-\log_2 p(\hat{y} \mid \hat{z}) - \log_2 p(\hat{z}) + \lambda \cdot D(x, \hat{x})\right]$$

with quantization replaced by additive uniform noise during training.
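As a concrete toy illustration of this objective, the sketch below implements the noise-based quantization surrogate and the rate term of a Gaussian convolved with uniform noise in NumPy. The function names, array shapes, and the use of MSE as the distortion $D$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def noisy_quantize(y):
    """Training-time surrogate for rounding: additive U(-1/2, 1/2) noise."""
    return y + rng.uniform(-0.5, 0.5, size=y.shape)

def _Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gaussian_rate_bits(y, mu, sigma):
    """Rate of y under N(mu, sigma^2) convolved with U(-1/2, 1/2):
    p(y) = Phi((y + 0.5 - mu)/sigma) - Phi((y - 0.5 - mu)/sigma)."""
    p = np.array([_Phi((yi + 0.5 - m) / s) - _Phi((yi - 0.5 - m) / s)
                  for yi, m, s in zip(y, mu, sigma)])
    return float(-np.log2(np.maximum(p, 1e-9)).sum())

def rd_loss(x, x_hat, y_tilde, mu, sigma, lam=0.01):
    """Rate-distortion objective R + lambda * D, with MSE as the toy distortion."""
    return gaussian_rate_bits(y_tilde, mu, sigma) + lam * float(np.mean((x - x_hat) ** 2))
```

A symbol at the mode of a unit-variance Gaussian costs about 1.38 bits under this model, which matches the intuition that narrow (low-$\sigma$) predictions are cheap to code.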
2. Channel-wise Context Path
MEM++ divides the latent $\hat{y}$ into contiguous channel-wise slices $\hat{y}^{1}, \dots, \hat{y}^{S}$. Each slice is processed sequentially. Channel-wise context is computed from all previously decoded slices using a sub-network built from three convolutions:

$$\Psi_{ch}^{i} = g_{ch}\left(\hat{y}^{<i}\right)$$

This, along with the hyper-prior and spatial contexts, informs the entropy-parameter network to output Gaussian parameters for probabilistic modeling:

$$\left(\mu^{i}, \sigma^{i}\right) = g_{ep}\left(\Psi_{hyp}, \Psi_{ch}^{i}, \Psi_{lc}^{i}, \Psi_{gl\text{-}intra}^{i}, \Psi_{gl\text{-}inter}^{i}\right)$$

The conditional pmf for slice $\hat{y}^{i}$ factorizes as:

$$p\left(\hat{y}^{i} \mid \hat{z}, \hat{y}^{<i}\right) = \prod_{j} \left[\mathcal{N}\left(\mu_{j}^{i}, \left(\sigma_{j}^{i}\right)^{2}\right) * \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)\right]\left(\hat{y}_{j}^{i}\right)$$
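The sequential slice-conditioning scheme can be sketched as follows. This is a minimal NumPy stand-in: the slice/channel counts are toy sizes, the "convolutions" are 1x1 channel mixes, and all weight shapes are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

S = 10                      # number of slices (typical value from the text)
C = 4                       # channels per slice (toy size, illustrative)
H = W = 8

rng = np.random.default_rng(0)
y_hat = rng.standard_normal((S, C, H, W))   # decoded slices, filled in order

def conv1x1(x, w):
    """Toy stand-in for a conv layer: a 1x1 convolution is a channel mix."""
    return np.einsum('oc,chw->ohw', w, x)

def channel_context(decoded, w1, w2, w3):
    """Three-stage sub-network over the concatenation of decoded slices."""
    x = np.concatenate(decoded, axis=0)     # stack previously decoded slices
    x = np.maximum(conv1x1(x, w1), 0.0)     # conv + ReLU
    x = np.maximum(conv1x1(x, w2), 0.0)
    return conv1x1(x, w3)                   # context features for slice i

# Decode slices sequentially; slice i conditions only on slices < i.
contexts = []
for i in range(1, S):
    cin = i * C
    w1 = rng.standard_normal((16, cin)) * 0.1
    w2 = rng.standard_normal((16, 16)) * 0.1
    w3 = rng.standard_normal((2 * C, 16)) * 0.1   # e.g. (mu, sigma) features
    contexts.append(channel_context(list(y_hat[:i]), w1, w2, w3))
```

Because the context network only ever reads previously decoded slices, the decoder can reproduce the same conditioning order and stay synchronized with the encoder.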
3. Local Spatial Context: Shifted-Window Checkerboard Attention
To efficiently capture local dependencies, MEM++ employs a checkerboard partition within each slice. Anchor (“ac”) positions are decoded first, followed by non-anchor (“na”) positions that attend locally to anchors using an overlapped, stride-1 window. The checkerboard attention module implements:
- Windowed attention using softmax over local neighborhoods, masked to enforce the checkerboard pattern
- Attention fusion via convolution aggregated across the image
- Final local context via a residual feed-forward network
The process ensures linear complexity: one window per spatial position, each with computational cost $O(k^{2})$ for a fixed $k \times k$ window, yields overall complexity $O(k^{2}HW) = O(HW)$ with constant $k$.
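A minimal sketch of the checkerboard partition and of which anchors a non-anchor position may attend to; the window size `k` and the helper name are illustrative assumptions, not the paper's code.

```python
import numpy as np

H = W = 6
ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
anchor = (ii + jj) % 2 == 0          # "ac" positions, decoded first
non_anchor = ~anchor                  # "na" positions, decoded second

def local_anchor_mask(i, j, k=3):
    """Anchors visible from position (i, j) inside a k x k window,
    i.e. the positions a non-anchor query may attend to."""
    visible = np.zeros((H, W), dtype=bool)
    i0, i1 = max(0, i - k // 2), min(H, i + k // 2 + 1)
    j0, j1 = max(0, j - k // 2), min(W, j + k // 2 + 1)
    visible[i0:i1, j0:j1] = True
    return visible & anchor
```

Each query sees at most $k^{2}$ keys, so summing over all $HW$ positions gives the $O(k^{2}HW)$ bound stated above.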
4. Global Spatial Context: Linear-Complexity Attention
Global correlations are modeled with two modules:
(a) Intra-slice Global Context
A softmax-decomposition trick reuses cross-attention computed on previous slices, exploiting near-invariant global correlation patterns. Instead of vanilla attention $\mathrm{softmax}\left(QK^{\top}/\sqrt{d}\right)V$ with $O\left((HW)^{2}\right)$ cost, the decomposition

$$\mathrm{softmax}_{row}(Q)\left(\mathrm{softmax}_{col}(K)^{\top} V\right)$$

reduces complexity to $O(HW)$. Refinement is accomplished via $1 \times 1$ convolution and a depth-wise residual bottleneck.
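The decomposition is easy to state in code. The sketch below contrasts the quadratic and linear forms on flattened $N = HW$ position vectors; the two are not numerically identical (the decomposition is an approximation of vanilla attention), and all shapes here are toy sizes.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention(Q, K, V):
    """Standard attention: the N x N score matrix costs O(N^2 d), N = H*W."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def linear_attention(Q, K, V):
    """Softmax decomposition: row-softmax on Q, position-softmax on K, then
    associate (softmax(K)^T V) first -- a d x d summary, O(N d^2) total."""
    return softmax(Q, axis=-1) @ (softmax(K, axis=0).T @ V)

rng = np.random.default_rng(0)
N, d = 64, 8                       # toy sizes: N spatial positions, d channels
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
```

The key design choice is reassociating the matrix product: $(QK^{\top})V$ materializes an $N \times N$ matrix, while $Q(K^{\top}V)$ only ever builds a $d \times d$ one.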
(b) Inter-slice Global Context
The same linear-factorization strategy generalizes to attention over all previously decoded slices, yielding the inter-slice global context $\Psi_{gl\text{-}inter}^{i}$.
5. Computational and Memory Complexity
With the number of slices $S$, the window size $k$, and the channel count $M$ fixed:
- Channel-wise, local checkerboard, intra-slice, and inter-slice global contexts each contribute $O(HW)$ overhead.
- GPU memory requirements scale linearly with image resolution $H \times W$.
- Empirical measurements demonstrate that at high resolution, total peak GPU memory for MEM++ is approximately $5.6$ GB, whereas quadratic-attention predecessors can require more than $22$ GB.
The following table summarizes per-path computational scaling:
| Path | Complexity per position | Total complexity |
|---|---|---|
| Channel-wise context | $O(1)$ | $O(HW)$ |
| Local checkerboard attn | $O(k^{2})$ | $O(HW)$ |
| Intra/inter global attn | $O(1)$ (amortized) | $O(HW)$ |
All paths remain linear in the number of spatial positions $HW$.
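The linear-versus-quadratic scaling can be checked with a back-of-the-envelope FLOP count; the formulas below keep only the dominant terms and drop constants, so they are order-of-magnitude estimates, not measured costs.

```python
def vanilla_attn_flops(H, W, d):
    """Quadratic attention: the (H*W) x (H*W) score matrix dominates."""
    n = H * W
    return 2 * n * n * d         # Q @ K^T plus scores @ V, up to constants

def linear_attn_flops(H, W, d):
    """Decomposed attention: K^T V is d x d, then Q @ (K^T V)."""
    n = H * W
    return 2 * n * d * d

# Doubling the resolution quadruples n = H*W:
# the linear path grows 4x, the quadratic path 16x.
r_vanilla = vanilla_attn_flops(512, 512, 64) / vanilla_attn_flops(256, 256, 64)
r_linear = linear_attn_flops(512, 512, 64) / linear_attn_flops(256, 256, 64)
```

This arithmetic is what makes the memory gap cited above (roughly $5.6$ GB versus $22$ GB) plausible at high resolution.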
6. Empirical Results and Performance Gains
MEM++ achieves state-of-the-art rate–distortion performance in learned image compression. On the Kodak dataset, MLIC++ (leveraging MEM++) reduces BD-rate by 13.39% in PSNR compared to VTM-17.0 Intra, with further gains in MS-SSIM, outperforming prior methods such as ELIC and LIC-TCM. Consistent PSNR improvements over Cheng’20 are observed across datasets (Kodak, Tecnick, CLIC Pro).
Ablation studies on Kodak indicate stepwise benefits from each context path: the channel-wise context, local checkerboard attention, intra-slice global context, and inter-slice global context each contribute an additional BD-rate reduction, and the aggregate of all four yields the total gain. Encoding and decoding latencies scale smoothly with image resolution, substantiating practical suitability for high-resolution scenarios (Jiang et al., 2023).