Multi-Path Entropy Modules (MEM++)
- Multi-Path Entropy Modules (MEM) are probabilistic models that leverage channel-wise, local spatial, and global spatial contexts to effectively compress image latents.
- MEM++ partitions the latent space into slices and applies specialized modules (channel, checkerboard, and global attention) to achieve linear computational and memory scaling.
- Empirical studies show that MEM++ yields a BD-rate reduction of up to 13.39% on the Kodak dataset, demonstrating state-of-the-art performance in high-resolution image coding.
Multi-Path Entropy Modules (MEM), together with the linear-complexity extension MEM++, represent a class of entropy models for learned image compression that jointly exploit channel-wise, local spatial, and global spatial correlations in the latent representations produced by neural image compression architectures. MEM++ provides a structured multi-context probabilistic framework, partitioning the latent space and incorporating specialized modules for each context, while ensuring linear computational and memory complexity with respect to image resolution. These characteristics make MEM++ well suited to high-resolution image coding with state-of-the-art compression performance (Jiang et al., 2023).
1. Probabilistic Structure and Training Objective
The MEM++ model factorizes the joint likelihood of the quantized latent $\hat{y}$ and quantized hyper-latent $\hat{z}$ as follows:

$$p(\hat{y}, \hat{z}) = p(\hat{z}) \prod_{i=1}^{S} p\left(\hat{y}^{i} \mid \hat{z}, \hat{y}^{<i}\right)$$

Here, $S$ denotes the number of channel-wise “slices” (typically 10), each comprising $M/S$ channels of the latent $\hat{y}$, where $M$ is the total channel count. The contexts are:
- $\Psi_{hyp}$: hyper-prior side information derived from $\hat{z}$
- $\Psi_{ch}^{i}$: channel-wise context from previously decoded slices $\hat{y}^{<i}$
- $\Psi_{lc}^{i}$: local spatial context from checkerboard attention
- $\Psi_{gl\text{-}intra}^{i}$, $\Psi_{gl\text{-}inter}^{i}$: intra- and inter-slice global contexts
The training objective is the expected rate-distortion loss:

$$\mathcal{L} = \mathbb{E}_{x}\left[-\log_2 p(\hat{y} \mid \hat{z}) - \log_2 p(\hat{z}) + \lambda \cdot D(x, \hat{x})\right]$$

with quantization replaced by additive uniform noise during training.
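As a concrete toy illustration of this objective, the sketch below implements the noise-based quantization surrogate and the rate term of a Gaussian convolved with uniform noise in NumPy. The function names, array shapes, and the use of MSE as the distortion $D$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def noisy_quantize(y):
    """Training-time surrogate for rounding: additive U(-1/2, 1/2) noise."""
    return y + rng.uniform(-0.5, 0.5, size=y.shape)

def _Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gaussian_rate_bits(y, mu, sigma):
    """Rate of y under N(mu, sigma^2) convolved with U(-1/2, 1/2):
    p(y) = Phi((y + 0.5 - mu)/sigma) - Phi((y - 0.5 - mu)/sigma)."""
    p = np.array([_Phi((yi + 0.5 - m) / s) - _Phi((yi - 0.5 - m) / s)
                  for yi, m, s in zip(y, mu, sigma)])
    return float(-np.log2(np.maximum(p, 1e-9)).sum())

def rd_loss(x, x_hat, y_tilde, mu, sigma, lam=0.01):
    """Rate-distortion objective R + lambda * D, with MSE as the toy distortion."""
    return gaussian_rate_bits(y_tilde, mu, sigma) + lam * float(np.mean((x - x_hat) ** 2))
```

A symbol at the mode of a unit-variance Gaussian costs about 1.38 bits under this model, which matches the intuition that narrow (low-$\sigma$) predictions are cheap to code.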
2. Channel-wise Context Path
MEM++ divides the latent $\hat{y}$ into contiguous channel-wise slices $\hat{y}^{1}, \dots, \hat{y}^{S}$. Each slice is processed sequentially. Channel-wise context is computed from all previously decoded slices using a sub-network built from three convolutions:

$$\Psi_{ch}^{i} = g_{ch}\left(\hat{y}^{<i}\right)$$

This, along with the hyper-prior and spatial contexts, informs the entropy-parameter network to output Gaussian parameters for probabilistic modeling:

$$\left(\mu^{i}, \sigma^{i}\right) = g_{ep}\left(\Psi_{hyp}, \Psi_{ch}^{i}, \Psi_{lc}^{i}, \Psi_{gl\text{-}intra}^{i}, \Psi_{gl\text{-}inter}^{i}\right)$$

The conditional pmf for slice $\hat{y}^{i}$ factorizes as:

$$p\left(\hat{y}^{i} \mid \hat{z}, \hat{y}^{<i}\right) = \prod_{j} \left[\mathcal{N}\left(\mu_{j}^{i}, \left(\sigma_{j}^{i}\right)^{2}\right) * \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)\right]\left(\hat{y}_{j}^{i}\right)$$
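The sequential slice-conditioning scheme can be sketched as follows. This is a minimal NumPy stand-in: the slice/channel counts are toy sizes, the "convolutions" are 1x1 channel mixes, and all weight shapes are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

S = 10                      # number of slices (typical value from the text)
C = 4                       # channels per slice (toy size, illustrative)
H = W = 8

rng = np.random.default_rng(0)
y_hat = rng.standard_normal((S, C, H, W))   # decoded slices, filled in order

def conv1x1(x, w):
    """Toy stand-in for a conv layer: a 1x1 convolution is a channel mix."""
    return np.einsum('oc,chw->ohw', w, x)

def channel_context(decoded, w1, w2, w3):
    """Three-stage sub-network over the concatenation of decoded slices."""
    x = np.concatenate(decoded, axis=0)     # stack previously decoded slices
    x = np.maximum(conv1x1(x, w1), 0.0)     # conv + ReLU
    x = np.maximum(conv1x1(x, w2), 0.0)
    return conv1x1(x, w3)                   # context features for slice i

# Decode slices sequentially; slice i conditions only on slices < i.
contexts = []
for i in range(1, S):
    cin = i * C
    w1 = rng.standard_normal((16, cin)) * 0.1
    w2 = rng.standard_normal((16, 16)) * 0.1
    w3 = rng.standard_normal((2 * C, 16)) * 0.1   # e.g. (mu, sigma) features
    contexts.append(channel_context(list(y_hat[:i]), w1, w2, w3))
```

Because the context network only ever reads previously decoded slices, the decoder can reproduce the same conditioning order and stay synchronized with the encoder.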
3. Local Spatial Context: Shifted-Window Checkerboard Attention
To efficiently capture local dependencies, MEM++ employs a checkerboard partition within each slice. Anchor (“ac”) positions are decoded first, followed by non-anchor (“na”) positions that attend locally to anchors using an overlapped, stride-1 window. The checkerboard attention module implements:
- Windowed attention using softmax over local neighborhoods, masked to enforce the checkerboard pattern
- Attention fusion via convolution aggregated across the image
- Final local context via a residual feed-forward network
The process ensures linear complexity: one window per spatial position, each with computational cost $O(k^{2})$ for a fixed $k \times k$ window, yields overall complexity $O(k^{2}HW) = O(HW)$ with constant $k$.
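A minimal sketch of the checkerboard partition and of which anchors a non-anchor position may attend to; the window size `k` and the helper name are illustrative assumptions, not the paper's code.

```python
import numpy as np

H = W = 6
ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
anchor = (ii + jj) % 2 == 0          # "ac" positions, decoded first
non_anchor = ~anchor                  # "na" positions, decoded second

def local_anchor_mask(i, j, k=3):
    """Anchors visible from position (i, j) inside a k x k window,
    i.e. the positions a non-anchor query may attend to."""
    visible = np.zeros((H, W), dtype=bool)
    i0, i1 = max(0, i - k // 2), min(H, i + k // 2 + 1)
    j0, j1 = max(0, j - k // 2), min(W, j + k // 2 + 1)
    visible[i0:i1, j0:j1] = True
    return visible & anchor
```

Each query sees at most $k^{2}$ keys, so summing over all $HW$ positions gives the $O(k^{2}HW)$ bound stated above.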
4. Global Spatial Context: Linear-Complexity Attention
Global correlations are modeled with two modules:
(a) Intra-slice Global Context
A softmax-decomposition trick reuses cross-attention computed on previous slices, exploiting near-invariant global correlation patterns. Instead of vanilla attention $\mathrm{softmax}\left(QK^{\top}/\sqrt{d}\right)V$ with $O\left((HW)^{2}\right)$ cost, the decomposition

$$\mathrm{softmax}_{row}(Q)\left(\mathrm{softmax}_{col}(K)^{\top} V\right)$$

reduces complexity to $O(HW)$. Refinement is accomplished via $1 \times 1$ convolution and a depth-wise residual bottleneck.
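The decomposition is easy to state in code. The sketch below contrasts the quadratic and linear forms on flattened $N = HW$ position vectors; the two are not numerically identical (the decomposition is an approximation of vanilla attention), and all shapes here are toy sizes.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention(Q, K, V):
    """Standard attention: the N x N score matrix costs O(N^2 d), N = H*W."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def linear_attention(Q, K, V):
    """Softmax decomposition: row-softmax on Q, position-softmax on K, then
    associate (softmax(K)^T V) first -- a d x d summary, O(N d^2) total."""
    return softmax(Q, axis=-1) @ (softmax(K, axis=0).T @ V)

rng = np.random.default_rng(0)
N, d = 64, 8                       # toy sizes: N spatial positions, d channels
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
```

The key design choice is reassociating the matrix product: $(QK^{\top})V$ materializes an $N \times N$ matrix, while $Q(K^{\top}V)$ only ever builds a $d \times d$ one.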
(b) Inter-slice Global Context
The same linear-factorization strategy generalizes to attention over all previously decoded slices, yielding the inter-slice global context $\Psi_{gl\text{-}inter}^{i}$.
5. Computational and Memory Complexity
With the number of slices $S$, the window size $k$, and the channel count $M$ fixed:
- Channel-wise, local checkerboard, intra-slice, and inter-slice global contexts each contribute $O(HW)$ overhead.
- GPU memory requirements scale linearly with image resolution $H \times W$.
- Empirical measurements demonstrate that at high resolution, total peak GPU memory for MEM++ is approximately $5.6$ GB, whereas quadratic-attention predecessors can require more than $22$ GB.
The following table summarizes per-path computational scaling:
| Path | Complexity per position | Total complexity |
|---|---|---|
| Channel-wise context | $O(1)$ | $O(HW)$ |
| Local checkerboard attn | $O(k^{2})$ | $O(HW)$ |
| Intra/inter global attn | $O(1)$ (amortized) | $O(HW)$ |
All paths remain linear in the number of spatial positions $HW$.
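The linear-versus-quadratic scaling can be checked with a back-of-the-envelope FLOP count; the formulas below keep only the dominant terms and drop constants, so they are order-of-magnitude estimates, not measured costs.

```python
def vanilla_attn_flops(H, W, d):
    """Quadratic attention: the (H*W) x (H*W) score matrix dominates."""
    n = H * W
    return 2 * n * n * d         # Q @ K^T plus scores @ V, up to constants

def linear_attn_flops(H, W, d):
    """Decomposed attention: K^T V is d x d, then Q @ (K^T V)."""
    n = H * W
    return 2 * n * d * d

# Doubling the resolution quadruples n = H*W:
# the linear path grows 4x, the quadratic path 16x.
r_vanilla = vanilla_attn_flops(512, 512, 64) / vanilla_attn_flops(256, 256, 64)
r_linear = linear_attn_flops(512, 512, 64) / linear_attn_flops(256, 256, 64)
```

This arithmetic is what makes the memory gap cited above (roughly $5.6$ GB versus $22$ GB) plausible at high resolution.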
6. Empirical Results and Performance Gains
MEM++ achieves state-of-the-art rate–distortion performance in learned image compression. On the Kodak dataset, MLIC++ (leveraging MEM++) reduces BD-rate by 13.39% in PSNR compared to VTM-17.0 Intra, with further gains in MS-SSIM, outperforming prior methods such as ELIC and LIC-TCM. Consistent PSNR improvements over Cheng’20 are observed across datasets (Kodak, Tecnick, CLIC Pro).
Ablation studies on Kodak indicate stepwise benefits from each context path: the channel-wise context, local checkerboard attention, intra-slice global context, and inter-slice global context each contribute an additional BD-rate reduction, and the aggregate of all four yields the total gain. Encoding and decoding latencies scale smoothly with image resolution, substantiating practical suitability for high-resolution scenarios (Jiang et al., 2023).