Multi-Path Entropy Modules (MEM++)

Updated 6 February 2026
  • Multi-Path Entropy Modules (MEM) are probabilistic models that leverage channel-wise, local spatial, and global spatial contexts to effectively compress image latents.
  • MEM++ partitions the latent space into slices and applies specialized modules (channel, checkerboard, and global attention) to achieve linear computational and memory scaling.
  • Empirical studies show that MEM++ achieves up to a $-13.39\%$ BD-rate (PSNR) relative to VTM-17.0 Intra on the Kodak dataset, demonstrating state-of-the-art performance in high-resolution image coding.

Multi-Path Entropy Modules (MEM), with a focus on the linear-complexity extension MEM++, represent a class of entropy models for learned image compression that jointly exploit channel-wise, local spatial, and global spatial correlations in the latent representations generated by neural image compression architectures. MEM++ provides a structured multi-context probabilistic framework, partitioning the latent space and incorporating specialized modules for each context, while ensuring linear computational and memory complexity with respect to image resolution. These characteristics make MEM++ well suited to high-resolution image coding with state-of-the-art compression performance (Jiang et al., 2023).

1. Probabilistic Structure and Training Objective

The MEM++ model factorizes the joint likelihood $p(\hat y, \hat z)$ of the quantized latent $\hat y$ and quantized hyper-latent $\hat z$ as follows:

$$p(\hat{y}, \hat{z}) = p(\hat{z}) \cdot \prod_{i=1}^{L} p\bigl(\hat{y}^i_{\mathrm{ac}} \mid \Phi_h,\, \Phi_{ch}^i,\, \Phi_{gc,\mathrm{inter}}^i\bigr) \cdot p\bigl(\hat{y}^i_{\mathrm{na}} \mid \Phi_h,\, \Phi_{ch}^i,\, \Phi_{lc}^i,\, \Phi_{gc,\mathrm{intra}}^i,\, \Phi_{gc,\mathrm{inter}}^i\bigr)$$

Here, $L$ denotes the number of channel-wise "slices" (typically 10), each comprising $S$ channels of the latent $y \in \mathbb{R}^{H \times W \times M}$. The contexts are:

  • $\Phi_h$: hyper-prior side information from $\hat z$
  • $\Phi_{ch}^i$: channel-wise context from previously decoded slices $\hat y^{<i}$
  • $\Phi_{lc}^i$: local spatial context from checkerboard attention
  • $\Phi_{gc,\mathrm{intra}}^i$, $\Phi_{gc,\mathrm{inter}}^i$: intra- and inter-slice global contexts

The training objective is the expected rate-distortion loss:

$$\mathcal{L} = \mathbb{E}_{x,u} \bigg[ -\log p(\hat z) - \sum_{i=1}^L \log p(y^i_{\mathrm{ac}} + u \mid \cdot) - \sum_{i=1}^L \log p(y^i_{\mathrm{na}} + u \mid \cdot) \bigg] + \lambda\, D(x, \hat x)$$

with quantization replaced by additive uniform noise $u \sim \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$ during training.
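The noise-relaxed rate term above can be made concrete with a short sketch. The snippet below (a NumPy-only illustration, not the paper's code; the function names are invented here) adds $u \sim \mathcal{U}(-\tfrac12,\tfrac12)$ to the latent and evaluates the Gaussian-convolved-uniform likelihood $\Phi\!\left(\frac{y+u-\mu+0.5}{\sigma}\right) - \Phi\!\left(\frac{y+u-\mu-0.5}{\sigma}\right)$ per element, summing the negative log to get an estimated bit count:

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

_cdf = np.vectorize(gaussian_cdf)

def rate_bits(y, mu, sigma, rng):
    """Estimated bits of the noise-relaxed latent under N(mu, sigma^2) * U(-1/2, 1/2)."""
    u = rng.uniform(-0.5, 0.5, size=y.shape)          # additive uniform noise
    y_noisy = y + u
    upper = _cdf((y_noisy - mu + 0.5) / sigma)
    lower = _cdf((y_noisy - mu - 0.5) / sigma)
    likelihood = np.clip(upper - lower, 1e-9, 1.0)    # guard against log(0)
    return -np.log2(likelihood).sum()

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=(4, 4))
bits = rate_bits(y, mu=np.zeros_like(y), sigma=np.ones_like(y), rng=rng)
```

In a full training loop this rate estimate would be combined with a distortion term $\lambda D(x, \hat x)$ and backpropagated through the entropy-parameter networks.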

2. Channel-wise Context Path

MEM++ divides the latent $y$ into $L$ contiguous channel-wise slices $y^1, \dots, y^L$. Each slice $i$ is processed sequentially. The channel-wise context $\Phi_{ch}^i$ is computed from all previously decoded slices $\hat y^{<i}$ using a sub-network $g_{ch}$ built from three $3 \times 3$ convolutions. This, along with the hyper-prior and spatial contexts, informs the entropy-parameter network $g_{ep}$, which outputs Gaussian parameters $[\mu^i, \sigma^i]$ for probabilistic modeling:

$$[\mu^i, \sigma^i] = g_{ep}(\Phi_h, \Phi_{ch}^i, \text{other contexts})$$

The conditional pmf for slice $i$ factorizes as:

$$p(\hat y^i \mid \hat y^{<i}, \ldots) = \prod_{p=1}^{HWS} \mathcal{N}\bigl(\hat y^i_p;\; \mu^i_p, (\sigma^i_p)^2\bigr) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$$
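The sequential slice-by-slice dependency can be sketched as follows. This is a hypothetical illustration of the decoding order only: `channelwise_pass` and its trivial context function (a mean over previously decoded slices) are stand-ins invented here, not the paper's $3 \times 3$-convolution network $g_{ch}$:

```python
import numpy as np

def channelwise_pass(y, num_slices):
    """Quantize y slice by slice, conditioning each slice's mean on prior slices."""
    slices = np.array_split(y, num_slices, axis=-1)   # L contiguous channel slices
    decoded = []
    for i, y_i in enumerate(slices):
        if decoded:
            # channel-wise context Phi_ch^i from previously decoded slices
            ctx = np.concatenate(decoded, axis=-1)
            mu_i = ctx.mean(axis=-1, keepdims=True) * np.ones_like(y_i)
        else:
            mu_i = np.zeros_like(y_i)                 # first slice: no prior slices
        y_hat_i = mu_i + np.round(y_i - mu_i)         # mean-shifted rounding
        decoded.append(y_hat_i)
    return np.concatenate(decoded, axis=-1)

rng = np.random.default_rng(1)
y = rng.normal(size=(8, 8, 320))                      # H x W x M latent
y_hat = channelwise_pass(y, num_slices=10)            # L = 10 slices of S = 32 channels
```

Mean-shifted rounding keeps each quantized element within half a step of the original, which is why the quantization error is bounded by $\tfrac12$ per element.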

3. Local Spatial Context: Shifted-Window Checkerboard Attention

To efficiently capture local dependencies, MEM++ employs a checkerboard partition within each slice. Anchor ("ac") positions are decoded first, followed by non-anchor ("na") positions that attend locally to anchors using an overlapped, stride-1, $K \times K$ window. The checkerboard attention module implements:

  • Windowed attention using softmax over local $K^2 \times K^2$ neighborhoods, masked to enforce the checkerboard pattern
  • Attention fusion via a $K \times K$ convolution aggregated across the image
  • The final local context $\Phi_{lc}^i$ via a residual feed-forward network

The process ensures linear complexity: placing one window per spatial position, i.e. $L = H \cdot W$ windows in total, each with computational cost $O(K^4 S)$, yields $O(L)$ overall complexity for constant $K$ and $S$.
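The anchor/non-anchor split above can be sketched as a parity mask. This is a minimal illustration of the checkerboard partition (the helper name is invented here): anchors sit on one parity of row + column, so every non-anchor position has all of its in-bounds 4-neighbors in the anchor set and can condition on them:

```python
import numpy as np

def checkerboard_mask(h, w):
    """True at anchor positions ((row + col) even), False at non-anchor positions."""
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    return (rows + cols) % 2 == 0

mask = checkerboard_mask(6, 6)
anchors = int(mask.sum())        # half the positions (18 of 36 on a 6x6 grid)
```

In a two-pass decode, the anchor half is entropy-decoded first; the non-anchor half then attends to decoded anchors inside its local $K \times K$ window.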

4. Global Spatial Context: Linear-Complexity Attention

Global correlations are modeled with two modules:

(a) Intra-slice Global Context

A softmax-decomposition trick reuses cross-attention computed on previous slices, exploiting near-invariant global correlation patterns. Instead of $O(L^2)$ vanilla attention, the decomposition

$$A_{\mathrm{lin}} = \mathrm{softmax}_2(Q^{i-1}_{na}) \bigl[ \mathrm{softmax}_1(K^{i-1}_{ac}) \bigr]^\top V^i_{ac}$$

reduces complexity to $O(L)$. Refinement is accomplished via a $K \times K$ convolution and a depth-wise bottleneck:

$$\Phi_{gc,\mathrm{intra}}^i = \mathrm{DepthRB}\bigl(\mathrm{conv}_{K \times K}(A_{\mathrm{lin}})\bigr)$$
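The key to the linear cost is reassociating the matrix product. The sketch below (illustrative NumPy, not the paper's implementation) normalizes $Q$ over the feature axis and $K$ over the position axis, then computes $K^\top V$ first, an all-positions aggregate of size $d \times d_v$, so the overall cost is $O(L\,d\,d_v)$ rather than $O(L^2)$:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention(q, k, v):
    """q, k: (L, d); v: (L, dv). Returns (L, dv) in O(L * d * dv) time."""
    q_n = softmax(q, axis=1)          # softmax_2: over the feature axis
    k_n = softmax(k, axis=0)          # softmax_1: over the position axis
    kv = k_n.T @ v                    # (d, dv): aggregate all positions first
    return q_n @ kv                   # (L, dv): no L x L attention map ever formed

rng = np.random.default_rng(2)
L, d = 64, 8
q = rng.normal(size=(L, d))
k = rng.normal(size=(L, d))
v = rng.normal(size=(L, d))
out = linear_attention(q, k, v)
```

Because matrix multiplication is associative, the result matches the left-to-right order $(\mathrm{softmax}_2(Q)\,\mathrm{softmax}_1(K)^\top)V$, which would materialize the full $L \times L$ map; only the evaluation order, and hence the complexity, differs.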

(b) Inter-slice Global Context

The same linear-factorization strategy generalizes to attention over all previously decoded slices, yielding the inter-slice global context $\Phi_{gc,\mathrm{inter}}^i$.

5. Computational and Memory Complexity

With $L = H \cdot W$, $K = 5$, and $S = 32$ fixed:

  • The channel-wise, local checkerboard, intra-slice, and inter-slice global contexts each contribute $O(L)$ overhead.
  • GPU memory requirements scale linearly with image resolution $H \times W$.
  • Empirical measurements show that at $2048 \times 2048$ resolution, total peak RAM for MEM++ is approximately 5.6 GB, whereas quadratic-attention predecessors can require more than 22 GB.
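The gap between quadratic and linear attention memory is easy to see with back-of-envelope arithmetic. The figures below are illustrative assumptions, not measurements from the paper: a 16x downsampled latent grid (common in learned codecs) and 4-byte floats; only the $L \times L$ score map of vanilla attention versus a single $O(L)$ activation is compared:

```python
def attn_scores_bytes(h, w, bytes_per_elem=4):
    """Bytes to store one full L x L attention score map over L = h*w positions."""
    L = h * w
    return L * L * bytes_per_elem

# Assumed: a 2048x2048 image downsampled 16x per side -> 128x128 latent grid.
quadratic = attn_scores_bytes(128, 128)   # one full attention map
linear = 128 * 128 * 4                    # one O(L)-sized activation
ratio = quadratic / linear                # = L, the number of positions
```

A single full score map at this latent resolution already needs about 1 GB, and the ratio between the two grows with $L$ itself, which is why quadratic-attention models dominate peak memory at high resolution.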

The following table summarizes per-path computational scaling:

| Path | Complexity per position | Total complexity |
|------|-------------------------|------------------|
| Channel-wise context | $O(1)$ | $O(L)$ |
| Local checkerboard attention | $O(K^4 S)$ | $O(L)$ |
| Intra-/inter-slice global attention | $O(S)$ | $O(L)$ |

All paths remain linear in the number of spatial positions $L$.

6. Empirical Results and Performance Gains

MEM++ achieves state-of-the-art rate–distortion performance in learned image compression. On the Kodak dataset, MLIC++ (which uses MEM++) attains a $-13.39\%$ BD-rate in PSNR relative to VTM-17.0 Intra and $-53.63\%$ in MS-SSIM, outperforming prior methods such as ELIC ($-5.95\%$) and LIC-TCM ($-10.14\%$). PSNR improvements of up to $+0.8$ dB over Cheng'20 are observed across datasets (Kodak, Tecnick, CLIC Pro).

Ablation studies on Kodak indicate stepwise benefits from each context path:

  • Channel-wise context: $-4.89\%$ BD-rate
  • Local checkerboard attention: a further $-2.32\%$
  • Intra-slice global context: $-4.37\%$
  • Inter-slice global context: $-2.79\%$

The aggregate of all four yields the total $-13.39\%$ gain. Encoding and decoding latencies scale smoothly with image resolution, substantiating practical suitability for high-resolution scenarios (Jiang et al., 2023).
