Diffusion-Based Image Compression
- Diffusion-based image compression is a class of generative codecs that leverage iterative denoising and latent transforms to reconstruct high-quality images from compact representations.
- These systems integrate analysis transforms, quantization, and entropy coding to balance ultra-low bitrate with strong perceptual realism.
- Innovations like single-step decoders and semantic guidance techniques significantly reduce inference latency while maintaining fidelity.
A diffusion-based image compression system is a class of learned generative codecs that exploit denoising diffusion probabilistic models (DDPMs) or their latent variants to achieve extremely low bitrate coding with strong perceptual realism and/or fidelity. These systems integrate analysis transforms, quantization, entropy coding, and (typically conditional) diffusion-based decoding, enabling efficient mapping from raw images to highly compact bitstreams and accurate, visually plausible, or even semantic reconstructions from these representations.
1. Core Principles and Generative Foundations
Diffusion-based image compression (DBIC) relies on the ability of diffusion models to learn complex image priors via iterative denoising from a tractable noise distribution, typically Gaussian. Compression systems in this class are built by (i) transforming an image into a latent space using analysis transforms (e.g., learned encoders, VAEs, or feature extractors), (ii) quantizing and entropy-encoding these latents to obtain a bitstream, and (iii) reconstructing either the latent or image domain at the decoder via a (conditional) diffusion model, often enhanced with further processors or guidance mechanisms (Li et al., 3 Oct 2024, Guo et al., 8 Oct 2024, Park et al., 19 Jun 2025, Zhang et al., 27 Jun 2025).
Formally, in canonical DDPM or latent diffusion setups, the forward (encoding/noising) process for a latent $z_0$ is:

$$q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t \mathbf{I}\right),$$

where $t$ indexes the diffusion timestep and $\beta_t$ is a pre-defined variance schedule. The reverse (decoder) process approximates the true posterior with parameterized denoising U-Nets, possibly conditioned on side information.
The key insight is that lossy quantization-induced errors can be interpreted as particular instances of diffusion noise, allowing the decoder to leverage a powerful generative prior to "denoise" or inpaint realistic image content from reduced, information-constrained signals (Relic et al., 12 Apr 2024, Relic et al., 3 Apr 2025).
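The forward process above admits a closed-form marginal, $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$, which is how diffusion training and compression-oriented noising are implemented in practice. Below is a minimal numpy sketch of this marginal sampler; the linear schedule values are illustrative, not taken from any cited paper.

```python
import numpy as np

def make_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_T (illustrative values)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(z0, t, betas, rng):
    """Sample z_t ~ q(z_t | z_0) in closed form:
    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
```

At large $t$, $\bar{\alpha}_t$ approaches zero and the latent is almost pure Gaussian noise, which is the tractable starting distribution the reverse (decoding) process inverts.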
2. Compressed Feature Design and Quantization Strategies
Compression in diffusion-based codecs depends on the design and quantization of the analysis transform's output. Mainstream approaches include:
- Latent variable modeling: Learn an encoder and analysis transform producing continuous latents, which are quantized and entropy-coded via hyperpriors or vector quantization into a discrete bitstream at a target bitrate (Li et al., 3 Oct 2024, Zhang et al., 27 Jun 2025, Xue et al., 22 May 2025).
- Palette or residual representations: Use clustering (e.g., K-means; Markov palette diffusion) in a semantic or texture-feature space to yield compact, content-adaptive codes, with lossy downsampling of the palette size for rate control (Guo et al., 8 Oct 2024).
- Universal quantization alignment: Explicitly match quantization noise statistics to the diffusion trajectory by random-dithered quantization, so that the quantized latents live on the data manifold expected by the pretrained denoiser, mitigating noise mismatch and discretization artifacts (Relic et al., 3 Apr 2025).
- Side-stream and hybrid codes: Supplement the main compressed latent with auxiliary semantic, textual, or structured codes (e.g., OCR-extracted screen content, VQGAN token indices, or hyperprior signals) for region-prioritized or semantic-aware restoration (Xu et al., 9 May 2025, Xue et al., 22 May 2025, Ke et al., 13 May 2025).
Quantization-induced distortion is modeled either as additive noise to be removed via diffusion, or as an explicit corruption path, and precise scheduling of quantization levels and diffusion steps is critical for artifact-free reconstruction.
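The universal (dithered) quantization idea can be sketched in a few lines of numpy: a shared dither signal makes the quantization residual uniform noise that is independent of the latent, so it can be aligned with a diffusion noise level. This is a sketch of the alignment principle only, not any paper's exact implementation; `delta` is a hypothetical step size.

```python
import numpy as np

def universal_quantize(y, rng, delta=1.0):
    """Dithered scalar quantization with a dither u ~ U(-delta/2, delta/2)
    shared between encoder and decoder. The residual y_hat - y is uniform
    noise independent of y, which a diffusion decoder can treat as a known
    corruption level to denoise."""
    u = rng.uniform(-delta / 2, delta / 2, size=y.shape)
    y_hat = delta * np.round((y + u) / delta) - u
    return y_hat
```

Because the residual is bounded by `delta / 2` and distributed uniformly regardless of the signal, its statistics can be matched to a point on the diffusion noise schedule, mitigating the noise-mismatch artifacts described above.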
3. Decoding: Diffusion-Based Generative Reconstruction
The central innovation of DBIC systems is using a conditional diffusion process for the decoding or "reconstruction" stage:
- Multi-step reverse diffusion: Classically, DBICs start from noise (or a randomly initialized latent) and iteratively apply the denoising model, conditioned on compressed latents or side signals, to reconstruct the image. This can require 20–1000 steps, incurring high latency (Relic et al., 12 Apr 2024, Li et al., 3 Oct 2024, Guo et al., 8 Oct 2024).
- Compressed feature initialization: Methods such as RDEIC (Li et al., 3 Oct 2024) start the reverse chain not from pure noise but from the compressed latent itself with a small added noise, cutting the required steps by up to 95% with no quality loss.
- Single-step diffusion: Recent methods (e.g., SODEC, StableCodec, DiffO, OneDC) demonstrate that a sufficiently informative latent allows using a distilled one-step denoiser for direct reconstruction, combining a fast forward pass with hybrid loss functions and semantic or fidelity guidance to maintain quality (Park et al., 19 Jun 2025, Zhang et al., 27 Jun 2025, Xue et al., 22 May 2025, Chen et al., 7 Aug 2025). A single denoising update suffices when the latent includes enough residual or semantic content.
- Hybrid decoders or correction modules: Some designs combine an end-to-end (privileged) decoder with a diffusion model by transmitting side information or learned linear fusion weights, enabling mixing of low-distortion and high-perceptual reconstructions (Ma et al., 7 Apr 2024).
Conditional signals for the diffusion decoder include compressed latents, semantic or residual text prompts, hyperpriors, auxiliary branch outputs, palette indices, or even machine-optimized representations (for joint human/machine coding) (Ke et al., 13 May 2025, Guo et al., 8 Oct 2024, Shindo et al., 23 Mar 2025).
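The compressed-feature initialization strategy can be sketched as follows: instead of starting the reverse chain from pure noise at step $T$, the decoder noises the received latent to a small step `t_start` and runs only that many denoising updates. This is a simplified numpy sketch with deterministic DDIM-style updates; `denoise_fn(z, t)` is a hypothetical stand-in for a conditional denoising network.

```python
import numpy as np

def decode_from_latent(z_hat, denoise_fn, betas, t_start, rng):
    """Start the reverse chain from the compressed latent z_hat plus a small
    amount of noise at step t_start (<< T), then run t_start denoising steps.
    denoise_fn(z, t) predicts the noise component eps at step t."""
    alpha_bar = np.cumprod(1.0 - betas)
    # Noise the compressed latent to the diffusion state at t_start.
    eps = rng.standard_normal(z_hat.shape)
    z = np.sqrt(alpha_bar[t_start]) * z_hat + np.sqrt(1 - alpha_bar[t_start]) * eps
    # Short reverse chain (deterministic DDIM-like updates for simplicity).
    for t in range(t_start, 0, -1):
        eps_pred = denoise_fn(z, t)
        z0_pred = (z - np.sqrt(1 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
        ab_prev = alpha_bar[t - 1]
        z = np.sqrt(ab_prev) * z0_pred + np.sqrt(1 - ab_prev) * eps_pred
    return z
```

Setting `t_start` to a small value (e.g. tens of steps rather than a thousand) is what yields the large latency reductions reported for compressed-feature initialization; the single-step methods push this to `t_start = 1` with a distilled denoiser.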
4. Training Paradigms and Optimization Objectives
Modern DBICs employ various training strategies to optimize the rate–distortion–perception trade-off:
- Multi-stage curricula: These generally start with high-bitrate or independent-step training, followed by fixed-step fine-tuning to align training and inference schedules (e.g., RDEIC Stage II), or with a high-to-low annealing of rate constraints (Li et al., 3 Oct 2024, Chen et al., 7 Aug 2025).
- Perceptual and adversarial losses: To ensure realism, objectives often blend MSE and LPIPS, sometimes with GAN or CLIP-based perceptual terms (Zhang et al., 27 Jun 2025, Park et al., 19 Jun 2025, Zhou et al., 9 May 2025).
- Consistency and residual losses: Emphasize either semantic consistency (for specific tasks like facial or screen content) or residual coding losses, including hybrid schemes where the residual between backbone decoder output and target is modeled and compressed via diffusion (Zhou et al., 9 May 2025, Song et al., 17 Jul 2024).
- Semantic distillation and guidance: For one-step decoders, semantic information is injected via distilled hyperpriors, token indices, or prompt optimization to compensate for the reduced generative capacity in a non-iterative process (Xue et al., 22 May 2025, Ke et al., 13 May 2025).
- Entropy regularization: All bitrate-constrained methods include explicit or implicit regularization (Lagrangian terms) for rate–distortion upper bounds.
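Taken together, these objectives typically combine into a single Lagrangian of the form $\mathcal{L} = D_{\text{MSE}} + \lambda_{\text{perc}} D_{\text{perc}} + \lambda_{\text{rate}} R$. A minimal sketch, where `perceptual_fn` stands in for LPIPS or a similar learned metric and the lambda weights are illustrative rather than values from any cited paper:

```python
import numpy as np

def rdp_loss(x, x_hat, rate_bits, perceptual_fn, lam_perc=1.0, lam_rate=0.01):
    """Rate-distortion-perception objective sketch:
    L = MSE(x, x_hat) + lam_perc * perceptual(x, x_hat) + lam_rate * R.
    Sweeping lam_rate traces out different operating points on the
    rate-distortion curve."""
    mse = np.mean((x - x_hat) ** 2)
    perc = perceptual_fn(x, x_hat)
    return mse + lam_perc * perc + lam_rate * rate_bits
```

In practice the rate term $R$ is the estimated entropy of the quantized latents under the learned prior, and adversarial or CLIP-based terms slot in alongside (or in place of) the perceptual term.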
5. Efficiency, Scalability, and Practical Considerations
DBICs have evolved to optimize for both efficiency and operational flexibility:
- Decoding speed: While early diffusion-based codecs suffered from extreme inference latency (hundreds of seconds per megapixel at 1000 steps), fixes such as compressed-feature initialization, single-step denoisers, and optimized compressed-feature pipelines reduce decoding time to sub-second, competitive with transform codecs (Li et al., 3 Oct 2024, Park et al., 19 Jun 2025, Zhang et al., 27 Jun 2025).
- Bitrate and perceptual control: Content-adaptive designs (palette-based, universal quantization) and progressive protocols (zero-shot PSC and Turbo-DDCM) allow selection of operating points post-training without retraining; some systems even allow inference-time selection of perceptual vs. distortion-optimal decoders (Guo et al., 8 Oct 2024, Elata et al., 13 Jul 2024, Vaisman et al., 9 Nov 2025).
- Semantic and task scalability: Several approaches provide dual machine/human decoders from the same bitstream, or enable adaptive ROI prioritization, semantic residual enhancement, or explicit region control for screen/face/segmentation use-cases (Xu et al., 9 May 2025, Shindo et al., 23 Mar 2025, Ke et al., 13 May 2025, Zhou et al., 9 May 2025).
- Model and training cost: "Compression-oriented Diffusion" (CoD) demonstrates that training efficient diffusion backbones from scratch with image-only data is feasible, scaling from 49M to 1B parameters, and yielding better or on-par results compared to Stable Diffusion-based pipelines at a fraction of the compute (Jia et al., 24 Nov 2025). Foundation model reuse and LoRA/adapter-based fine-tuning are prevalent for rapid adaptation (Zhang et al., 27 Jun 2025, Park et al., 19 Jun 2025).
6. Quantitative Performance and Benchmarks
State-of-the-art DBIC systems consistently outperform baselines—including VVC, HiFiC, MS-ILLM, GAN-based, and classic learned codecs—on key perceptual metrics under extreme compression (below 0.1 bpp):
| Method | PSNR ↑ | MS-SSIM ↑ | LPIPS ↓ | FID ↓ | Dec. Time (s) |
|---|---|---|---|---|---|
| RDEIC (2 steps) | best* | best* | best* | best* | 0.38 |
| DiffO (1 step) | 24.41 | 0.77 | 0.319 | best* | 0.25 |
| SODEC (1 step) | best* | best* | best* | best* | 0.23 |
| StableCodec (1 step) | best* | best* | best* | best* | 0.33 |

*"best" indicates the metric reported as best among the baselines compared in the respective paper; absolute values are not directly comparable across papers due to differing test sets and bitrates.
Qualitative studies report sharp structure recovery, minimal hallucination, and perceptually plausible texture synthesis. Some approaches (e.g., ResULIC with semantic residual coding) demonstrate up to 80% BD-rate savings over prior diffusion methods in LPIPS and FID at 0.005 bpp (Ke et al., 13 May 2025), while single-step decoders achieve 20×–50× speedups over classic samplers (Chen et al., 7 Aug 2025, Park et al., 19 Jun 2025).
7. Limitations, Challenges, and Future Directions
While DBIC systems have set new performance and flexibility milestones, open limitations remain:
- Trade-off flexibility: Control over the realism-fidelity frontier is still an active area. Some approaches attain low perceptual distortion only at the cost of lower PSNR. Effective integration of semantic and residual guidance, especially under severe rates, is ongoing (Ke et al., 13 May 2025, Zhou et al., 9 May 2025).
- Resolution and scalability: While single-step and foundation-oriented models show excellent results on 2K and moderate-resolution imagery, extensions to 4K and beyond require further adaptation and color-fix strategies (Zhang et al., 27 Jun 2025).
- Zero-shot and universal codecs: Posterior sampling and zero-shot schemes (e.g., PSC, Turbo-DDCM) offer flexibility but can be computationally intensive at present; reducing the number of necessary denoising steps per image remains a research focus (Vaisman et al., 9 Nov 2025, Elata et al., 13 Jul 2024).
- Specialized content: High-frequency semantic consistency for faces or text-rich screen content is challenging; hybrid schemes (e.g., wavelet-based or textual residual coding) are emerging but require further refinement (Zhou et al., 9 May 2025, Gherghetta et al., 16 Jul 2025, Xu et al., 9 May 2025).
- Generative prior mismatch: Discrepancies between the quantization process and the trained noise models (discretization, noise-level, and noise-type gaps) can cause artifacts unless carefully matched via scheduling and retraining (Relic et al., 3 Apr 2025).
- Video and multi-modal extensions: While most current research is single-image focused, combining temporal context and learned priors for video compression is a prospect for future prototypes (Guo et al., 8 Oct 2024, Zhang et al., 27 Jun 2025).
In sum, diffusion-based image compression systems now constitute a foundation for practical, perceptually-driven ultra-low bitrate codecs, merging latent/semantic modeling, adaptive quantization, advanced conditional generative priors, and efficient decoding architectures to approach or surpass human visual plausibility limits while remaining computationally viable for real-world applications (Li et al., 3 Oct 2024, Park et al., 19 Jun 2025, Zhang et al., 27 Jun 2025, Jia et al., 24 Nov 2025).