DiffCR: Diffusion in Cloud Removal & Compression

Updated 9 April 2026

The two DiffCR papers show that diffusion models can remove clouds from satellite images and, separately, enable efficient low-rate lossy image compression with high fidelity and fast inference.
They introduce, respectively, a decoupled condition encoder and a frequency-aware consistency module to improve reconstruction quality while reducing computational cost.
Experiments report improved PSNR and SSIM and reduced latency relative to GAN and regression baselines, with practical benefits for remote sensing and media compression.

DiffCR refers to two independent, state-of-the-art frameworks leveraging diffusion models for high-fidelity conditional image reconstruction—specifically, cloud removal from optical satellite images and efficient low-rate lossy image compression. Both frameworks are conceptually and methodologically distinct but share the common goal of accelerating and improving diffusion-based conditional image generation by introducing specialized architectural and optimization strategies.

1. Background and Motivation

DiffCR, as proposed in (Zou et al., 2023), is a conditional diffusion framework targeting the removal of cloud artifacts in multi-temporal optical satellite imagery. The central challenge is to infer a plausible, cloud-free version $y_0$ of a scene from $N$ available cloudy observations $x \in \mathbb{R}^{N \times C \times H \times W}$, where $C$ is the number of spectral bands and $H \times W$ is the spatial resolution. Existing GAN-based and regression approaches suffer from training instability and limited fidelity, while diffusion models offer stable likelihood-based learning and can incrementally refine image structure via denoising.

A second, independently named DiffCR, introduced in (Xia et al., 15 Jan 2026), addresses lossy image compression at very low bit rates ($\leq$ 0.05 bpp) by integrating a pre-trained latent diffusion prior (e.g., Stable Diffusion) with a trainable, frequency-aware consistency module. This framework aims to resolve both the slow inference and the bit allocation mismatches that afflict standard diffusion-based codecs, enabling high perceptual quality and significant rate–distortion gains with fast decoding.

2. DiffCR for Cloud Removal: Methodology and Architecture

DiffCR for cloud removal (Zou et al., 2023) is based on a conditional diffusion process, where each denoising step is guided by multi-temporal cloudy inputs. The forward process is parameterized as

$q(y_{1:T} | y_0) = \prod_{t=1}^T q(y_t | y_{t-1}), \ \ q(y_t | y_{t-1}) = \mathcal{N}(y_t; \sqrt{1-\beta_t}y_{t-1}, \beta_t I),$

using a fixed noise schedule $\{\beta_t\}$.
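Because every forward step is Gaussian, $y_t$ can be drawn directly from $y_0$ using the cumulative product $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$, a standard property of this forward process. The NumPy sketch below illustrates that shortcut. The linear schedule, the array shapes, and the names `make_alpha_bar` and `q_sample` are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative product of (1 - beta_t); entry t-1 corresponds to step t."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=np.float64))

def q_sample(y0, t, alpha_bar, rng):
    """Draw y_t ~ q(y_t | y_0) in one shot (t is 1-indexed)."""
    a = alpha_bar[t - 1]
    noise = rng.standard_normal(y0.shape)
    return np.sqrt(a) * y0 + np.sqrt(1.0 - a) * noise, noise

# Hypothetical linear schedule and toy image, chosen only for illustration.
betas = np.linspace(1e-4, 2e-2, 1000)
alpha_bar = make_alpha_bar(betas)
rng = np.random.default_rng(0)
y0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # (C, H, W)
y_t, eps = q_sample(y0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Any other fixed schedule can be substituted by changing `betas`; the rest of the sketch is unchanged.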

The reverse denoising step is

$p_\theta(y_{t-1} | y_t, x) = \mathcal{N}(y_{t-1}; \mu_\theta(y_t, t, x), \Sigma_\theta(y_t, t, x)),$

with conditional guidance on the input.
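For intuition, one reverse step can be sketched under two assumptions: the network outputs an estimate of the clean image, and the variance is fixed to the DDPM posterior variance. This is the textbook DDPM posterior, not necessarily the sampler used in DiffCR. The function name `posterior_step`, the schedule, and the shapes are hypothetical.

```python
import numpy as np

def posterior_step(y_t, y0_hat, t, betas, rng):
    """Sample y_{t-1} from the Gaussian posterior q(y_{t-1} | y_t, y0_hat); t is 1-indexed."""
    betas = np.asarray(betas, dtype=np.float64)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    a_bar_t = alpha_bar[t - 1]
    a_bar_prev = alpha_bar[t - 2] if t > 1 else 1.0
    beta_t = betas[t - 1]
    # Posterior mean is a weighted blend of the clean estimate and the noisy input.
    coef_y0 = np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_yt = np.sqrt(alphas[t - 1]) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_y0 * y0_hat + coef_yt * y_t
    var = beta_t * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    return mean + np.sqrt(var) * rng.standard_normal(y_t.shape)

# Toy usage; y0_hat stands in for a conditional network's output given (y_t, t, x).
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 10)
y0_hat = rng.uniform(-1.0, 1.0, size=(3, 4, 4))
y_t = rng.standard_normal(y0_hat.shape)
y_prev = posterior_step(y_t, y0_hat, t=1, betas=betas, rng=rng)
```

At the final step (`t=1`) the posterior variance is zero and the mean collapses to the clean estimate, so the step returns `y0_hat` itself.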

Key architectural components:

Decoupled Condition Encoder: Extracts condition features $F_c^\ell$ at each U-Net stage via TCFBlocks and 2×2 stride convolution. Condition features are cached and fused at each level.
Time Encoder: Encodes the timestep $t$ as a sinusoidal embedding, which a two-layer MLP with SiLU activations maps to per-channel features.
Time and Condition Fusion Block (TCFBlock): Implements joint fusion of noisy features, condition, and time using four submodules: Spatial Extraction (SSA + DWConv), Split Channel Attention (GAP/GMP and pointwise FC), fusion and skip connection, and Feature Recalibration (LN, SSA, and pointwise FC).
Denoising Autoencoder (U-Net backbone): Encoder path uses TCFBlocks and downsampling, followed by a bottleneck stack, and a decoder path with pixel shuffle upsampling and skip connections.
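The time encoder in the list above can be sketched as follows. This is a minimal illustration rather than the authors' code: the embedding width, hidden width, channel count, random weights, and the placement of the SiLU between the two linear layers are all assumptions.

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Standard sinusoidal timestep embedding: [sin(t * f_i), cos(t * f_i)]."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def silu(x):
    """SiLU activation, x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def time_mlp(emb, w1, b1, w2, b2):
    """Two-layer MLP mapping the embedding to per-channel features."""
    return silu(emb @ w1 + b1) @ w2 + b2

# Hypothetical sizes and randomly initialized weights, for illustration only.
rng = np.random.default_rng(0)
emb_dim, hidden, channels = 64, 128, 32
w1 = rng.standard_normal((emb_dim, hidden)) * 0.02
b1 = np.zeros(hidden)
w2 = rng.standard_normal((hidden, channels)) * 0.02
b2 = np.zeros(channels)

emb = sinusoidal_embedding(t=10, dim=emb_dim)
feat = time_mlp(emb, w1, b1, w2, b2)  # one value per feature channel
```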

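The data-prediction loss is

$\mathcal{L}(\theta) = \mathbb{E}_{y_0, x, t, \epsilon}\left[\lVert f_\theta(y_t, t, x) - y_0 \rVert_p^p\right], \ \ y_t = \sqrt{\bar{\alpha}_t}\,y_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,$

where $f_\theta$ is the network's estimate of the clean image, $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, $\epsilon \sim \mathcal{N}(0, I)$, and $p$ is the chosen norm order, enforcing high-fidelity, color-accurate synthesis without adversarial or perceptual losses.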
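A data-prediction objective of this kind can be sketched in a few lines: noise the clean target, ask the network for a clean estimate, and penalize the difference from the target rather than from the noise. The norm order `p`, the schedule, the shapes, and the stand-in denoisers below are assumptions for illustration; the real model is a conditional U-Net.

```python
import numpy as np

def data_prediction_loss(denoiser, y0, x, t, alpha_bar, rng, p=2):
    """Monte Carlo estimate of the loss for one (y0, x, t) sample."""
    a = alpha_bar[t - 1]
    eps = rng.standard_normal(y0.shape)
    y_t = np.sqrt(a) * y0 + np.sqrt(1.0 - a) * eps  # noised target
    y0_hat = denoiser(y_t, t, x)                    # network predicts the clean image
    return float(np.mean(np.abs(y0_hat - y0) ** p))

def mean_baseline(y_t, t, x):
    """Hypothetical stand-in network: average the N cloudy inputs."""
    return x.mean(axis=0)

# Hypothetical schedule and toy tensors.
betas = np.linspace(1e-4, 2e-2, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(3, 4, 8, 8))  # N=3 cloudy frames, C=4 bands
y0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8))    # cloud-free target
loss = data_prediction_loss(mean_baseline, y0, x, t=300, alpha_bar=alpha_bar, rng=rng)
```

A denoiser that returned the target exactly would give a loss of zero, which is a useful sanity check when wiring up a training loop.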

3. DiffCR for Compression: Architecture and Algorithms

DiffCR for compression (Xia et al., 15 Jan 2026) departs from canonical diffusion-based codecs by introducing a lightweight, trainable consistency refinement module, FaSE, that works atop a frozen latent diffusion model. The system comprises:

Learned Latent Compressor & Control Branch: An analysis encoder maps the image to a latent representation, which a VQ-hyperprior model compresses into a quantized latent. A control branch injects semantic information (e.g., CLIP embeddings), enabling multimodal conditioning.
Frozen Diffusion Prior: Uses a fixed, pre-trained latent diffusion backbone to perform DDIM-based sampling in latent space.
Consistency Refinement (FaSE, FDA, Lightweight Estimator):
- FaSE adds a trainable component on top of the frozen prior, trained to align intermediate latent estimates with the compressed codes.
- Frequency Decoupling Attention (FDA) operates in Fourier space, splitting features by frequency and attending differentially during denoising.
- The Lightweight Consistency Estimator has a small fraction of the roughly 800M parameters of the frozen backbone and is trained with z-prediction and self-consistency objectives to enable two-step high-quality decoding.
Two-step Decoding Algorithm: Starting from an initial latent, only two invocations of the consistency estimator (at two selected timesteps) via the DDIM solver reconstruct the latent code, which is then decoded to the image.
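The two-step decoding idea can be sketched with a deterministic DDIM update driven by a z-prediction estimator. Everything named here is an assumption for illustration: the estimator signature, the use of $\bar{\alpha}$ values in place of timestep indices, the noise levels, and the oracle estimator that stands in for the trained module.

```python
import numpy as np

def ddim_step(z_t, z0_hat, a_t, a_prev):
    """Deterministic DDIM update (eta = 0) from noise level a_t to a_prev."""
    eps_hat = (z_t - np.sqrt(a_t) * z0_hat) / np.sqrt(1.0 - a_t)
    return np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps_hat

def two_step_decode(z_start, estimator, cond, a_1, a_2):
    """Two estimator calls: at noise level a_1, then at a_2 (a_1 < a_2 < 1)."""
    z0_hat = estimator(z_start, a_1, cond)
    z_mid = ddim_step(z_start, z0_hat, a_1, a_2)
    z0_hat = estimator(z_mid, a_2, cond)
    return ddim_step(z_mid, z0_hat, a_2, 1.0)  # a_prev = 1 returns the clean estimate

# Toy check with a hypothetical oracle that already knows the clean latent.
calls = []
def oracle_estimator(z_t, a_t, cond):
    calls.append(a_t)
    return cond

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
a_1, a_2 = 0.1, 0.6  # hypothetical cumulative alpha values
z_start = np.sqrt(a_1) * z0 + np.sqrt(1.0 - a_1) * rng.standard_normal(z0.shape)
z_rec = two_step_decode(z_start, oracle_estimator, cond=z0, a_1=a_1, a_2=a_2)
```

In a real codec the returned latent would then be passed to the image decoder; that stage is omitted here.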

4. Experimental Results and Quantitative Analysis

Cloud Removal

On the Sen2_MTC_Old and Sen2_MTC_New benchmarks:

DiffCR (1 step) achieves PSNR 29.11 dB, SSIM 0.886, FID 89.85, LPIPS 0.258 with 22.91M parameters and 45.86 GMACs.
Consistently outperforms GAN and prior diffusion baselines in PSNR, at a substantially lower computational cost (5.1% of the parameters and 5.4% of the MACs of DDPM-CR).
Inference latency is 0.09 s per patch (one step).
Ablation shows the importance of the sigmoid noise schedule, direct data-prediction, and the decoupled encoder; performance saturates at 3 steps, and further steps offer no benefit (Zou et al., 2023).
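As a rough consistency check, the reported percentages imply the approximate size of the DDPM-CR baseline. The short calculation below assumes the percentages are taken relative to the 22.91M-parameter, 45.86-GMAC configuration quoted above, and the results inherit the rounding of those percentages.

```python
# Figures quoted in this section for DiffCR.
diffcr_params_m = 22.91   # parameters, in millions
diffcr_gmacs = 45.86      # multiply-accumulates, in GMACs
param_fraction = 0.051    # DiffCR parameters as a fraction of DDPM-CR
mac_fraction = 0.054      # DiffCR MACs as a fraction of DDPM-CR

# Implied baseline cost (approximate, since the fractions are rounded).
baseline_params_m = diffcr_params_m / param_fraction
baseline_gmacs = diffcr_gmacs / mac_fraction

print(f"implied DDPM-CR parameters: ~{baseline_params_m:.0f}M")
print(f"implied DDPM-CR compute: ~{baseline_gmacs:.0f} GMACs")
print(f"parameter reduction factor: ~{1 / param_fraction:.1f}x")
```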

Compression

On Kodak and CLIC20/DIV2K:

DiffCR yields 27.2% BD-rate reduction (LPIPS) and 65.1% (PSNR) over previous diffusion codecs; FID is improved by 32.8%.
Decoding latency is 0.48 s (two steps), faster than both 50-step diffusion decoders and the closest 4-step method.
Ablation demonstrates that the CRE (FaSE) module provides the largest accuracy gain, with FDA and two-stage training offering further substantial improvements (Xia et al., 15 Jan 2026).
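BD-rate figures like those above are conventionally computed with the Bjøntegaard method: fit a cubic polynomial to log-rate as a function of quality for each codec, then average the gap between the two fits over the shared quality range. The sketch below follows that convention. The rate–distortion points are made up for illustration, and the paper's exact evaluation protocol may differ (for LPIPS, where lower is better, the quality axis is typically negated first).

```python
import numpy as np

def bd_rate(rate_ref, qual_ref, rate_test, qual_test):
    """Average bitrate change (%) of the test codec vs. the reference at equal quality."""
    p_ref = np.polyfit(qual_ref, np.log(rate_ref), 3)
    p_test = np.polyfit(qual_test, np.log(rate_test), 3)
    lo = max(np.min(qual_ref), np.min(qual_test))
    hi = min(np.max(qual_ref), np.max(qual_test))
    # Integrate both fitted log-rate curves over the overlapping quality interval.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Hypothetical rate-distortion points (bpp, PSNR in dB), not taken from the paper.
rate_ref = np.array([0.02, 0.03, 0.04, 0.05])
psnr_ref = np.array([24.0, 25.5, 26.6, 27.4])
rate_test = 0.5 * rate_ref  # toy codec needing half the bits at the same quality
psnr_test = psnr_ref.copy()

saving = bd_rate(rate_ref, psnr_ref, rate_test, psnr_test)
```

A negative value means the test codec needs fewer bits than the reference for the same quality.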

5. Architectural Innovations and Training Paradigms

Both DiffCR frameworks emphasize architectural decoupling and frequency/condition-aware processing:

Decoupled Condition Encoding (for cloud removal): Preserves spectral statistics and improves appearance similarity between conditional and reconstructed images.
FaSE and FDA (for compression): Address misalignment between ε-prediction and compressed codes, using frequency-domain attention to focus on coarse structures early and fine details late in the sampling procedure.
Data-prediction vs. noise-prediction: Cloud removal DiffCR regresses the data directly, which proves optimal for PSNR and color fidelity.
Unified and staged training: Image compression DiffCR trains compressor, control, and consistency modules jointly before fine-tuning on perceptual distortion.
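The frequency-domain split underlying FDA can be illustrated with a simple low-pass/high-pass decomposition of a feature map. This sketch covers only the splitting step, not the attention mechanism itself. The radial cutoff, the shapes, and the function name `split_frequencies` are assumptions.

```python
import numpy as np

def split_frequencies(feat, cutoff):
    """Split a (C, H, W) feature map into low- and high-frequency parts.

    A radially symmetric mask in the 2-D Fourier domain keeps frequencies whose
    normalized radius is at most `cutoff`; the remainder is the high-frequency part.
    """
    h, w = feat.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low_mask = np.sqrt(fy ** 2 + fx ** 2) <= cutoff
    spec = np.fft.fft2(feat, axes=(-2, -1))
    low = np.fft.ifft2(spec * low_mask, axes=(-2, -1)).real
    high = feat - low
    return low, high

# Toy feature map with a hypothetical cutoff.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))
low, high = split_frequencies(feat, cutoff=0.15)
```

A sampler that emphasizes coarse structure early and fine detail late could then weight `low` and `high` differently as denoising progresses; how DiffCR does this inside its attention layers is not reproduced here.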

6. Limitations and Potential Extensions

Limitations for both approaches include:

For cloud removal: Model can fail in ambiguous scenes (e.g., dark water surfaces where cloud shadows resemble “holes”), and generalization beyond this domain is not demonstrated.
For compression: Reliance on a large frozen diffusion model propagates pretrained biases, and semantic control side information could be further optimized.

Proposed future work covers:

Incorporating global or multimodal data (such as SAR for satellite, or improved text/image semantics for compression).
Generalization to tasks like inpainting, super-resolution, or video computational imaging.
Further acceleration, e.g., distilling the consistency module for single-step inference.
Integrating adaptive or learned noise schedules to further minimize sampling steps.

7. Comparative Position and Impact

DiffCR (cloud removal) (Zou et al., 2023) establishes a new state of the art for satellite data restoration, demonstrating the feasibility of conditional diffusion architectures with order-of-magnitude efficiency improvements. DiffCR (compression) (Xia et al., 15 Jan 2026) shows that diffusion priors, when combined with frequency-aware and consistency-enforcing modules, can close the gap in rate–distortion and latency versus GAN and regression methods, while offering semantic/image-level control and extensibility to future modalities.

Both frameworks underscore the trend toward integrating modular, trainable components into foundation models, accelerating inference and improving faithfulness for challenging conditional generation tasks.


References

Zou et al. (2023). DiffCR: A Fast Conditional Diffusion Framework for Cloud Removal from Optical Satellite Images.

Xia et al. (15 Jan 2026). Towards Efficient Low-rate Image Compression with Frequency-aware Diffusion Prior Refinement.
