DiffCR: Diffusion in Cloud Removal & Compression
- Two independently proposed frameworks named DiffCR show that diffusion models can remove clouds from satellite images and perform lossy image compression with high fidelity and fast inference.
- The cloud-removal framework introduces a decoupled condition encoder, and the compression framework introduces a frequency-aware consistency module; both aim to improve reconstruction quality while reducing computational cost.
- Experimental results reveal improved PSNR, SSIM, and reduced latency over traditional GAN and regression methods, highlighting practical benefits in remote sensing and media compression.
DiffCR refers to two independent, state-of-the-art frameworks that use diffusion models for high-fidelity conditional image reconstruction: cloud removal from optical satellite images, and efficient low-rate lossy image compression. The two frameworks are conceptually and methodologically distinct but share the goal of accelerating and improving diffusion-based conditional image generation through specialized architectural and optimization strategies.
1. Background and Motivation
DiffCR, as proposed in (Zou et al., 2023), is a conditional diffusion framework targeting the removal of cloud artifacts in multi-temporal optical satellite imagery. The central challenge is to infer a plausible, cloud-free version $x_0$ of a scene from the available cloudy observations $y$, each of size $C \times H \times W$, where $C$ is the number of spectral bands and $H \times W$ is the spatial resolution. Existing GAN-based and regression approaches suffer from training instability and limited fidelity, while diffusion models offer stable likelihood-based learning with the ability to incrementally refine image structure via denoising.
A second, independently named DiffCR, introduced in (Xia et al., 15 Jan 2026), addresses lossy image compression at very low bit rates (0.05 bpp) by integrating a pre-trained latent diffusion prior (e.g., Stable Diffusion) with a trainable, frequency-aware consistency module. This framework aims to resolve both the slow inference and bit allocation mismatches that afflict standard diffusion-based codecs, enabling high perceptual quality and significant rate–distortion gains with fast decoding.
2. DiffCR for Cloud Removal: Methodology and Architecture
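DiffCR for cloud removal (Zou et al., 2023) is based on a conditional diffusion process in which each denoising step is guided by the multi-temporal cloudy inputs $y$. The forward process is parameterized as
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right)$$
using a fixed noise schedule $\{\bar{\alpha}_t\}_{t=1}^{T}$.
The reverse denoising step is
$$p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, y, t),\ \sigma_t^2 I\right)$$
with conditional guidance on the input.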
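The forward noising step can be illustrated with a minimal NumPy sketch. It assumes the standard closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$; the function names and the particular sigmoid parameterization (`start`, `end`) are illustrative assumptions, not the schedule used in the paper.

```python
import numpy as np

def sigmoid_alpha_bar(num_steps, start=-3.0, end=3.0):
    """Hypothetical sigmoid-shaped schedule: cumulative signal level that
    decreases monotonically from close to 1 toward close to 0."""
    s = np.linspace(start, end, num_steps)
    return 1.0 / (1.0 + np.exp(s))

def forward_noise(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form and return the noise used."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = sigmoid_alpha_bar(1000)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # toy image: C=3, H=W=8
x_t, eps = forward_noise(x0, 500, alpha_bar, rng)
```

Because the schedule is fixed, any timestep can be sampled directly from $x_0$ without simulating the intermediate steps, which is what makes training on randomly drawn timesteps cheap.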
Key architectural components:
- Decoupled Condition Encoder: Extracts condition features at each U-Net stage via TCFBlocks and 2×2 stride convolution. Condition features are cached and fused at each level.
- Time Encoder: Encodes the timestep $t$ as a sinusoidal embedding, which is mapped to per-channel features via a two-layer MLP with SiLU activations.
- Time and Condition Fusion Block (TCFBlock): Implements joint fusion of noisy features, condition, and time using four submodules: Spatial Extraction (SSA + DWConv), Split Channel Attention (GAP/GMP and pointwise FC), fusion and skip connection, and Feature Recalibration (LN, SSA, and pointwise FC).
- Denoising Autoencoder (U-Net backbone): Encoder path uses TCFBlocks and downsampling, followed by a bottleneck stack, and a decoder path with pixel shuffle upsampling and skip connections.
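The time encoder described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the embedding uses the common sine/cosine construction with base 10000, the SiLU is placed after the first linear layer only, and all dimensions and weights are hypothetical placeholders rather than the paper's configuration.

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Sine/cosine timestep embedding; `dim` is assumed to be even."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def time_features(t, w1, b1, w2, b2):
    """Two-layer MLP mapping the embedding to one feature per channel.
    Activation placement here is an assumption."""
    emb = sinusoidal_embedding(t, w1.shape[0])
    hidden = silu(emb @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
emb_dim, hidden_dim, channels = 64, 128, 32      # hypothetical sizes
w1 = rng.standard_normal((emb_dim, hidden_dim)) * 0.02
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, channels)) * 0.02
b2 = np.zeros(channels)
feats = time_features(250, w1, b1, w2, b2)       # shape: (channels,)
```

The resulting per-channel vector is the kind of signal a fusion block can broadcast over the spatial dimensions of a feature map.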
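The data-prediction loss is
$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, y,\, t,\, \epsilon}\,\big\lVert x_0 - f_\theta(x_t, y, t) \big\rVert$$
where the denoising network $f_\theta$ regresses the clean image $x_0$ directly from the noisy input $x_t$, the condition $y$, and the timestep $t$, enforcing high-fidelity, color-accurate synthesis without adversarial or perceptual losses.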
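To make the difference between data prediction and noise prediction concrete, the sketch below computes a data-prediction loss and recovers the noise implied by a clean-image estimate. It assumes the standard relation $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and uses mean squared error; the choice of norm and the function names are assumptions, not details confirmed by the source.

```python
import numpy as np

def data_prediction_loss(x0_pred, x0_true):
    """Mean squared error between predicted and true clean images
    (the norm is an assumption made for this sketch)."""
    return float(np.mean((x0_pred - x0_true) ** 2))

def implied_noise(x_t, x0_pred, alpha_bar_t):
    """Noise estimate implied by a clean-image prediction at signal level alpha_bar_t."""
    return (x_t - np.sqrt(alpha_bar_t) * x0_pred) / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(1)
x0_true = rng.uniform(-1.0, 1.0, size=(3, 4, 4))
noise = rng.standard_normal(x0_true.shape)
alpha_bar_t = 0.6                                   # hypothetical signal level
x_t = np.sqrt(alpha_bar_t) * x0_true + np.sqrt(1.0 - alpha_bar_t) * noise

perfect_loss = data_prediction_loss(x0_true, x0_true)   # oracle prediction
recovered = implied_noise(x_t, x0_true, alpha_bar_t)    # matches `noise`
```

The two parameterizations carry the same information at a given timestep, so a data-predicting network can still be plugged into samplers that are written in terms of noise estimates.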
3. DiffCR for Compression: Architecture and Algorithms
DiffCR for compression (Xia et al., 15 Jan 2026) departs from canonical diffusion-based codecs by introducing a lightweight, trainable consistency refinement module, FaSE, that works atop a frozen latent diffusion model. The system comprises:
- Learned Latent Compressor & Control Branch: Encodes images into a latent $z$ with an analysis encoder, then compresses it with a VQ-hyperprior model to obtain the quantized latent $\hat{z}$. A control branch injects semantic information (e.g., CLIP embeddings), enabling multimodal conditioning.
- Frozen Diffusion Prior: Uses a fixed noise-prediction backbone $\epsilon_\theta$ to perform DDIM-based sampling in latent space.
- Consistency Refinement (FaSE, FDA, Lightweight Estimator):
  - FaSE parameterizes the refinement through a trainable estimator that is trained to align intermediate latent estimates with the compressed codes.
  - Frequency Decoupling Attention (FDA) operates in Fourier space, splitting features by frequency and attending to the bands differently during denoising.
  - The Lightweight Consistency Estimator has 18M parameters (vs. 800M for the frozen backbone) and is trained with z-prediction and self-consistency objectives to enable two-step high-quality decoding.
- Two-step Decoding Algorithm: Starting from the initial latent state, only two network invocations (at two timesteps) via the DDIM solver reconstruct the latent code, which is then decoded to the image.
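The two-step decoding idea can be sketched with a deterministic DDIM update written for a network that predicts the clean latent. This is only an illustration under assumptions: the `estimator` callable, the signal levels `0.2` and `0.7`, and the oracle used in the demo are hypothetical stand-ins, not the trained module or the timesteps used in the paper.

```python
import numpy as np

def ddim_step(z_t, z0_hat, a_t, a_prev):
    """Deterministic DDIM update (eta = 0) given a clean-latent estimate.
    a_t and a_prev are cumulative signal levels (alpha-bar) at the two times."""
    eps_hat = (z_t - np.sqrt(a_t) * z0_hat) / np.sqrt(1.0 - a_t)
    return np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps_hat

def two_step_decode(z_start, estimator, a_first, a_second):
    """Two estimator calls: a_first -> a_second, then a_second -> clean (level 1.0)."""
    z = ddim_step(z_start, estimator(z_start, a_first), a_first, a_second)
    return ddim_step(z, estimator(z, a_second), a_second, 1.0)

rng = np.random.default_rng(2)
target = rng.standard_normal((4, 8, 8))            # toy "clean" latent
z_start = np.sqrt(0.2) * target + np.sqrt(0.8) * rng.standard_normal(target.shape)

def oracle(z, a):
    """Stand-in for a trained estimator: always returns the true latent."""
    return target

decoded = two_step_decode(z_start, oracle, a_first=0.2, a_second=0.7)
```

With a perfect estimator the second update lands exactly on the clean latent, which illustrates why the quality of a few-step decoder depends almost entirely on how accurate the estimate is at those few timesteps.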
4. Experimental Results and Quantitative Analysis
Cloud Removal
On the Sen2_MTC_Old and Sen2_MTC_New benchmarks:
- DiffCR (1 step) achieves PSNR 29.11 dB, SSIM 0.886, FID 89.85, LPIPS 0.258 with 22.91M parameters and 45.86 GMACs.
- Consistently outperforms GAN and prior diffusion baselines in PSNR while substantially reducing computational cost (5.1% of the parameters and 5.4% of the MACs of DDPM-CR).
- Inference latency is 0.09 s per patch (one step).
- Ablation shows the importance of the sigmoid noise schedule, direct data prediction, and the decoupled encoder; performance saturates at 3 steps, and further steps offer no benefit (Zou et al., 2023).
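For readers less familiar with the headline metric, PSNR can be computed as in the sketch below. The `data_range` argument (the maximum possible pixel value) and the toy arrays are assumptions for illustration; the normalization used in the paper's evaluation is not specified here.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

clean = np.zeros((3, 16, 16))
restored = np.full((3, 16, 16), 0.1)   # uniform error of 0.1 on a unit range
score = psnr(clean, restored)
```

Under this definition and a unit data range, a score of 29.11 dB corresponds to a root-mean-square error of roughly 0.035.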
Compression
On Kodak and CLIC20/DIV2K:
- DiffCR yields 27.2% BD-rate reduction (LPIPS) and 65.1% (PSNR) over previous diffusion codecs; FID is improved by 32.8%.
- Decoding latency is 0.48 s (two steps), substantially faster than 50-step diffusion decoders and also faster than the closest 4-step method.
- Ablation demonstrates that the CRE (FaSE) module provides the largest accuracy gain, with FDA and two-stage training offering further substantial improvements (Xia et al., 15 Jan 2026).
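To give a sense of scale for the very low rates targeted by this codec (around 0.05 bpp), the following arithmetic sketch converts bits per pixel into a file size for a 768×512 image, the resolution of the Kodak test images. The helper name is hypothetical, and the comparison against raw 24-bit RGB is an illustration rather than a result from the paper.

```python
def compressed_size_bytes(width, height, bpp):
    """Size in bytes of an image coded at `bpp` bits per pixel."""
    return width * height * bpp / 8.0

width, height = 768, 512
coded_bytes = compressed_size_bytes(width, height, 0.05)
raw_bytes = width * height * 3              # 8 bits per channel, 3 channels
compression_ratio = raw_bytes / coded_bytes
```

At this rate the whole image occupies about 2.4 KB, roughly 480 times smaller than the uncompressed RGB data, which is why a strong generative prior is needed to synthesize plausible detail at the decoder.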
5. Architectural Innovations and Training Paradigms
Both DiffCR frameworks emphasize architectural decoupling and frequency/condition-aware processing:
- Decoupled Condition Encoding (for cloud removal): Preserves spectral statistics and improves appearance similarity between conditional and reconstructed images.
- FaSE and FDA (for compression): Address misalignment between ε-prediction and compressed codes, using frequency-domain attention to focus on coarse structures early and fine details late in the sampling procedure.
- Data-prediction vs. noise-prediction: Cloud removal DiffCR regresses the data directly, which the ablations favor for PSNR and color fidelity.
- Unified and staged training: Image compression DiffCR trains compressor, control, and consistency modules jointly before fine-tuning on perceptual distortion.
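The coarse-to-fine behavior attributed to frequency-domain processing can be illustrated with a simple Fourier-space band split. This sketch is not the FDA module: it replaces attention with a fixed radial mask and a scalar weight, and the `cutoff` value and `progress` parameter are assumptions introduced only for illustration.

```python
import numpy as np

def split_frequencies(feat, cutoff=0.25):
    """Split an (H, W) feature map into low- and high-frequency parts
    using a radial mask on normalized frequencies in the Fourier domain."""
    h, w = feat.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low_mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    spec = np.fft.fft2(feat)
    low = np.fft.ifft2(spec * low_mask).real
    high = np.fft.ifft2(spec * ~low_mask).real
    return low, high

def coarse_to_fine(feat, progress, cutoff=0.25):
    """Keep coarse structure throughout and admit fine detail as sampling
    progresses (progress = 0 at the start, 1 at the end)."""
    low, high = split_frequencies(feat, cutoff)
    return low + progress * high

rng = np.random.default_rng(3)
feat = rng.standard_normal((16, 16))
low, high = split_frequencies(feat)
early = coarse_to_fine(feat, progress=0.0)   # low frequencies only
late = coarse_to_fine(feat, progress=1.0)    # full signal restored
```

The two bands sum back to the original map, so the split loses no information; a learned module can then decide how much weight each band receives at each stage of denoising.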
6. Limitations and Potential Extensions
Limitations for both approaches include:
- For cloud removal: The model can fail in ambiguous scenes (e.g., dark water surfaces where cloud shadows resemble “holes”), and generalization beyond this domain is not demonstrated.
- For compression: Reliance on a large frozen diffusion model propagates pretrained biases, and semantic control side information could be further optimized.
Proposed future work covers:
- Incorporating global or multimodal data (such as SAR for satellite, or improved text/image semantics for compression).
- Generalization to tasks like inpainting, super-resolution, or video computational imaging.
- Further acceleration, e.g., distilling the consistency module for single-step inference.
- Integrating adaptive or learned noise schedules to further minimize sampling steps.
7. Comparative Position and Impact
DiffCR (cloud removal) (Zou et al., 2023) establishes a new state of the art for satellite data restoration, demonstrating the feasibility of conditional diffusion architectures with order-of-magnitude efficiency improvements. DiffCR (compression) (Xia et al., 15 Jan 2026) shows that diffusion priors, when combined with frequency-aware and consistency-enforcing modules, can close the gap in rate–distortion and latency versus GAN and regression methods, while offering semantic/image-level control and extensibility to future modalities.
Both frameworks underscore the trend toward integrating modular, trainable components into foundation models, accelerating inference and improving faithfulness for challenging conditional generation tasks.