Versatile Recompression-Aware Perceptual SR
- The paper introduces VRPSR, a novel framework that integrates a differentiable, diffusion-based codec simulator to optimize perceptual super-resolution in the presence of lossy recompression.
- It achieves significant bitrate savings (up to 53.7% on H.266) and improved perceptual quality using a composite loss function that combines MSE, LPIPS, and GAN-based metrics.
- The method employs a two-stage training process that pre-trains a codec simulator before fine-tuning the SR network, enabling end-to-end optimization under variable codec conditions.
Versatile Recompression-Aware Perceptual Super-Resolution (VRPSR) is a perceptual super-resolution (SR) framework designed to optimize high-resolution image reconstruction in the presence of downstream lossy recompression, particularly for practical scenarios where restored images are subsequently compressed for storage or transmission. The method addresses the limitations of conventional perceptual SR pipelines, which ignore the non-differentiable and highly variable nature of modern codecs (e.g., H.264, H.265, H.266), by explicitly integrating codec simulation into the SR optimization process (He et al., 22 Nov 2025).
1. Formal Framework and Problem Formulation
Given an unknown high-resolution image $x$, a degraded low-resolution observation $y$, and a black-box codec $C_q$ parameterized by quality factor $q$, the classical pipeline consists of a degradation model followed by direct compression: $\hat{x}_{\text{direct}} = C_q(y)$, where $b(q)$ is the resulting bitrate. Super-resolution methods apply an SR network $f_\theta$ to obtain a restoration $\hat{x} = f_\theta(y)$, which is then recompressed at a matched bitrate: $\hat{x}_{\text{rec}} = C_q(f_\theta(y))$. The optimization target for recompression-aware SR is $\min_\theta \mathbb{E}\big[\, d\big(x,\, C_q(f_\theta(y))\big) \,\big]$, with $d(\cdot,\cdot)$ representing perceptual distortion metrics such as LPIPS or adversarially trained GAN losses. Since $C_q$ is non-differentiable and the parameter $q$ varies across deployments, direct end-to-end optimization is infeasible.
VRPSR resolves this by introducing (1) a fully differentiable diffusion-based codec simulator $S_\psi(\cdot, c)$, where the condition $c$ encodes codec metadata (“codec type + target bpp”), and (2) an SR network $f_\theta$ trained to optimize perceptual quality post-recompression via the simulator. At inference the simulator is discarded and the real codec is applied to the SR output: $\hat{x}_{\text{final}} = C_q(f_\theta(y))$.
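The division of labor between the non-differentiable real codec and the differentiable simulator can be sketched as follows. Everything here is a toy stand-in for illustration, not the paper's components: the "codec" is hard quantization, the "simulator" is a smooth soft-quantization, and the "SR network" is nearest-neighbor upsampling.

```python
import numpy as np

def real_codec(img, q):
    """Stand-in for the black-box codec C_q: hard quantization whose step
    size grows with the quality parameter q (non-differentiable)."""
    step = q / 255.0
    return np.round(img / step) * step

def codec_simulator(img, q):
    """Stand-in for the differentiable simulator S_psi: a smooth
    soft-quantization, so gradients can flow through it during training."""
    step = q / 255.0
    t = img / step
    return (t - np.sin(2 * np.pi * t) / (2 * np.pi)) * step  # smooth round()

def sr_network(lr_img):
    """Placeholder SR network f_theta: nearest-neighbor 4x upsampling."""
    return lr_img.repeat(4, axis=0).repeat(4, axis=1)

lr = np.random.default_rng(0).random((16, 16))
# Training-time path: SR output passed through the *simulator* (differentiable).
train_out = codec_simulator(sr_network(lr), q=32)
# Inference-time path: SR output passed through the *real codec*.
infer_out = real_codec(sr_network(lr), q=32)
```

The key design point: the simulator tracks the codec closely enough that gradients computed through it are informative for the real recompression step.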
2. Diffusion-Based Codec Simulator Design
The VRPSR codec simulator employs a pre-trained conditional diffusion model, such as Stable Diffusion distilled by S3Diff, to approximate the transform of any codec type and bitrate. The compression process is reformulated as a conditional image-to-image generation problem.
Conditioning and Architecture
- The simulator receives a prompt constructed as a textual description, e.g., “A libx264 0.3 bpp compressed image”, augmented with learned embeddings for codec type (discrete) and bitrate (continuous).
- The model is based on the DDPM formulation, with the standard forward noising process $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t \mathbf{I}\big)$.
- The denoising network $\epsilon_\psi$ is trained via the standard noise-prediction objective, conditioned on the input image and codec metadata $c$: $\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x_0, \epsilon, t}\big[\,\|\epsilon - \epsilon_\psi(x_t, t, c)\|_2^2\,\big]$.
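The conditioning described above can be sketched as a prompt plus two embeddings. The vocabulary, embedding dimension, and sinusoidal bitrate encoding below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

# Hypothetical codec vocabulary; names follow the encoders used in the paper.
CODEC_IDS = {"libx264": 0, "libx265": 1, "vvenc": 2}

def build_condition(codec: str, bpp: float, embed_dim: int = 8):
    """Sketch of the simulator conditioning: a text prompt plus a learned
    discrete codec embedding and a continuous bitrate embedding."""
    prompt = f"A {codec} {bpp} bpp compressed image"
    rng = np.random.default_rng(0)
    codec_table = rng.standard_normal((len(CODEC_IDS), embed_dim))  # learned in practice
    codec_emb = codec_table[CODEC_IDS[codec]]
    # Sinusoidal encoding of the continuous bitrate (one common choice).
    freqs = 2.0 ** np.arange(embed_dim // 2)
    bpp_emb = np.concatenate([np.sin(bpp * freqs), np.cos(bpp * freqs)])
    return prompt, np.concatenate([codec_emb, bpp_emb])
```

Keeping the bitrate continuous (rather than bucketed) is what lets a single simulator interpolate to arbitrary bpp targets.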
Simulator Training Objective
Rather than pixel-wise MSE alone, VRPSR employs a composite perceptual divergence loss:

$$\mathcal{L}_{\text{sim}} = \lambda_{\text{MSE}}\,\mathcal{L}_{\text{MSE}} + \lambda_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}} + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}},$$

with recommended weights $\lambda_{\text{MSE}}, \lambda_{\text{LPIPS}}, \lambda_{\text{GAN}}$. Simulator parameters $\psi$ are updated by minimizing $\mathcal{L}_{\text{sim}}(x_c, \hat{x}_c)$, where $x_c$ is the true codec output and $\hat{x}_c = S_\psi(x, c)$ is the simulator's prediction.
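A minimal sketch of the composite loss, with `lpips_fn` and `gan_fn` standing in for a pretrained LPIPS network and a discriminator realism term; the weight values are placeholders, not the paper's recommended settings:

```python
import numpy as np

def mse_loss(a, b):
    return float(np.mean((a - b) ** 2))

def composite_sim_loss(pred, target, lpips_fn, gan_fn,
                       w_mse=1.0, w_lpips=1.0, w_gan=0.1):
    """Weighted sum of pixel, perceptual, and adversarial terms, matching
    the composite divergence structure above (placeholder weights)."""
    return (w_mse * mse_loss(pred, target)
            + w_lpips * lpips_fn(pred, target)
            + w_gan * gan_fn(pred))
```

In practice `lpips_fn` would be a frozen perceptual network and `gan_fn` a jointly trained discriminator; both only need to be differentiable callables for this composition to work.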
3. Super-Resolution Network Training
With the simulator $S_\psi$ frozen post-convergence, the SR network $f_\theta$ is trained by backpropagation through the simulator, subject to several key strategies:
- Perceptual Targets: The simulator is trained with perceptual metrics, ensuring that subsequent SR optimization reflects actual perceptual quality rather than only pixel fidelity.
- Slightly Compressed Supervision: Instead of the clean ground truth $x$, a soft target $\tilde{x} = C_{q'}(x)$ is used, with $q'$ requesting a marginally lower bitrate; this stabilizes learning because the target resembles outputs the codec can actually reproduce.
- Two-Stage Optimization: Sequential pre-training of the simulator, followed by SR fine-tuning with simulator weights frozen.
- No Straight-Through Estimator: The framework avoids straight-through gradient estimators, instead backpropagating directly through the simulator's outputs during training.
- Loss Function: The SR objective combines pixel-level, perceptual, and adversarial terms:

$$\mathcal{L}_{\text{SR}} = \mu_{\text{MSE}}\,\mathcal{L}_{\text{MSE}} + \mu_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}} + \mu_{\text{GAN}}\,\mathcal{L}_{\text{GAN}},$$

with per-term weights $\mu_{\text{MSE}}, \mu_{\text{LPIPS}}, \mu_{\text{GAN}}$.
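One SR training step under these strategies can be sketched end to end: the loss is computed on the *simulated recompression* of the SR output, against a slightly compressed soft target, with the simulator held frozen. The components below are toys (soft quantization, a one-parameter "SR network", finite-difference gradients in place of autograd), chosen only to make the gradient flow concrete:

```python
import numpy as np

def simulator(img, q):
    """Frozen differentiable codec stand-in (smooth soft-quantization)."""
    step = q / 255.0
    t = img / step
    return (t - np.sin(2 * np.pi * t) / (2 * np.pi)) * step

def sr_forward(lr_img, theta):
    """Toy one-parameter 'SR network': gain applied after 2x upsampling."""
    return theta * lr_img.repeat(2, axis=0).repeat(2, axis=1)

def sr_training_loss(theta, lr_img, soft_target, q=32):
    """L(theta) = MSE( S(f_theta(y)), x_tilde ): the SR output is pushed
    through the frozen simulator before being compared to the target."""
    recompressed = simulator(sr_forward(lr_img, theta), q)
    return float(np.mean((recompressed - soft_target) ** 2))

rng = np.random.default_rng(1)
y = rng.random((8, 8))
# "Slightly compressed" supervision: target is itself a codec-like output.
x_tilde = simulator(y.repeat(2, axis=0).repeat(2, axis=1), q=28)

# One finite-difference gradient step on theta (a real implementation
# would backpropagate through the frozen simulator with autograd).
theta, eps, lr_rate = 1.5, 1e-4, 0.2
g = (sr_training_loss(theta + eps, y, x_tilde)
     - sr_training_loss(theta - eps, y, x_tilde)) / (2 * eps)
theta_new = theta - lr_rate * g
```

The point of the sketch is the composition order: gradients reach $f_\theta$ only because the simulator between it and the loss is differentiable.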
4. Experimental Methodology
Datasets and Architectures
- Training set: ImageNet train split, synthetic degradation as per Real-ESRGAN protocols.
- Evaluation: Kodak dataset (24 images) and an ImageNet validation subset (1,000 images).
- SR Networks: Real-ESRGAN (GAN-based), S3Diff (diffusion-based).
- Codecs Simulated: H.264 (x264, CQP), H.265 (x265), H.266 (VVenc), single-frame only.
Training Procedures
- Simulator Pre-Training: 60,000 steps with the AdamW optimizer (weight decay 0.01), batch size 64 distributed over 8 × A800 GPUs.
- SR Fine-Tuning: 20,000 steps for S3Diff and 40,000 steps for Real-ESRGAN, using the same optimizer with linear warm-up followed by cosine decay scheduling.
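The warm-up-then-cosine schedule admits a compact closed form. The base learning rate and warm-up length below are illustrative, as the summary does not specify them:

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps):
    """Linear warm-up from ~0 to base_lr over warmup_steps, then cosine
    decay from base_lr to 0 over the remaining steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

This is the standard realization of such a schedule; framework-provided schedulers (e.g., cosine annealing with warm-up) implement the same curve.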
Metrics
- Standard metrics: PSNR, SSIM.
- Perceptual metrics: LPIPS, DISTS, FID.
- Bitrate comparison: Bjontegaard Delta Bitrate (BDBR).
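BDBR summarizes the average bitrate gap between two rate-distortion curves at equal quality. A minimal sketch of the standard recipe (cubic fit of log-rate vs. quality, integrated over the overlapping quality range); any quality metric, including the perceptual ones above, can take the place of PSNR:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard Delta bitrate in percent (negative = test saves bitrate).

    Fits cubic polynomials of log-rate as a function of quality for both
    RD curves, averages their horizontal gap over the shared quality
    range, and converts the log-domain gap back to a percentage."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(p_a), np.polyint(p_t)
    avg_diff = ((np.polyval(it, hi) - np.polyval(it, lo))
                - (np.polyval(ia, hi) - np.polyval(ia, lo))) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0
```

For example, a test curve reaching the same quality at half the bitrate everywhere yields a BD-rate of −50%.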
5. Quantitative and Qualitative Results
Bitrate Savings
VRPSR demonstrates substantial bitrate savings versus baseline SR under fixed perceptual quality:
| Backbone | Codec | BDBR Savings (%) |
|---|---|---|
| Real-ESRGAN | H.264 | 17.4 |
| Real-ESRGAN | H.265 | 26.6 |
| Real-ESRGAN | H.266 | 33.4 |
| S3Diff | H.264 | 27.1 |
| S3Diff | H.265 | 35.2 |
| S3Diff | H.266 | 53.7 |
Perceptual Quality at Fixed bpp
On the Kodak dataset at 0.16 bpp (H.264):
- Real-ESRGAN: LPIPS improves from 0.403 to 0.388; FID improves from 102.9 to 90.2.
- S3Diff: LPIPS improves from 0.436 to 0.404; FID improves from 118.9 to 89.5.
Visual results illustrate that VRPSR delivers sharper reconstructions with reduced blocking and ringing artifacts, especially on high-frequency textures such as brick and foliage.
6. Analysis and Discussion
Generalizability and Differentiability
- The conditional diffusion simulator enables a single model to mimic H.264, H.265, and H.266 at arbitrary bpp levels without retraining.
- Using a perceptual loss for both simulator and SR network aligns gradients with perceptual, rather than pixelwise, image fidelity.
- The differentiable pipeline sidesteps the non-differentiability of standard codecs, permitting end-to-end optimization for recompressed perceptual quality.
Limitations
- At extreme compression (lowest bpp regimes), residual artifacts persist. An optional joint post-processing network can be appended and co-trained with the SR model to further mitigate downstream artifacts.
- Only codec type and bpp are directly parameterizable; additional codec controls (e.g., CRF, presets) are not yet integrated.
Future Directions
- Extending the diffusion-based simulator to model video codecs with temporal dependencies (e.g., GOP, B-frames).
- Broader codec parameter support via richer prompt embeddings.
- Building fully neural codecs leveraging the diffusion backbone for direct compression of clean images.
In summary, VRPSR equips existing perceptual super-resolution approaches with explicit awareness of downstream recompression, achieving significant bitrate efficiencies and enhanced perceptual quality across diverse modern codecs by integrating a differentiable conditional diffusion codec simulator and tailored SR training strategies (He et al., 22 Nov 2025).