Versatile Recompression-Aware Perceptual SR
- The paper introduces VRPSR, a novel framework that integrates a differentiable, diffusion-based codec simulator to optimize perceptual super-resolution in the presence of lossy recompression.
- It achieves significant bitrate savings (up to 53.7% on H.266) and improved perceptual quality using a composite loss function that combines MSE, LPIPS, and GAN-based metrics.
- The method employs a two-stage training process that pre-trains a codec simulator before fine-tuning the SR network, enabling end-to-end optimization under variable codec conditions.
Versatile Recompression-Aware Perceptual Super-Resolution (VRPSR) is a perceptual super-resolution (SR) framework designed to optimize high-resolution image reconstruction in the presence of downstream lossy recompression, particularly for practical scenarios where restored images are subsequently compressed for storage or transmission. The method addresses the limitations of conventional perceptual SR pipelines, which ignore the non-differentiable and highly variable nature of modern codecs (e.g., H.264, H.265, H.266), by explicitly integrating codec simulation into the SR optimization process (He et al., 22 Nov 2025).
1. Formal Framework and Problem Formulation
Given an unknown high-resolution image $x$, a degraded low-resolution observation $y$, and a black-box codec $C_q$ parameterized by quality factor $q$, the classical pipeline consists of a degradation model followed by direct compression: $\hat{x}_{\text{direct}} = C_q(y)$, where $b(q)$ is the resulting bitrate. Super-resolution methods apply an SR network $f_\theta$ to obtain a restoration $\hat{x} = f_\theta(y)$, which is then recompressed at a matched bitrate: $\hat{x}_{\text{rec}} = C_q(f_\theta(y))$. The optimization target for recompression-aware SR is $\min_\theta \mathbb{E}\big[\, d\big(x,\, C_q(f_\theta(y))\big) \,\big]$, with $d(\cdot,\cdot)$ representing perceptual distortion metrics such as LPIPS or adversarially trained GAN losses. Since $C_q$ is non-differentiable and the parameter $q$ varies across deployments, direct end-to-end optimization is infeasible.
VRPSR resolves this by introducing (1) a fully differentiable diffusion-based codec simulator $S_\psi(\cdot, c)$, where the condition $c$ encodes codec metadata (“codec type + target bpp”), and (2) an SR network $f_\theta$ trained to optimize perceptual quality post-recompression via the simulator. At inference the simulator is discarded and the real codec is applied to the SR output: $\hat{x}_{\text{final}} = C_q(f_\theta(y))$.
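The division of labor between the non-differentiable real codec and the differentiable simulator can be sketched as follows. Everything here is a toy stand-in for illustration, not the paper's components: the "codec" is hard quantization, the "simulator" is a smooth soft-quantization, and the "SR network" is nearest-neighbor upsampling.

```python
import numpy as np

def real_codec(img, q):
    """Stand-in for the black-box codec C_q: hard quantization whose step
    size grows with the quality parameter q (non-differentiable)."""
    step = q / 255.0
    return np.round(img / step) * step

def codec_simulator(img, q):
    """Stand-in for the differentiable simulator S_psi: a smooth
    soft-quantization, so gradients can flow through it during training."""
    step = q / 255.0
    t = img / step
    return (t - np.sin(2 * np.pi * t) / (2 * np.pi)) * step  # smooth round()

def sr_network(lr_img):
    """Placeholder SR network f_theta: nearest-neighbor 4x upsampling."""
    return lr_img.repeat(4, axis=0).repeat(4, axis=1)

lr = np.random.default_rng(0).random((16, 16))
# Training-time path: SR output passed through the *simulator* (differentiable).
train_out = codec_simulator(sr_network(lr), q=32)
# Inference-time path: SR output passed through the *real codec*.
infer_out = real_codec(sr_network(lr), q=32)
```

The key design point: the simulator tracks the codec closely enough that gradients computed through it are informative for the real recompression step.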
2. Diffusion-Based Codec Simulator Design
The VRPSR codec simulator employs a pre-trained conditional diffusion model, such as Stable Diffusion distilled by S3Diff, to approximate the transform of any codec type and bitrate. The compression process is reformulated as a conditional image-to-image generation problem.
Conditioning and Architecture
- The simulator receives a prompt constructed as a textual description, e.g., “A libx264 0.3 bpp compressed image”, augmented with learned embeddings for codec type (discrete) and bitrate (continuous).
- The model is based on the DDPM formulation, with the standard forward noising process $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t \mathbf{I}\big)$.
- The denoising network $\epsilon_\psi$ is trained via the standard noise-prediction objective, conditioned on the input image and codec metadata $c$: $\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x_0, \epsilon, t}\big[\,\|\epsilon - \epsilon_\psi(x_t, t, c)\|_2^2\,\big]$.
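The conditioning described above can be sketched as a prompt plus two embeddings. The vocabulary, embedding dimension, and sinusoidal bitrate encoding below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

# Hypothetical codec vocabulary; names follow the encoders used in the paper.
CODEC_IDS = {"libx264": 0, "libx265": 1, "vvenc": 2}

def build_condition(codec: str, bpp: float, embed_dim: int = 8):
    """Sketch of the simulator conditioning: a text prompt plus a learned
    discrete codec embedding and a continuous bitrate embedding."""
    prompt = f"A {codec} {bpp} bpp compressed image"
    rng = np.random.default_rng(0)
    codec_table = rng.standard_normal((len(CODEC_IDS), embed_dim))  # learned in practice
    codec_emb = codec_table[CODEC_IDS[codec]]
    # Sinusoidal encoding of the continuous bitrate (one common choice).
    freqs = 2.0 ** np.arange(embed_dim // 2)
    bpp_emb = np.concatenate([np.sin(bpp * freqs), np.cos(bpp * freqs)])
    return prompt, np.concatenate([codec_emb, bpp_emb])
```

Keeping the bitrate continuous (rather than bucketed) is what lets a single simulator interpolate to arbitrary bpp targets.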
Simulator Training Objective
Rather than pixel-wise MSE alone, VRPSR employs a composite perceptual divergence loss:

$$\mathcal{L}_{\text{sim}} = \lambda_{\text{MSE}}\,\mathcal{L}_{\text{MSE}} + \lambda_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}} + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}},$$

with recommended weights $\lambda_{\text{MSE}}, \lambda_{\text{LPIPS}}, \lambda_{\text{GAN}}$. Simulator parameters $\psi$ are updated by minimizing $\mathcal{L}_{\text{sim}}(x_c, \hat{x}_c)$, where $x_c$ is the true codec output and $\hat{x}_c = S_\psi(x, c)$ is the simulator's prediction.
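A minimal sketch of the composite loss, with `lpips_fn` and `gan_fn` standing in for a pretrained LPIPS network and a discriminator realism term; the weight values are placeholders, not the paper's recommended settings:

```python
import numpy as np

def mse_loss(a, b):
    return float(np.mean((a - b) ** 2))

def composite_sim_loss(pred, target, lpips_fn, gan_fn,
                       w_mse=1.0, w_lpips=1.0, w_gan=0.1):
    """Weighted sum of pixel, perceptual, and adversarial terms, matching
    the composite divergence structure above (placeholder weights)."""
    return (w_mse * mse_loss(pred, target)
            + w_lpips * lpips_fn(pred, target)
            + w_gan * gan_fn(pred))
```

In practice `lpips_fn` would be a frozen perceptual network and `gan_fn` a jointly trained discriminator; both only need to be differentiable callables for this composition to work.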
3. Super-Resolution Network Training
With the simulator $S_\psi$ frozen post-convergence, the SR network $f_\theta$ is trained by backpropagation through the simulator, subject to several key strategies:
- Perceptual Targets: The simulator is trained with perceptual metrics, ensuring that subsequent SR optimization reflects actual perceptual quality rather than only pixel fidelity.
- Slightly Compressed Supervision: Instead of the clean ground truth $x$, a soft target $\tilde{x} = C_{q'}(x)$ is used, with $q'$ requesting a marginally lower bitrate; this stabilizes learning because the target resembles outputs the codec can actually reproduce.
- Two-Stage Optimization: Sequential pre-training of the simulator, followed by SR fine-tuning with simulator weights frozen.
- No Straight-Through Estimator: The framework avoids straight-through gradient estimators, instead backpropagating directly through the simulator's outputs during training.
- Loss Function: The SR objective combines pixel-level, perceptual, and adversarial terms:

$$\mathcal{L}_{\text{SR}} = \mu_{\text{MSE}}\,\mathcal{L}_{\text{MSE}} + \mu_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}} + \mu_{\text{GAN}}\,\mathcal{L}_{\text{GAN}},$$

with per-term weights $\mu_{\text{MSE}}, \mu_{\text{LPIPS}}, \mu_{\text{GAN}}$.
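One SR training step under these strategies can be sketched end to end: the loss is computed on the *simulated recompression* of the SR output, against a slightly compressed soft target, with the simulator held frozen. The components below are toys (soft quantization, a one-parameter "SR network", finite-difference gradients in place of autograd), chosen only to make the gradient flow concrete:

```python
import numpy as np

def simulator(img, q):
    """Frozen differentiable codec stand-in (smooth soft-quantization)."""
    step = q / 255.0
    t = img / step
    return (t - np.sin(2 * np.pi * t) / (2 * np.pi)) * step

def sr_forward(lr_img, theta):
    """Toy one-parameter 'SR network': gain applied after 2x upsampling."""
    return theta * lr_img.repeat(2, axis=0).repeat(2, axis=1)

def sr_training_loss(theta, lr_img, soft_target, q=32):
    """L(theta) = MSE( S(f_theta(y)), x_tilde ): the SR output is pushed
    through the frozen simulator before being compared to the target."""
    recompressed = simulator(sr_forward(lr_img, theta), q)
    return float(np.mean((recompressed - soft_target) ** 2))

rng = np.random.default_rng(1)
y = rng.random((8, 8))
# "Slightly compressed" supervision: target is itself a codec-like output.
x_tilde = simulator(y.repeat(2, axis=0).repeat(2, axis=1), q=28)

# One finite-difference gradient step on theta (a real implementation
# would backpropagate through the frozen simulator with autograd).
theta, eps, lr_rate = 1.5, 1e-4, 0.2
g = (sr_training_loss(theta + eps, y, x_tilde)
     - sr_training_loss(theta - eps, y, x_tilde)) / (2 * eps)
theta_new = theta - lr_rate * g
```

The point of the sketch is the composition order: gradients reach $f_\theta$ only because the simulator between it and the loss is differentiable.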
4. Experimental Methodology
Datasets and Architectures
- Training set: ImageNet train split, synthetic degradation as per Real-ESRGAN protocols.
- Evaluation: Kodak dataset (24 images) and an ImageNet validation subset (1,000 images).
- SR Networks: Real-ESRGAN (GAN-based), S3Diff (diffusion-based).
- Codecs Simulated: H.264 (x264, CQP), H.265 (x265), H.266 (VVenc), single-frame only.
Training Procedures
- Simulator Pre-Training: 60,000 steps with the AdamW optimizer (weight decay 0.01), batch size 64 distributed over 8 × A800 GPUs.
- SR Fine-Tuning: 20,000 steps for S3Diff and 40,000 steps for Real-ESRGAN, using the same optimizer with linear warm-up followed by cosine decay scheduling.
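The warm-up-then-cosine schedule admits a compact closed form. The base learning rate and warm-up length below are illustrative, as the summary does not specify them:

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps):
    """Linear warm-up from ~0 to base_lr over warmup_steps, then cosine
    decay from base_lr to 0 over the remaining steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

This is the standard realization of such a schedule; framework-provided schedulers (e.g., cosine annealing with warm-up) implement the same curve.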
Metrics
- Standard metrics: PSNR, SSIM.
- Perceptual metrics: LPIPS, DISTS, FID.
- Bitrate comparison: Bjontegaard Delta Bitrate (BDBR).
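BDBR summarizes the average bitrate gap between two rate-distortion curves at equal quality. A minimal sketch of the standard recipe (cubic fit of log-rate vs. quality, integrated over the overlapping quality range); any quality metric, including the perceptual ones above, can take the place of PSNR:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard Delta bitrate in percent (negative = test saves bitrate).

    Fits cubic polynomials of log-rate as a function of quality for both
    RD curves, averages their horizontal gap over the shared quality
    range, and converts the log-domain gap back to a percentage."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(p_a), np.polyint(p_t)
    avg_diff = ((np.polyval(it, hi) - np.polyval(it, lo))
                - (np.polyval(ia, hi) - np.polyval(ia, lo))) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0
```

For example, a test curve reaching the same quality at half the bitrate everywhere yields a BD-rate of −50%.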
5. Quantitative and Qualitative Results
Bitrate Savings
VRPSR demonstrates substantial bitrate savings versus baseline SR under fixed perceptual quality:
| Backbone | Codec | BDBR Savings (%) |
|---|---|---|
| Real-ESRGAN | H.264 | 17.4 |
| Real-ESRGAN | H.265 | 26.6 |
| Real-ESRGAN | H.266 | 33.4 |
| S3Diff | H.264 | 27.1 |
| S3Diff | H.265 | 35.2 |
| S3Diff | H.266 | 53.7 |
Perceptual Quality at Fixed bpp
On the Kodak dataset at 0.16 bpp (H.264):
- Real-ESRGAN: LPIPS improves from 0.403 to 0.388; FID improves from 102.9 to 90.2.
- S3Diff: LPIPS improves from 0.436 to 0.404; FID improves from 118.9 to 89.5.
Visual results illustrate that VRPSR delivers sharper reconstructions with reduced blocking and ringing artifacts, especially on high-frequency textures such as brick and foliage.
6. Analysis and Discussion
Generalizability and Differentiability
- The conditional diffusion simulator enables a single model to mimic H.264, H.265, and H.266 at arbitrary bpp levels without retraining.
- Using a perceptual loss for both simulator and SR network aligns gradients with perceptual, rather than pixelwise, image fidelity.
- The differentiable pipeline sidesteps the non-differentiability of standard codecs, permitting end-to-end optimization for recompressed perceptual quality.
Limitations
- At extreme compression (lowest bpp regimes), residual artifacts persist. An optional joint post-processing network can be appended and co-trained with the SR model to further mitigate downstream artifacts.
- Only codec type and bpp are directly parameterizable; additional codec controls (e.g., CRF, presets) are not yet integrated.
Future Directions
- Extending the diffusion-based simulator to model video codecs with temporal dependencies (e.g., GOP, B-frames).
- Broader codec parameter support via richer prompt embeddings.
- Building fully neural codecs leveraging the diffusion backbone for direct compression of clean images.
In summary, VRPSR equips existing perceptual super-resolution approaches with explicit awareness of downstream recompression, achieving significant bitrate efficiencies and enhanced perceptual quality across diverse modern codecs by integrating a differentiable conditional diffusion codec simulator and tailored SR training strategies (He et al., 22 Nov 2025).