
Versatile Recompression-Aware Perceptual SR

Updated 27 November 2025
  • The paper introduces VRPSR, a novel framework that integrates a differentiable, diffusion-based codec simulator to optimize perceptual super-resolution in the presence of lossy recompression.
  • It achieves significant bitrate savings (up to 53.7% on H.266) and improved perceptual quality using a composite loss function that combines MSE, LPIPS, and GAN-based metrics.
  • The method employs a two-stage training process that pre-trains a codec simulator before fine-tuning the SR network, enabling end-to-end optimization under variable codec conditions.

Versatile Recompression-Aware Perceptual Super-Resolution (VRPSR) is a perceptual super-resolution (SR) framework designed to optimize high-resolution image reconstruction in the presence of downstream lossy recompression, particularly for practical scenarios where restored images are subsequently compressed for storage or transmission. The method addresses the limitations of conventional perceptual SR pipelines, which ignore the non-differentiable and highly variable nature of modern codecs (e.g., H.264, H.265, H.266), by explicitly integrating codec simulation into the SR optimization process (He et al., 22 Nov 2025).

1. Formal Framework and Problem Formulation

Given an unknown high-resolution image $X\in\mathbb{R}^{H\times W\times 3}$, a degraded low-resolution observation $\tilde X$, and a black-box codec $f(\cdot\,;q)$ parameterized by quality factor $q$, the classical pipeline consists of a degradation model followed by direct compression:

$$\tilde X\sim p_C(\tilde X\mid X),\qquad (\bar X,\bar R) = f(\tilde X;\,q),$$

where $\bar R$ is the bitrate. Super-resolution methods apply an SR network $g_\phi$ to obtain a restoration $X' = g_\phi(\tilde X)$, which is then recompressed at a matched bitrate:

$$(\hat X,\hat R) = f(X';\,q')\qquad\text{with}\quad \hat R\approx \bar R.$$

The optimization target for recompression-aware SR is

$$\phi^* =\arg\min_\phi\, \Delta_S(\hat X,\,X),$$

with $\Delta_S$ denoting a perceptual distortion metric such as LPIPS or an adversarially trained GAN loss. Since $f(\cdot)$ is non-differentiable and the parameter $q$ varies across deployments, direct end-to-end optimization is infeasible.

VRPSR resolves this by introducing (1) a fully differentiable diffusion-based codec simulator $f_\theta(X',c)$, where $c$ encodes codec metadata (codec type + target bpp), and (2) an SR network $g_\phi$ trained to optimize perceptual quality after recompression via the simulator. The VRPSR inference pipeline is

$$X' = g_{\phi}(\tilde X),\qquad \hat X' = f_\theta(X',\,c).$$
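The two-stage inference pipeline can be sketched as simple function composition. The following is a minimal illustration with hypothetical placeholder networks; `sr_network` and `codec_simulator` here are toy stand-ins (nearest-neighbor upsampling and quantization), not the paper's actual models:

```python
import numpy as np

def sr_network(lr_image, scale=4):
    """Placeholder for g_phi: nearest-neighbor upsampling stands in for the SR network."""
    return lr_image.repeat(scale, axis=0).repeat(scale, axis=1)

def codec_simulator(hr_image, codec="libx264", bpp=0.3):
    """Placeholder for f_theta: mimics mild quantization loss.
    The real simulator is a conditional diffusion model."""
    levels = 64
    return np.round(hr_image * (levels - 1)) / (levels - 1)

# VRPSR inference: X' = g_phi(X_tilde), then X_hat' = f_theta(X', c)
x_tilde = np.random.rand(128, 128, 3).astype(np.float32)    # degraded LR observation
x_prime = sr_network(x_tilde)                               # restored HR image
x_hat = codec_simulator(x_prime, codec="libx264", bpp=0.3)  # simulated recompression

print(x_prime.shape)  # (512, 512, 3)
```

The point of the composition is that, unlike a real codec call, every stage is differentiable, so gradients can flow from a loss on `x_hat` back into the SR network's parameters.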

2. Diffusion-Based Codec Simulator Design

The VRPSR codec simulator employs a pre-trained conditional diffusion model, such as Stable Diffusion distilled by S3Diff, to approximate the transform of any codec type and bitrate. The compression process is reformulated as a conditional image-to-image generation problem.

Conditioning and Architecture

  • The simulator receives a prompt $c$ constructed as a textual description, e.g., “A {libx264 {0.3} bpp compressed image}”, augmented with learned embeddings for codec type (discrete) and bitrate (continuous).
  • The model is based on the DDPM formulation:

$$q(x_t\mid x_{t-1},c) = \mathcal{N}\!\left(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)\mathbf I\right)$$

$$p_\theta(x_{t-1}\mid x_t,c) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t,t,c),\ \sigma_t^2\mathbf I\right)$$

  • The denoising network is trained via:

$$\mathcal{L}_{\mathrm{denoise}} = \mathbb{E}_{x_0,\varepsilon,t}\left\|\varepsilon - \varepsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,\ t,\ c\right)\right\|^2_2,$$

where $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$ is the cumulative noise schedule.
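A minimal numpy sketch of this ε-prediction objective follows. The denoiser here is a toy stand-in that predicts zero noise (the real $\varepsilon_\theta$ is a conditioned diffusion U-Net), and the prompt string follows the template quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # \bar\alpha_t = prod_{s<=t} alpha_s

def eps_theta(x_t, t, prompt):
    """Toy stand-in for the conditioned denoiser; predicts zero noise."""
    return np.zeros_like(x_t)

x0 = rng.standard_normal((8, 8, 3))   # a training image (here: random toy data)
t = 500
eps = rng.standard_normal(x0.shape)

# forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

prompt = "A libx264 0.3 bpp compressed image"  # codec-type + bitrate condition
loss = np.mean((eps - eps_theta(x_t, t, prompt)) ** 2)
```

With a zero-predicting denoiser the loss is just the mean squared noise, which is what a real training run would drive down by learning to predict ε from the noised, codec-conditioned input.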

Simulator Training Objective

Rather than pixel-wise MSE, VRPSR employs a composite perceptual divergence loss:

$$\Delta_S(A, B) = \lambda_\mathrm{MSE}\, \|A-B\|^2_2 + \lambda_\mathrm{LPIPS}\, \mathrm{LPIPS}(A,B) + \lambda_\mathrm{GAN}\, \mathcal{L}_\mathrm{GAN}(A)$$

with recommended weights $\{\lambda_\mathrm{MSE},\lambda_\mathrm{LPIPS},\lambda_\mathrm{GAN}\}=\{2.0,\,5.0,\,0.5\}$. Simulator parameters $\theta$ are updated by

$$\theta^* = \arg\min_\theta \; \Delta_S(\hat X', \hat X),$$

where $\hat X$ is the true codec output and $\hat X'$ is the simulator's prediction.
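In code, the composite objective might look like the following sketch, where `lpips_distance` and `gan_loss` are hypothetical placeholders standing in for the real LPIPS network and adversarial critic:

```python
import numpy as np

def lpips_distance(a, b):
    # Placeholder: real LPIPS compares deep-feature activations, not raw pixels.
    return float(np.mean(np.abs(a - b)))

def gan_loss(a):
    # Placeholder non-saturating generator loss against a hypothetical critic.
    critic_score = float(np.mean(a))                         # stand-in for D(a)
    return -np.log(1e-8 + 1.0 / (1.0 + np.exp(-critic_score)))  # -log sigmoid(D(a))

def delta_s(a, b, w_mse=2.0, w_lpips=5.0, w_gan=0.5):
    """Composite perceptual divergence with the paper's recommended weights."""
    mse = float(np.mean((a - b) ** 2))
    return w_mse * mse + w_lpips * lpips_distance(a, b) + w_gan * gan_loss(a)

sim_out = np.random.rand(64, 64, 3)    # simulator prediction \hat X'
codec_out = np.random.rand(64, 64, 3)  # true codec output \hat X
loss = delta_s(sim_out, codec_out)
```

The design choice to train the simulator with this perceptual composite, rather than plain MSE, is what later lets the SR network receive perceptually meaningful gradients through it.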

3. Super-Resolution Network Training

With $f_\theta$ frozen after convergence, the SR network $g_\phi$ is trained by backpropagation through the simulator, subject to several key strategies:

  • Perceptual Targets: The simulator is trained with perceptual metrics, ensuring that subsequent SR optimization reflects actual perceptual quality rather than only pixel fidelity.
  • Slightly Compressed Supervision: Instead of the clean $X$, a soft target $X_s = f_{\theta}(X, c')$ is used, with $c'$ requesting a marginally lower bitrate; this stabilizes learning because the target resembles codec-reproducible outputs.
  • Two-Stage Optimization: Sequential pre-training of the simulator, followed by SR fine-tuning with simulator weights frozen.
  • No Straight-Through Estimator: The framework avoids straight-through estimators, relying solely on the simulator's differentiable outputs for training.
  • Loss Function: The SR objective combines pixel-level, perceptual, and adversarial terms:

$$\mathcal{L}_{\mathrm{VRPSR}}(\phi) = \lambda_{\mathrm{rec}}\, \|g_{\phi}(\tilde X) - X_s\|_1 + \lambda_{\mathrm{perc}}\, \mathrm{LPIPS}(\hat X', X_s) + \lambda_{\mathrm{adv}}\, \mathcal{L}_{\mathrm{GAN}}(\hat X')$$

typically with $\{\lambda_{\mathrm{rec}}, \lambda_{\mathrm{perc}}, \lambda_{\mathrm{adv}}\} = \{1.0,\,5.0,\,0.5\}$.
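The SR objective with its soft target can be sketched as follows; as above, `lpips_distance` and `gan_loss` are hypothetical placeholder metrics, not the paper's implementations:

```python
import numpy as np

def lpips_distance(a, b):
    return float(np.mean(np.abs(a - b)))   # placeholder for LPIPS

def gan_loss(a):
    return float(np.mean(a ** 2)) + 1e-6   # placeholder adversarial term

def vrpsr_loss(sr_out, sim_out, soft_target, w_rec=1.0, w_perc=5.0, w_adv=0.5):
    """L_VRPSR: L1 between the SR output and the soft target X_s, plus perceptual
    and adversarial terms on the simulated recompression \\hat X'."""
    rec = float(np.mean(np.abs(sr_out - soft_target)))
    return (w_rec * rec
            + w_perc * lpips_distance(sim_out, soft_target)
            + w_adv * gan_loss(sim_out))

x_s = np.random.rand(64, 64, 3)   # soft target f_theta(X, c') at slightly lower bitrate
sr = np.random.rand(64, 64, 3)    # g_phi(X_tilde)
sim = np.random.rand(64, 64, 3)   # f_theta(sr, c): simulated recompression of sr
loss = vrpsr_loss(sr, sim, x_s)
```

Note that the reconstruction term compares the *pre-recompression* SR output against $X_s$, while the perceptual and adversarial terms are applied to the *simulated post-recompression* image, which is exactly where the frozen simulator supplies gradients.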

4. Experimental Methodology

Datasets and Architectures

  • Training set: ImageNet train split, synthetic degradation as per Real-ESRGAN protocols.
  • Evaluation: Kodak dataset (24 images) and ImageNet validation set (1,000 images), all at $512\times512$ resolution.
  • SR Networks: Real-ESRGAN (GAN-based), S3Diff (diffusion-based).
  • Codecs Simulated: H.264 (x264, CQP), H.265 (x265), H.266 (VVenc), single-frame only.
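The Real-ESRGAN-style synthetic degradation can be approximated by a blur–downsample–noise chain. This is a rough single-stage sketch under that assumption; the actual protocol uses randomized second-order degradations, including resizing kernels and JPEG compression, that are not reproduced here:

```python
import numpy as np

def degrade(hr, scale=4, noise_sigma=0.02, rng=None):
    """Toy degradation: 3x3 box blur, subsampling, additive Gaussian noise."""
    rng = rng or np.random.default_rng(0)
    k = 3
    pad = np.pad(hr, ((k // 2, k // 2), (k // 2, k // 2), (0, 0)), mode="edge")
    blurred = np.zeros_like(hr)
    for dy in range(k):                      # accumulate the box-filter window
        for dx in range(k):
            blurred += pad[dy:dy + hr.shape[0], dx:dx + hr.shape[1]]
    blurred /= k * k
    lr = blurred[::scale, ::scale]                    # downsample by `scale`
    lr = lr + rng.normal(0, noise_sigma, lr.shape)    # sensor-like noise
    return np.clip(lr, 0.0, 1.0)

hr = np.random.rand(512, 512, 3)
lr = degrade(hr)  # -> shape (128, 128, 3)
```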

Training Procedures

  • Simulator Pre-Training: 60,000 steps, AdamW ($\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, weight decay 0.01, learning rate $2\times10^{-5}$), batch size 64 distributed over 8 × A800 GPUs.
  • SR Fine-Tuning: S3Diff for 20,000 steps and Real-ESRGAN for 40,000 steps, with the same optimizer and a linear warm-up followed by cosine-decay learning-rate scheduling.
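The warm-up-then-cosine schedule can be written as a small closed-form function. The warm-up length below is an illustrative choice, not a value from the paper:

```python
import math

def lr_at(step, total_steps=20_000, warmup_steps=500, base_lr=2e-5):
    """Linear warm-up to base_lr, then cosine decay to zero.
    warmup_steps=500 is an assumption; the paper does not specify it."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Peak learning rate is reached exactly at the end of warm-up,
# and the schedule decays to zero at the final step.
peak = lr_at(500)    # == 2e-5
final = lr_at(20_000)  # == 0.0
```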

Metrics

  • Standard metrics: PSNR, SSIM.
  • Perceptual metrics: LPIPS, DISTS, FID.
  • Bitrate comparison: Bjontegaard Delta Bitrate (BDBR).
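BDBR compares two rate–quality curves by fitting polynomials of log-rate as a function of quality and averaging the log-rate gap over the overlapping quality range. A compact numpy sketch of the standard Bjontegaard procedure (generic reference code, not from the paper):

```python
import numpy as np

def bd_rate(rate_anchor, qual_anchor, rate_test, qual_test):
    """Average % bitrate change of the test curve vs the anchor at equal quality
    (negative values mean bitrate savings)."""
    pa = np.polyfit(qual_anchor, np.log(rate_anchor), 3)
    pb = np.polyfit(qual_test, np.log(rate_test), 3)
    lo = max(qual_anchor.min(), qual_test.min())   # overlapping quality range
    hi = min(qual_anchor.max(), qual_test.max())
    ia, ib = np.polyint(pa), np.polyint(pb)
    avg_log_diff = ((np.polyval(ib, hi) - np.polyval(ib, lo))
                    - (np.polyval(ia, hi) - np.polyval(ia, lo))) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

q = np.array([30.0, 32.0, 34.0, 36.0])   # quality scores (e.g., PSNR)
r = np.array([0.10, 0.16, 0.25, 0.40])   # anchor bitrates (bpp)
print(round(bd_rate(r, q, r / 2, q), 1))  # halving the rate everywhere -> -50.0
```

The sanity check at the end: if the test curve needs half the anchor's bitrate at every quality point, the log-rate gap is a constant $\ln 0.5$ and BDBR comes out to exactly $-50\%$.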

5. Quantitative and Qualitative Results

Bitrate Savings

VRPSR demonstrates substantial bitrate savings versus baseline SR under fixed perceptual quality:

| Backbone | Codec | BDBR Savings (%) |
| --- | --- | --- |
| Real-ESRGAN | H.264 | 17.4 |
| Real-ESRGAN | H.265 | 26.6 |
| Real-ESRGAN | H.266 | 33.4 |
| S3Diff | H.264 | 27.1 |
| S3Diff | H.265 | 35.2 |
| S3Diff | H.266 | 53.7 |

Perceptual Quality at Fixed bpp

On the Kodak dataset at 0.16 bpp (H.264):

  • Real-ESRGAN: LPIPS improves from 0.403 to 0.388; FID improves from 102.9 to 90.2.
  • S3Diff: LPIPS improves from 0.436 to 0.404; FID improves from 118.9 to 89.5.

Visual results illustrate that VRPSR delivers sharper reconstructions with reduced blocking and ringing artifacts, especially on high-frequency textures such as brick and foliage.

6. Analysis and Discussion

Generalizability and Differentiability

  • The conditional diffusion simulator enables a single model to mimic H.264, H.265, and H.266 at arbitrary bpp levels without retraining.
  • Using a perceptual loss for both simulator and SR network aligns gradients with perceptual, rather than pixelwise, image fidelity.
  • The differentiable pipeline sidesteps the non-differentiability of standard codecs, permitting end-to-end optimization for recompressed perceptual quality.

Limitations

  • At extreme compression (lowest-bpp regimes), residual artifacts persist. An optional joint post-processing network $h_\psi(\hat X, c)$ can be appended and co-trained with the SR model to further mitigate downstream artifacts.
  • Only codec type and bpp are directly parameterizable; additional codec controls (e.g., CRF, presets) are not yet integrated.

Future Directions

  • Extending the diffusion-based simulator to model video codecs with temporal dependencies (e.g., GOP, B-frames).
  • Broader codec parameter support via richer prompt embeddings.
  • Building fully neural codecs leveraging the diffusion backbone for direct compression of clean images.

In summary, VRPSR equips existing perceptual super-resolution approaches with explicit awareness of downstream recompression, achieving significant bitrate efficiencies and enhanced perceptual quality across diverse modern codecs by integrating a differentiable conditional diffusion codec simulator and tailored SR training strategies (He et al., 22 Nov 2025).
