GuideSR: Dual-Branch Diffusion SR
- GuideSR is a diffusion-based image super-resolution approach that integrates a full-resolution Guidance Branch with a latent diffusion branch to preserve structure and enhance perceptual quality.
- It employs a dual-branch design where the Guidance Branch retains fine structural details and the Diffusion Branch, enhanced via LoRA tuning, boosts global perceptual metrics.
- GuideSR achieves state-of-the-art performance on multiple benchmarks while offering efficient single-step inference for real-time image restoration.
GuideSR is a single-step diffusion-based image super-resolution (SR) architecture designed for high-fidelity restoration of degraded inputs. Unlike prior methods that condition restoration on variational autoencoder (VAE) encodings—often at the cost of structural fidelity—GuideSR introduces a novel dual-branch system: a full-resolution Guidance Branch dedicated to structure preservation, and a Diffusion Branch leveraging a pretrained latent diffusion model to enhance perceptual quality. GuideSR demonstrates state-of-the-art performance across multiple SR benchmarks, combining fidelity and efficiency through innovative architectural and training strategies (Arora et al., 1 May 2025).
1. Motivation and Limitations of Prior Single-Step Diffusion SR
Image super-resolution (SR) aims to estimate a high-resolution (HR) image from a degraded low-resolution (LR) input. Diffusion-based SR methods such as SR3, StableSR, and DiffBIR have established diffusion priors as powerful tools for this task, but their need for tens to hundreds of denoising steps makes them impractical for real-time applications. Recent single-step approaches, including SinSR and OSEDiff, compress this process into a single UNet denoiser pass conditioned on a VAE-encoded latent of the LR image. This strategy has a cost: aggressive VAE downsampling (e.g., 8× per spatial dimension, as in the Stable Diffusion VAE) erases fine structural content, and because the VAE was trained on high-quality data it is ill-equipped for highly degraded sources. Consequently, these models tend to hallucinate textures at the expense of image-specific detail. GuideSR addresses these deficits by operating at full spatial resolution during guidance and fusing this structural information directly into the restoration pipeline.
2. Architecture Overview
GuideSR adopts a dual-branch design. Both branches are trained jointly under a shared adversarial discriminator, but only the Diffusion Branch output is used at inference. The two branches are:
- Guidance Branch: Maintains full spatial resolution to retain structural fidelity.
- Diffusion Branch: Utilizes a latent-space diffusion model to boost perceptual metrics.
The domain and key components of each branch are:
| Branch | Domain | Key Components |
|---|---|---|
| Guidance Branch | Pixel-space | FRBs, Channel Attention, IGN, PixelUnshuffle |
| Diffusion Branch | VAE Latent Space | Stable Diffusion Turbo v2.1 VAE, Prompt Extractor, UNet, LoRA finetuning |
This division enables the model to harness both detailed structure from the original LR input and global, perceptually plausible enhancements from diffusion modeling.
3. Architectural Details
3.1 Guidance Branch
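Input: the LR image, processed at full spatial resolution.
- Feature Extraction: initial features are extracted from the input image.
- Deep Encoding: sequential Full Resolution Blocks (FRBs), each with a residual-in-residual design and channel attention, refine these features without reducing spatial resolution.
- Image Guidance Network (IGN): a guided attention mechanism, after which the branch outputs a residual image.
- Cross-Scale Feature Fusion: the full-resolution features are downsampled via pixel-unshuffle to create multi-scale structural features, which are concatenated into the UNet encoder of the Diffusion Branch.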
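The pixel-unshuffle step can be illustrated with a small, dependency-free sketch. It rearranges every r×r spatial block into r² channels, so the spatial size shrinks by r while no values are discarded. The function name `pixel_unshuffle`, the nested-list layout, and the scale factors in the usage lines are assumptions for illustration. The channel ordering follows the convention of PyTorch's `torch.nn.PixelUnshuffle`; the scales and feature widths GuideSR actually uses are not given here.

```python
def pixel_unshuffle(features, r):
    """Rearrange a C x H x W nested list into (C*r*r) x (H/r) x (W/r).

    Every r x r spatial block becomes r*r channels, so the spatial size
    shrinks by r while all values are kept (unlike strided subsampling).
    """
    height, width = len(features[0]), len(features[0][0])
    if height % r or width % r:
        raise ValueError("height and width must be divisible by r")
    out = []
    for channel in features:
        for dy in range(r):          # row offset inside each block
            for dx in range(r):      # column offset inside each block
                out.append([
                    [channel[y * r + dy][x * r + dx] for x in range(width // r)]
                    for y in range(height // r)
                ])
    return out


# Toy 1 x 4 x 4 feature map holding the values 0..15.
toy = [[[row * 4 + col for col in range(4)] for row in range(4)]]
half = pixel_unshuffle(toy, 2)      # 4 channels of size 2 x 2
quarter = pixel_unshuffle(toy, 4)   # 16 channels of size 1 x 1
```

Because the operation is a pure rearrangement, the downsampled features can match the coarser resolutions of a UNet encoder without the information loss of pooling or strided sampling.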
3.2 Diffusion Branch
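Input: the same LR image as the Guidance Branch.
- VAE Encoding: the VAE encoder maps the input image to a latent representation.
- Prompt Features: produced by the Prompt Extractor.
- Single-Step UNet Denoising: the UNet, finetuned with LoRA adapters, is evaluated once at a fixed timestep and receives the cross-scale features from the Guidance Branch.
- Long-Skip Residual: a long-skip residual connection is applied to the UNet output before decoding.
- VAE Decoding: the VAE decoder maps the resulting latent back to pixel space to produce the restored image.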
LoRA adapters are applied for parameter-efficient tuning of all UNet and VAE weights.
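For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update on a single linear layer: the pretrained weight W stays frozen and only two small matrices A and B are trained, giving an effective weight W + (alpha / r) · B A. This is the generic LoRA formulation, not GuideSR's specific configuration. The rank, `alpha`, the helper names, and the toy matrices are all illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a nested-list matrix (rows x cols) by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, frozen_w, lora_a, lora_b, alpha):
    """Linear layer with a LoRA update: y = W x + (alpha / r) * B (A x).

    frozen_w: out x in pretrained weight (kept fixed)
    lora_a:   r x in   trainable down-projection
    lora_b:   out x r  trainable up-projection
    """
    rank = len(lora_a)
    scale = alpha / rank
    base = matvec(frozen_w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy 2 x 3 layer with a rank-1 adapter (all values are illustrative).
x = [1.0, 2.0, 3.0]
frozen_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[0.5, 0.5, 0.5]]        # 1 x 3
lora_b_init = [[0.0], [0.0]]      # B starts at zero, so the layer is unchanged
lora_b_tuned = [[1.0], [-1.0]]    # 2 x 1 after some hypothetical training

y_init = lora_linear(x, frozen_w, lora_a, lora_b_init, alpha=1.0)
y_tuned = lora_linear(x, frozen_w, lora_a, lora_b_tuned, alpha=1.0)
```

With B initialized to zero the adapted layer reproduces the pretrained one exactly, which is why LoRA finetuning can start from the behavior of the pretrained diffusion model. The adapter adds r · (in + out) trainable parameters per layer instead of in · out.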
4. Training Strategy and Loss Functions
Both branches output a prediction, one from the Diffusion Branch and one from the Guidance Branch, and both are supervised jointly using a shared discriminator:
- Restoration Loss per Branch: a weighted sum of three terms:
  - Mean Squared Error (MSE)
  - Learned Perceptual Image Patch Similarity (LPIPS)
  - Adversarial (GAN) loss, computed with the shared discriminator
- Final Loss Function: a weighted combination of the restoration losses of the two branches.
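The structure of this objective can be sketched as follows, assuming the per-branch loss is a weighted sum of the three listed terms and the final loss adds the two branch losses with a relative weight. The weight values below are placeholders, not the paper's settings. The LPIPS and adversarial terms are passed in as precomputed scalars because they require a perceptual network and the shared discriminator, which are outside the scope of this sketch.

```python
def mse(pred, target):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def restoration_loss(pred, target, lpips_term, gan_term, w_lpips, w_gan):
    """Per-branch loss: MSE + w_lpips * LPIPS + w_gan * adversarial term."""
    return mse(pred, target) + w_lpips * lpips_term + w_gan * gan_term


def total_loss(loss_diffusion, loss_guidance, w_guidance):
    """Joint objective over both branches (w_guidance is a placeholder weight)."""
    return loss_diffusion + w_guidance * loss_guidance


target = [0.0, 0.5, 1.0, 0.5]
pred_diffusion = [0.0, 0.5, 1.0, 0.0]   # toy Diffusion Branch output
pred_guidance = [0.5, 0.5, 1.0, 0.5]    # toy Guidance Branch output

loss_d = restoration_loss(pred_diffusion, target, lpips_term=0.25, gan_term=0.5,
                          w_lpips=2.0, w_gan=0.5)
loss_g = restoration_loss(pred_guidance, target, lpips_term=0.5, gan_term=0.5,
                          w_lpips=2.0, w_gan=0.5)
loss = total_loss(loss_d, loss_g, w_guidance=1.0)
```

Supervising both outputs against the same target means the Guidance Branch receives a direct training signal, rather than learning only through the features it passes to the UNet.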
5. Inference Procedure
GuideSR performs inference in a single forward pass: one VAE encode, one UNet evaluation conditioned on the Guidance Branch features, and one VAE decode. Only the Diffusion Branch output is returned.
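A minimal sketch of this single-pass data flow is given below. It is not the authors' pseudocode: every callable (`guidance_branch`, `vae_encode`, `extract_prompt`, `unet`, `vae_decode`) is a hypothetical stand-in, the timestep value is a placeholder, and any long-skip residual is assumed to be handled inside `unet`. The stubs only record call order so the example runs without model weights.

```python
def guidesr_infer(lr_image, guidance_branch, vae_encode, extract_prompt,
                  unet, vae_decode, timestep):
    """Single-pass inference sketch; every callable is a hypothetical stand-in.

    guidance_branch is assumed to return only the multi-scale structural
    features; the Guidance Branch's own image prediction is not returned.
    """
    guide_features = guidance_branch(lr_image)   # full-resolution branch
    latent = vae_encode(lr_image)                # one VAE encode
    prompt = extract_prompt(lr_image)            # prompt features
    restored_latent = unet(latent, timestep, prompt, guide_features)  # one UNet pass
    return vae_decode(restored_latent)           # one VAE decode


# Dummy stand-ins that only record the call order, to show the data flow.
calls = []


def _stub(name, result):
    def fn(*args):
        calls.append(name)
        return result
    return fn


output = guidesr_infer(
    lr_image="lr",
    guidance_branch=_stub("guidance", "features"),
    vae_encode=_stub("encode", "latent"),
    extract_prompt=_stub("prompt", "prompt_features"),
    unet=_stub("unet", "restored_latent"),
    vae_decode=_stub("decode", "sr_image"),
    timestep=0,  # placeholder; the fixed timestep value is not specified here
)
```

In a real implementation the stubs would be replaced by the trained modules; the point of the sketch is that each component is called exactly once per image.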
6. Empirical Performance
GuideSR is evaluated on widely used synthetic (DIV2K-Val) and real-world (DRealSR, RealSR) datasets, with all inference performed in a single UNet step. Comparative quantitative results are summarized as follows:
| Dataset | Method | Steps | PSNR↑ | SSIM↑ | LPIPS↓ | DISTS↓ | FID↓ |
|---|---|---|---|---|---|---|---|
| DIV2K | ResShift | 15 | 24.65 | 0.6181 | 0.3349 | 0.2213 | 36.11 |
| DIV2K | OSEDiff | 1 | 23.72 | 0.6108 | 0.2941 | 0.1976 | 26.32 |
| DIV2K | GuideSR | 1 | 24.76 | 0.6333 | 0.2653 | 0.1879 | 21.04 |
| DRealSR | ResShift | 15 | 28.46 | 0.7673 | 0.4006 | 0.2656 | 172.26 |
| DRealSR | OSEDiff | 1 | 27.92 | 0.7835 | 0.2968 | 0.2165 | 135.30 |
| DRealSR | GuideSR | 1 | 29.85 | 0.8078 | 0.2640 | 0.1960 | 122.06 |
| RealSR | ResShift | 15 | 26.31 | 0.7421 | 0.3460 | 0.2498 | 141.71 |
| RealSR | OSEDiff | 1 | 25.15 | 0.7341 | 0.2921 | 0.2128 | 123.49 |
| RealSR | GuideSR | 1 | 27.08 | 0.7681 | 0.2407 | 0.1878 | 96.83 |
Key empirical findings include:
- On DRealSR, GuideSR improves PSNR by 1.39 dB over ResShift (15 steps), the strongest baseline in the table above, and by 1.93 dB over single-step OSEDiff.
- Substantial FID reduction versus OSEDiff: 13.24 on DRealSR, 26.66 on RealSR.
- Consistent improvements across SSIM, LPIPS, and DISTS.
- Qualitative analysis reveals preservation of fine textures, such as text detail, reflective surfaces, and geometric elements, which competing methods blur or inaccurately hallucinate.
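The headline differences quoted above can be re-derived from the table. The short script below copies the PSNR and FID values for DRealSR and RealSR and recomputes the gaps; the helper names are illustrative.

```python
# Metric values copied from the results table above: (PSNR in dB, FID).
results = {
    "DRealSR": {
        "ResShift": (28.46, 172.26),
        "OSEDiff": (27.92, 135.30),
        "GuideSR": (29.85, 122.06),
    },
    "RealSR": {
        "ResShift": (26.31, 141.71),
        "OSEDiff": (25.15, 123.49),
        "GuideSR": (27.08, 96.83),
    },
}


def psnr_gain_over_best_baseline(dataset):
    """PSNR margin of GuideSR over the strongest listed baseline."""
    rows = results[dataset]
    best = max(psnr for name, (psnr, _) in rows.items() if name != "GuideSR")
    return round(rows["GuideSR"][0] - best, 2)


def fid_reduction_vs(dataset, baseline):
    """FID decrease of GuideSR relative to a named baseline (lower FID is better)."""
    rows = results[dataset]
    return round(rows[baseline][1] - rows["GuideSR"][1], 2)


print(psnr_gain_over_best_baseline("DRealSR"))   # 1.39
print(fid_reduction_vs("DRealSR", "OSEDiff"))    # 13.24
print(fid_reduction_vs("RealSR", "OSEDiff"))     # 26.66
```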
7. Computational Efficiency
Compared to traditional multi-step diffusion methods (e.g., StableSR at 200 steps or DiffBIR at 50 steps), which demand extensive UNet runtimes (seconds to minutes per image on A100 GPUs), GuideSR achieves real-time inference. The addition of the Guidance Branch (typically 8–12 FRBs plus IGN) increases floating-point operations by approximately 10% compared to existing single-step frameworks such as OSEDiff. In practical terms, GuideSR achieves end-to-end inference in approximately 0.3–0.5 s per image on A100 hardware, preserving real-time computational feasibility while advancing restoration fidelity.
8. Significance and Practical Implications
GuideSR advances the state of the art in diffusion-based super-resolution by directly addressing the structural fidelity limitations of prior single-step models. The integration of a dedicated full-resolution Guidance Branch with efficient cross-branch fusion ensures both image faithfulness and perceptual enhancement. Achieving improvements across both reference-based pixel metrics (PSNR, SSIM) and perceptual feature distances (LPIPS, DISTS, FID), GuideSR provides a practical, computationally efficient SR solution suitable for real-world image restoration scenarios (Arora et al., 1 May 2025).