GuideSR: Dual-Branch Diffusion SR

Updated 18 May 2026

GuideSR is a diffusion-based image super-resolution approach that integrates a full-resolution Guidance Branch with a latent diffusion branch to preserve structure and enhance perceptual quality.
It employs a dual-branch design where the Guidance Branch retains fine structural details and the Diffusion Branch, enhanced via LoRA tuning, boosts global perceptual metrics.
GuideSR achieves state-of-the-art performance on multiple benchmarks while offering efficient single-step inference for real-time image restoration.

GuideSR is a single-step diffusion-based image super-resolution (SR) architecture designed for high-fidelity restoration of degraded inputs. Prior single-step methods condition restoration on variational autoencoder (VAE) encodings, often at the cost of structural fidelity. GuideSR instead uses a dual-branch system: a full-resolution Guidance Branch dedicated to structure preservation, and a Diffusion Branch that leverages a pretrained latent diffusion model to enhance perceptual quality. It reports state-of-the-art results across multiple SR benchmarks while keeping inference to a single denoising step (Arora et al., 1 May 2025).

1. Motivation and Limitations of Prior Single-Step Diffusion SR

Image super-resolution (SR) targets the estimation of a high-resolution (HR) image $Y$ from a degraded low-resolution (LR) input $I$. Diffusion-based SR methods, such as SR3, StableSR, and DiffBIR, have established diffusion priors as powerful tools for this task, but their requirement for tens to hundreds of denoising steps renders them impractical for real-time applications. Recent single-step approaches, including SinSR and OSEDiff, compress this process to a single UNet denoiser pass by conditioning on a VAE-encoded latent of the LR image. This strategy has two limitations: aggressive VAE downsampling (e.g., $8\times$) erases fine structural content, and the VAE, having been trained on high-quality data, is ill-equipped for highly degraded sources. Consequently, these models tend to hallucinate textures at the expense of image-specific detail. GuideSR addresses these deficits by computing guidance features at full spatial resolution and fusing this structural information directly into the restoration pipeline.

2. Architecture Overview

GuideSR adopts a dual-branch paradigm. Both branches are trained jointly and supervised by a shared adversarial discriminator. At inference, only the Diffusion Branch output is returned; the Guidance Branch still runs to supply structural features to the UNet. The two branches are:

Guidance Branch: Maintains full spatial resolution to retain structural fidelity.
Diffusion Branch: Utilizes a latent-space diffusion model to boost perceptual metrics.

The domain and key components of each branch are summarized below:

Branch	Domain	Key Components
Guidance Branch	Pixel-space	FRBs, Channel Attention, IGN, PixelUnshuffle
Diffusion Branch	VAE Latent Space	Stable Diffusion Turbo v2.1 VAE, Prompt Extractor, UNet, LoRA finetuning

This division enables the model to harness both detailed structure from the original LR input and global, perceptually plausible enhancements from diffusion modeling.

3. Architectural Details

3.1 Guidance Branch

Input: $I\in\mathbb{R}^{H\times W\times3}$

Feature Extraction: $F_0 = \operatorname{Conv}_{3\rightarrow C}(I)$
Deep Encoding: Sequential Full Resolution Blocks (FRBs) with residual-in-residual design and channel attention,

$F_d = \operatorname{FRGNet}(F_0) = \operatorname{FRB}_n\circ\cdots\circ\operatorname{FRB}_1(F_0)$

Each FRB:

$\operatorname{FRB}(X) = X + \operatorname{Conv}_{M\to M}(X\odot a(X)), \quad a(X) = \sigma\Bigl(W_2\,\operatorname{GELU}(W_1\,\operatorname{GAP}(X))\Bigr)$

Image Guidance Network (IGN): Guided attention mechanism,

$A = \sigma(\operatorname{Conv}_{2C\to 2C}(F_d)), \quad G = \operatorname{Conv}_{2C\to 2C}(F_d), \quad F_r = F_d + A \odot G$

Output residual image:

$R_2 = \operatorname{Conv}_{2C\to3}(F_r) + I$

Cross-Scale Feature Fusion: Downsampling via pixel-unshuffle to create multi-scale structural features:

$F'_r = \operatorname{PixelUnshuffle}(F_r, s)$

These are concatenated into the UNet encoder of the Diffusion Branch.
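The Guidance Branch equations above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: every convolution is simplified to a 1x1 convolution (a per-pixel matrix product), a single channel width `m` is used throughout instead of the $C$, $M$, and $2C$ widths in the formulas, and the helper names (`conv1x1`, `frb`, `ign`, `init_params`, `guidance_branch`), the channel-reduction factor, and the random weights are all assumptions introduced for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def conv1x1(x, w):
    """1x1 convolution on an (H, W, C_in) array with weights of shape (C_in, C_out)."""
    return x @ w


def frb(x, w1, w2, w_conv):
    """FRB(X) = X + Conv(X * a(X)), with a(X) = sigmoid(W2 GELU(W1 GAP(X)))."""
    gap = x.mean(axis=(0, 1))             # global average pool -> (M,)
    a = sigmoid(w2 @ gelu(w1 @ gap))      # channel attention weights -> (M,)
    return x + conv1x1(x * a, w_conv)     # residual path keeps full resolution


def ign(f_d, w_a, w_g):
    """F_r = F_d + A * G, with A = sigmoid(Conv(F_d)) and G = Conv(F_d)."""
    return f_d + sigmoid(conv1x1(f_d, w_a)) * conv1x1(f_d, w_g)


def pixel_unshuffle(x, s):
    """Fold each s x s spatial block into channels: (H, W, C) -> (H/s, W/s, C*s*s)."""
    h, w, c = x.shape
    x = x.reshape(h // s, s, w // s, s, c).transpose(0, 2, 4, 1, 3)
    return x.reshape(h // s, w // s, c * s * s)


def guidance_branch(img, params, s=2):
    f = conv1x1(img, params["w_in"])                  # F_0
    for w1, w2, w_conv in params["frbs"]:             # F_d after n FRBs
        f = frb(f, w1, w2, w_conv)
    f_r = ign(f, params["w_a"], params["w_g"])        # F_r
    r2 = conv1x1(f_r, params["w_out"]) + img          # R_2 = Conv(F_r) + I
    return r2, pixel_unshuffle(f_r, s)                # F'_r for the UNet encoder


def init_params(m=8, n_blocks=2, reduction=2, seed=0):
    # Hypothetical random weights, only to make the sketch runnable.
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.1, shape)
    frbs = [(w(m // reduction, m), w(m, m // reduction), w(m, m)) for _ in range(n_blocks)]
    return {"w_in": w(3, m), "frbs": frbs, "w_a": w(m, m), "w_g": w(m, m), "w_out": w(m, 3)}


img = np.random.default_rng(1).random((16, 16, 3))
r2, f_prime = guidance_branch(img, init_params(), s=2)
print(r2.shape, f_prime.shape)  # (16, 16, 3) (8, 8, 32)
```

Note that pixel-unshuffle only rearranges values: a factor of `s` trades spatial resolution for `s * s` times as many channels, so no structural information from $F_r$ is discarded before fusion.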

3.2 Diffusion Branch

Input: $I\in\mathbb{R}^{H\times W\times3}$ again

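VAE Encoding: $z = \operatorname{Enc}(I)$, yielding a latent of shape $\frac{H}{8}\times\frac{W}{8}\times C_z$
Prompt Features: $c = \operatorname{PromptExtractor}(I)$
Single-Step UNet Denoising: Finetuned with LoRA adapters at a fixed timestep $t$, with cross-scale features from the Guidance Branch,

$\hat{z} = \operatorname{UNet}(z, c, t;\, F'_r)$

Long-Skip Residual: applied to the UNet output $\hat{z}$ before decoding, giving $\tilde{z}$
VAE Decoding: $R_1 = \operatorname{Dec}(\tilde{z})$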

LoRA adapters are applied for parameter-efficient tuning of all UNet and VAE weights.
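For readers unfamiliar with LoRA, the following NumPy sketch shows the general low-rank update on a single linear layer. It illustrates the standard LoRA formulation, not GuideSR's specific configuration: the class name `LoRALinear`, the rank, the scaling `alpha`, and the initialization are assumptions chosen for the example.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        out_features, in_features = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                  # frozen pretrained weight
        self.a = rng.normal(0.0, 0.01, (rank, in_features))   # trainable, small random init
        self.b = np.zeros((out_features, rank))               # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def trainable_params(self):
        return self.a.size + self.b.size


w = np.ones((320, 320))
layer = LoRALinear(w, rank=4)
print(layer.trainable_params(), w.size)  # 2560 102400
```

Because `b` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `rank * (in_features + out_features)` parameters are trained per layer.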

4. Training Strategy and Loss Functions

Both branches output predictions ($R_1$ from the Diffusion Branch, $R_2$ from the Guidance Branch) and are jointly supervised by a shared discriminator:

Restoration Loss per Branch:

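$\mathcal{L}_{\text{rest}}(R_k) = \lambda_{\text{MSE}}\,\mathcal{L}_{\text{MSE}}(R_k, Y) + \lambda_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}}(R_k, Y) + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}}(R_k)$

with $k=1$ for the Diffusion Branch output, $k=2$ for the Guidance Branch output, and scalar weights $\lambda_{\text{MSE}}$, $\lambda_{\text{LPIPS}}$, $\lambda_{\text{GAN}}$. Terms include:

Mean Squared Error (MSE)
Learned Perceptual Image Patch Similarity (LPIPS)
Adversarial (GAN) loss, computed from the shared discriminator

Final Loss Function:

$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{rest}}(R_1) + \alpha\,\mathcal{L}_{\text{rest}}(R_2)$

with $\alpha$ weighting the Guidance Branch term.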

5. Inference Procedure

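GuideSR performs inference in a single pass. Following the definitions in Section 3, the steps are:

Run the Guidance Branch on $I$ (convolution, FRBs, IGN) to obtain the structural features $F_r$.
Apply pixel-unshuffle to $F_r$ to obtain the cross-scale features $F'_r$.
Encode $I$ with the VAE and extract prompt features from $I$.
Run one UNet denoising pass at the fixed timestep, with $F'_r$ concatenated into the UNet encoder.
Apply the long-skip residual and decode with the VAE to obtain the restored image.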

Inference requires a single UNet pass plus one VAE encode and one VAE decode, together with the Guidance Branch forward pass, and returns only the Diffusion Branch output $R_1$.
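The control flow of single-pass inference can be sketched as a plain Python function that wires the stages together. The function name `guidesr_infer` and all six callables are hypothetical stand-ins for the real networks, and the long-skip residual is assumed to live inside `unet_step`; the toy lambdas below only demonstrate the call order.

```python
def guidesr_infer(lr, guidance_features, vae_encode, prompt_extractor, unet_step, vae_decode):
    """Single-pass inference: every stage is called exactly once."""
    feats = guidance_features(lr)     # full-resolution structural features (F'_r)
    z = vae_encode(lr)                # latent of the LR input
    c = prompt_extractor(lr)          # prompt conditioning
    z_hat = unet_step(z, c, feats)    # one denoising pass at the fixed timestep
    return vae_decode(z_hat)          # Diffusion Branch output


# Toy scalar stand-ins, used only to trace the data flow.
calls = {"unet": 0}


def toy_unet(z, c, feats):
    calls["unet"] += 1
    return z + feats


out = guidesr_infer(
    lr=4.0,
    guidance_features=lambda x: 0.5 * x,
    vae_encode=lambda x: x / 8,
    prompt_extractor=lambda x: "prompt",
    unet_step=toy_unet,
    vae_decode=lambda z: 8 * z,
)
print(out, calls["unet"])  # 20.0 1
```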

6. Empirical Performance

GuideSR is evaluated on widely used synthetic (DIV2K-Val) and real-world (DRealSR, RealSR) datasets, with all inference performed in a single UNet step. Comparative quantitative results are summarized as follows:

Dataset	Method	Steps	PSNR↑	SSIM↑	LPIPS↓	DISTS↓	FID↓
DIV2K	ResShift	15	24.65	0.6181	0.3349	0.2213	36.11
	OSEDiff	1	23.72	0.6108	0.2941	0.1976	26.32
	GuideSR	1	24.76	0.6333	0.2653	0.1879	21.04
DRealSR	ResShift	15	28.46	0.7673	0.4006	0.2656	172.26
	OSEDiff	1	27.92	0.7835	0.2968	0.2165	135.30
	GuideSR	1	29.85	0.8078	0.2640	0.1960	122.06
RealSR	ResShift	15	26.31	0.7421	0.3460	0.2498	141.71
	OSEDiff	1	25.15	0.7341	0.2921	0.2128	123.49
	GuideSR	1	27.08	0.7681	0.2407	0.1878	96.83

Key empirical findings include:

On DRealSR, GuideSR exceeds the strongest baseline listed above (ResShift, 15 steps) by 1.39 dB in PSNR, and the single-step OSEDiff by 1.93 dB.
Substantial FID reduction versus OSEDiff: 13.24 on DRealSR, 26.66 on RealSR.
Consistent improvements across SSIM, LPIPS, and DISTS.
Qualitative analysis reveals preservation of fine textures, such as text detail, reflective surfaces, and geometric elements, which competing methods blur or inaccurately hallucinate.
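The headline differences quoted above can be checked directly against the table. The snippet below copies the PSNR and FID values for DRealSR and RealSR from the results table; the dictionary layout and function names are just one convenient arrangement for the example.

```python
# (PSNR in dB, FID) copied from the results table above
RESULTS = {
    "DRealSR": {"ResShift": (28.46, 172.26), "OSEDiff": (27.92, 135.30), "GuideSR": (29.85, 122.06)},
    "RealSR": {"ResShift": (26.31, 141.71), "OSEDiff": (25.15, 123.49), "GuideSR": (27.08, 96.83)},
}


def psnr_gain_over_best_baseline(dataset):
    """GuideSR PSNR minus the highest baseline PSNR on the given dataset."""
    rows = RESULTS[dataset]
    best = max(psnr for name, (psnr, _) in rows.items() if name != "GuideSR")
    return round(rows["GuideSR"][0] - best, 2)


def fid_reduction(dataset, baseline="OSEDiff"):
    """Baseline FID minus GuideSR FID (positive means GuideSR is better)."""
    rows = RESULTS[dataset]
    return round(rows[baseline][1] - rows["GuideSR"][1], 2)


print(psnr_gain_over_best_baseline("DRealSR"))  # 1.39
print(fid_reduction("DRealSR"))                 # 13.24
print(fid_reduction("RealSR"))                  # 26.66
```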

7. Computational Efficiency

Compared to traditional multi-step diffusion methods (e.g., StableSR at 200 steps or DiffBIR at 50 steps), which demand extensive UNet runtimes (seconds to minutes per image on A100 GPUs), GuideSR achieves real-time inference. The addition of the Guidance Branch (typically 8–12 FRBs plus IGN) increases floating-point operations by approximately 10% compared to existing single-step frameworks such as OSEDiff. In practical terms, GuideSR achieves end-to-end inference in approximately 0.3–0.5 s per image on A100 hardware, preserving real-time computational feasibility while advancing restoration fidelity.

8. Significance and Practical Implications

GuideSR advances the state of the art in diffusion-based super-resolution by directly addressing the structural fidelity limitations of prior single-step models. The integration of a dedicated full-resolution Guidance Branch with efficient cross-branch fusion ensures both image faithfulness and perceptual enhancement. Achieving improvements across both reference-based pixel metrics (PSNR, SSIM) and perceptual feature distances (LPIPS, DISTS, FID), GuideSR provides a practical, computationally efficient SR solution suitable for real-world image restoration scenarios (Arora et al., 1 May 2025).

References (1)

Arora et al. GuideSR: Rethinking Guidance for One-Step High-Fidelity Diffusion-Based Super-Resolution (1 May 2025)