
Wavelet-Based Diffusion Sampling

Updated 1 July 2025
  • Wavelet-based diffusion sampling integrates wavelet transforms into diffusion model pipelines to guide sampling and enhance specific frequency components like high-frequency details.
  • This approach enables training-free generation of ultra-high-resolution images (up to 8K) by methods like HiWave, improving detail and preserving global structure.
  • It effectively mitigates common artifacts such as object duplication, blurry textures, and boundary discontinuities often seen in prior patch-based or naive upscaling methods.

Wavelet-based diffusion sampling refers to frameworks and algorithms that leverage wavelet transforms within the sampling, training, or inference procedures of diffusion models. HiWave exemplifies this approach in the context of high-resolution image generation, introducing a principled pipeline that addresses key limitations of zero-shot and patch-based diffusion upscaling through explicit frequency-domain manipulation.

1. HiWave Methodology: Structure and Frequency Awareness

HiWave implements a training-free, zero-shot pipeline for synthesizing ultra-high-resolution images (e.g., up to 8K) using only pretrained diffusion models, such as Stable Diffusion XL (SDXL). The method is designed to:

  • Preserve global composition and semantic coherence across the large field of view.
  • Enhance local detail and texture at scales beyond the model's native resolution.
  • Mitigate common artifacts including object duplication and boundary discontinuities encountered in prior patch-wise or naive upscaling approaches.

The methodology consists of two principal stages:

(a) Base Image Generation and Upscaling

  1. Low-Resolution Synthesis: A base image is first generated at the maximum resolution supported by the pretrained model (typically 1024×1024).
  2. Image-space Upscaling: The base image is upscaled using high-fidelity interpolation (e.g., Lanczos) to the ultra-high target resolution (e.g., 4096×4096), deliberately avoiding latent (VAE) upscaling, which produces artifacts.
  3. Latent Preparation: The upscaled image is encoded back into the diffusion model’s latent space via the VAE encoder for subsequent refinement.
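
The stage just outlined can be sketched in a few lines of Python. The example below assumes the public diffusers SDXL pipeline; the prompt, resolutions, and the direct VAE encode are illustrative placeholders rather than HiWave's exact implementation.

```python
import numpy as np
import torch
from diffusers import StableDiffusionXLPipeline
from PIL import Image

# Stage (a) sketch: native-resolution synthesis, image-space Lanczos upscaling,
# then re-encoding into the VAE latent space for later patch-wise refinement.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photorealistic mountain landscape at dawn"  # illustrative prompt
base = pipe(prompt, height=1024, width=1024).images[0]  # native-resolution seed

# Upscale in image space (not latent space) with a high-fidelity Lanczos filter.
target = 4096
upscaled = base.resize((target, target), Image.LANCZOS)

# Encode the upscaled image back into the latent space. In practice encoding a
# 4096x4096 image may require tiled encoding and/or fp32 for numerical stability.
x = torch.from_numpy(np.array(upscaled)).permute(2, 0, 1).float() / 127.5 - 1.0
x = x.unsqueeze(0).to("cuda", dtype=torch.float16)
with torch.no_grad():
    latents = pipe.vae.encode(x).latent_dist.sample() * pipe.vae.config.scaling_factor
```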

(b) Patch-wise DDIM Inversion with Wavelet Domain Detail Enhancement

  1. Patch-based Inversion: The upscaled image is partitioned into overlapping patches. For each, DDIM inversion retrieves a matching noise vector, establishing initial conditions that are globally consistent with the upscaled structure.
  2. Wavelet-Based Detail Enhancement:
    • At each reverse sampling step, HiWave applies a Discrete Wavelet Transform (DWT) to each patch’s current prediction.
    • Low-frequency (LF) bands are preserved from the base conditional output, stabilizing global composition.
    • High-frequency (HF) bands are selectively guided using an enhanced version of classifier-free guidance: HF coefficients are interpolated between unconditional and conditional predictions, promoting higher detail and realism.
    • The inverse DWT reconstructs the patch, which is then harmonized with neighbors for seamless blending.
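
To make the patch-wise processing concrete, the sketch below splits a latent into overlapping tiles and recombines processed tiles by averaging their overlaps. The patch size, stride, and simple averaging blend are illustrative assumptions, not HiWave's exact harmonization scheme.

```python
import torch

def split_patches(latent, patch=128, stride=96):
    # latent: (C, H, W) tensor -> list of (y, x, tile) with overlapping tiles.
    _, H, W = latent.shape
    ys = list(range(0, H - patch, stride)) + [H - patch]
    xs = list(range(0, W - patch, stride)) + [W - patch]
    return [(y, x, latent[:, y:y + patch, x:x + patch]) for y in ys for x in xs]

def merge_patches(tiles, shape, patch=128):
    # Accumulate processed tiles and divide by per-pixel coverage, so overlapping
    # regions are averaged and neighbouring patches blend without visible seams.
    out = torch.zeros(shape)
    count = torch.zeros(shape[1:])
    for y, x, tile in tiles:
        out[:, y:y + patch, x:x + patch] += tile
        count[y:y + patch, x:x + patch] += 1.0
    return out / count.clamp(min=1.0)

# Round-trip check on a dummy latent (identity processing per patch).
z = torch.randn(4, 512, 512)
restored = merge_patches(split_patches(z), z.shape)
```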

2. Mathematical Foundation of Frequency-Guided Sampling

At every patch and step, denoiser outputs are decomposed as:

\mathrm{DWT}(D_c(x)) = \{L,\; D_c^H(x),\; D_c^V(x),\; D_c^D(x)\}

\mathrm{DWT}(D_u(x)) = \{L,\; D_u^H(x),\; D_u^V(x),\; D_u^D(x)\}

The frequency-guided sample prediction is:

\begin{align*}
D_L &= L \\
D_H &= D_u^H(x) + w_d\,(D_c^H(x) - D_u^H(x)) \\
D_V &= D_u^V(x) + w_d\,(D_c^V(x) - D_u^V(x)) \\
D_D &= D_u^D(x) + w_d\,(D_c^D(x) - D_u^D(x))
\end{align*}

with reconstructed prediction:

D = \mathrm{iDWT}(D_L, D_H, D_V, D_D)

Where:

  • D_c(x): Conditional (prompted) denoiser output.
  • D_u(x): Unconditional denoiser output.
  • w_d: Frequency-band-specific guidance scale (typically larger than the standard guidance scale, e.g., 10–20 for the HF bands).
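
A minimal sketch of this frequency-guided blend, written with PyWavelets (pywt) on a single-channel patch; the Haar wavelet and the function name wavelet_guided_blend are illustrative choices, not taken from the paper.

```python
import numpy as np
import pywt

def wavelet_guided_blend(d_cond, d_uncond, w_d=15.0, wavelet="haar"):
    """Blend conditional and unconditional denoiser outputs in the wavelet domain.

    The low-frequency band is kept as-is, while the high-frequency bands (H, V, D)
    are pushed toward the conditional prediction with a CFG-style scale w_d.
    d_cond, d_uncond: 2-D numpy arrays (one channel of a denoiser prediction).
    """
    L, (cH_c, cV_c, cD_c) = pywt.dwt2(d_cond, wavelet)
    _, (cH_u, cV_u, cD_u) = pywt.dwt2(d_uncond, wavelet)

    D_H = cH_u + w_d * (cH_c - cH_u)
    D_V = cV_u + w_d * (cV_c - cV_u)
    D_D = cD_u + w_d * (cD_c - cD_u)

    return pywt.idwt2((L, (D_H, D_V, D_D)), wavelet)

# Example: blend two dummy "predictions" for a 64x64 patch.
d_c = np.random.randn(64, 64)
d_u = np.random.randn(64, 64)
blended = wavelet_guided_blend(d_c, d_u)
```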

Skip-residuals: Initial steps blend the inverted latent with the current state via a cosine-decayed schedule, further stabilizing structure.
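
One plausible form of such a schedule is sketched below; the cutoff fraction and the exact cosine shape are assumptions for illustration, not HiWave's published values.

```python
import math

def skip_residual_weight(step, total_steps, frac=0.25):
    """Cosine-decayed weight for blending the DDIM-inverted latent into the
    current state during the earliest reverse steps (illustrative schedule)."""
    cutoff = int(frac * total_steps)
    if step >= cutoff:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * step / cutoff))

# x_t = w * x_inverted_t + (1 - w) * x_current_t, with w = skip_residual_weight(step, T)
```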

3. Evaluation: Visual Quality and User Preference

Quantitative and Qualitative Performance

  • Standard metrics (FID, KID, IS, CLIP, LPIPS, HPS-v2) show HiWave consistently matches or surpasses state-of-the-art baselines (e.g., Pixelsmith, DemoFusion) at 2048² and 4096² resolutions.
  • Qualitative inspection reveals:
    • Substantially sharper textures and more realistic details in HF regions.
    • Preserved global semantic layout, without object duplication or patch boundaries typical of prior methods.

User Study Results

In a large-scale human preference study (32 image pairs at 4096×4096), HiWave was preferred over Pixelsmith in more than 80% of cases, including several unanimous preferences, confirming the substantial perceptual gains of wavelet-domain detail enhancement.

4. Implications for Zero-shot High-Resolution Synthesis

Applications

  • Creative industries: Generation of photorealistic or stylized images at poster, billboard, or cinematic resolutions.
  • Super-resolution and restoration: HiWave enables upscaling both AI-generated and real images, prioritizing semantic structure and detail.
  • Prompt-adaptive upscaling: Prompt information can robustly influence the fine details through classifier-free conditional guidance in the wavelet domain.

Scientific and Technical Advances

  • Training-free upscaling: HiWave extends the expressive range of pretrained models without any finetuning or architectural modification, enabling resource-efficient deployment.
  • Separation of structure and detail: The DWT-based approach instantiates a rigorous means to prioritize structural consistency (LF) while adaptively boosting realism where it is needed (HF).
  • Potential in scalable generative design: The frequency-based guidance paradigm may be generalized for other tasks such as inpainting, restoration, or domain adaptation.

5. Workflow Summary Table

| Stage | Functionality | Significance |
|---|---|---|
| Base image generation + upscaling | Generates a semantically faithful, high-resolution seed image | Avoids upsampling artifacts, maintains coherence |
| Patch-wise DDIM inversion | Aligns each region with the global structure | Prevents patch seams and object duplication |
| Wavelet-based enhancement | LF → structure, HF → detail guidance | Balances realism and fidelity, prevents blurring |
| Progressive multi-stage | Allows stepwise (1024→2048→4096) upscaling | Ensures stability and maximizes visual richness |

6. Conclusion

Wavelet-based diffusion sampling in HiWave establishes a principled frequency-aware enhancement paradigm for ultra-high-resolution image synthesis. By explicitly decoupling and selectively guiding low- and high-frequency components at each denoising step, HiWave achieves perceptually superior images that retain both global semantic structure and fine-grained realism. The method is fully training-free and compatible with existing pretrained diffusion models, offering a practical, scalable solution for next-generation image generation tasks.