
HiWave: Training-Free High-Res Image Gen

Updated 1 July 2025
  • The paper demonstrates that HiWave’s two-stage pipeline overcomes typical high-resolution artifacts without modifying pretrained models.
  • It combines patchwise DDIM inversion with wavelet-domain frequency guidance to ensure global coherence and enhance fine local details.
  • Experimental results reveal superior perceptual quality and a strong user preference for ultra-high-res outputs up to 4096×4096.

HiWave is a training-free, zero-shot high-resolution image synthesis method based on pretrained diffusion models, designed to overcome common artifacts—such as object duplication and spatial incoherence—when generating images far beyond the training resolution of a given model. The HiWave framework employs a two-stage sampling pipeline, combining patch-wise deterministic diffusion inversion and wavelet-domain frequency-selective guidance to produce globally coherent and detail-rich images at extreme resolutions, such as 4096×4096, without retraining or modifying existing model architectures (arXiv:2506.20452).

1. Algorithmic Pipeline: Patchwise Inversion and Frequency-Selective Enhancement

HiWave’s methodology consists of the following two stages:

  1. Stage 1 — Base Image Synthesis and Upscaling: A base image is generated with the pretrained diffusion model at its canonical resolution (e.g., 1024² for SDXL). This image is upscaled (e.g., with Lanczos interpolation) in pixel space to the target high resolution (>2K or 4K) and re-encoded into the latent space. This procedure preserves global semantics and structural layout but lacks fine detail at large output sizes.
  2. Stage 2 — Patchwise DDIM Inversion and Wavelet-Based Synthesis: The upscaled latent is divided into overlapping patches; each patch undergoes DDIM inversion to recover its initial noise vector $z_T^s$ while preserving large-scale structure. During forward sampling for each patch, a wavelet-based detail-enhancer module guides the denoiser output at each step.

Specifically:

  • Low-frequency coefficients are preserved from the upscaled base, transmitting structural consistency across the canvas.
  • High-frequency coefficients are enhanced using classifier-free guidance, promoting texture and detail without introducing global artifacts or repetition.
  • Overlapping regions mitigate boundary effects, and early denoising steps may interpolate between inverted and sampled latents (skip residuals) to stabilize structure.
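The tiling step of this pipeline can be sketched as follows. This is a minimal illustration: the function name `overlapping_patches` and its stride convention are assumptions, not the paper's exact tiling scheme, and Stage 1's Lanczos upscaling (ordinary image resampling, e.g. via Pillow's `Image.resize(..., Image.LANCZOS)`) is omitted.

```python
import numpy as np

def overlapping_patches(latent: np.ndarray, patch: int, stride: int):
    """Split a C×H×W latent into overlapping patch views.

    Returns (top, left, view) tuples; overlap = patch - stride.
    Overlapping regions are later blended to hide patch seams.
    """
    H, W = latent.shape[-2:]
    tops = list(range(0, H - patch + 1, stride))
    lefts = list(range(0, W - patch + 1, stride))
    # Ensure the final row/column of patches reaches the border.
    if tops[-1] != H - patch:
        tops.append(H - patch)
    if lefts[-1] != W - patch:
        lefts.append(W - patch)
    return [(t, l, latent[..., t:t + patch, l:l + patch])
            for t in tops for l in lefts]
```

Each returned view is then inverted and re-sampled independently, with the overlaps averaged back into the full canvas.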

2. Technical Formulation

a. Patchwise DDIM Inversion

For a reference image patch $x_{\text{ref}}$:

  • Forward DDIM inversion is used to obtain the starting noise vector $z_T^s$ that reconstructs $x_{\text{ref}}$ under the pretrained model.
  • This ensures that reverse denoising in the subsequent sampling phase starts with a latent embedding that encodes the patch’s desired global semantics.
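Deterministic DDIM inversion can be sketched as below. The `eps_model` callable is a stand-in for the pretrained noise predictor, and the schedule is illustrative; the paper's exact step count and schedule may differ.

```python
import numpy as np

def ddim_invert(x0, eps_model, alphas_bar):
    """Run the DDIM update in reverse to recover the noise latent z_T^s
    that deterministically reconstructs x0 under the frozen model.

    alphas_bar: cumulative noise schedule, decreasing from ~1 toward 0.
    eps_model(x, t): stand-in for the pretrained noise predictor.
    """
    x = x0
    for t in range(len(alphas_bar) - 1):
        a_t, a_next = alphas_bar[t], alphas_bar[t + 1]
        eps = eps_model(x, t)
        # Predict the clean latent, then step *toward* higher noise.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x  # z_T^s: the starting latent for patchwise resampling
```

Because the update is deterministic, running standard DDIM sampling forward from the returned $z_T^s$ reproduces the patch, which is what anchors each patch's global semantics.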

b. Wavelet-Domain Conditional Guidance

Let $D_c(x)$ be the conditional (text-conditioned) denoiser output and $D_u(x)$ the unconditional prediction. The 2D Discrete Wavelet Transform (DWT) decomposes each output into $L$ (low-frequency), $H$ (horizontal-detail), $V$ (vertical-detail), and $D$ (diagonal-detail) subbands:

\begin{align*}
\text{DWT}(D_c(x)) &= \{ D_c^L, D_c^H, D_c^V, D_c^D \} \\
\text{DWT}(D_u(x)) &= \{ D_u^L, D_u^H, D_u^V, D_u^D \}
\end{align*}

HiWave constructs a frequency-guided denoiser:

\begin{align*}
D^L &= D_c^L \\
D^H &= D_u^H + w_d (D_c^H - D_u^H) \\
D^V &= D_u^V + w_d (D_c^V - D_u^V) \\
D^D &= D_u^D + w_d (D_c^D - D_u^D)
\end{align*}

with $w_d$ controlling the high-frequency guidance strength. The final output is reconstructed via the inverse DWT.

This selective strategy ensures that global (structural) information is preserved directly from the base image, while local (textural) features are adaptively synthesized, mitigating the common trade-off between coherence and detail.
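The frequency-selective guidance can be illustrated with a one-level Haar DWT. This hand-rolled transform is a stand-in: the paper's actual wavelet family and decomposition depth may differ, but the band-selective blend is the same.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar DWT -> (low, horiz, vert, diag) subbands."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(L, H, V, D):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    h, w = L.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (L + H + V + D) / 2
    x[0::2, 1::2] = (L - H + V - D) / 2
    x[1::2, 0::2] = (L + H - V - D) / 2
    x[1::2, 1::2] = (L - H - V + D) / 2
    return x

def wavelet_guided(D_c, D_u, w_d):
    """Keep the conditional low band; apply CFG only to detail bands."""
    cL, cH, cV, cD = haar_dwt2(D_c)
    uL, uH, uV, uD = haar_dwt2(D_u)
    return haar_idwt2(cL,
                      uH + w_d * (cH - uH),
                      uV + w_d * (cV - uV),
                      uD + w_d * (cD - uD))
```

Note that setting $w_d = 1$ recovers the plain conditional prediction, while larger $w_d$ extrapolates only the detail bands, leaving the low band untouched.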

c. Residual Mixing (Skip Residuals)

In early denoising steps ($t < \tau$), HiWave interpolates between the current latent $z_t$ and the DDIM-inverted latent $z_t^s$:

\[
\hat{z}_t = \begin{cases} c_1 \, z_t + (1 - c_1)\, z_t^s, & t < \tau \\ z_t, & t \geq \tau \end{cases}
\]

with $c_1$ following a cosine decay.
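A sketch of the skip-residual mix follows. The exact cosine-decay schedule for $c_1$ is not spelled out here, so the form below is an assumption for illustration.

```python
import numpy as np

def skip_residual(z_t, z_t_inv, t, tau):
    """Blend the sampled latent z_t with the DDIM-inverted latent z_t_inv
    during the early steps (t < tau); later steps pass z_t through."""
    if t >= tau:
        return z_t
    # Hypothetical cosine decay: c1 goes 1 -> 0 as t goes 0 -> tau.
    c1 = 0.5 * (1.0 + np.cos(np.pi * t / tau))
    return c1 * z_t + (1.0 - c1) * z_t_inv
```

The inverted latent thus anchors structure only while the overall layout is being decided, after which sampling proceeds unconstrained.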

3. Experimental Validation and Comparison

Quantitative and qualitative results using Stable Diffusion XL at resolutions up to $4096 \times 4096$ pixels demonstrate:

  • Superior perceptual quality: HiWave yields globally coherent, sharp, and artifact-free images, outperforming both direct inference (which suffers from repetition and blurring) and contemporary patch-based methods (e.g., Pixelsmith, DemoFusion, HiDiffusion) that commonly exhibit seam artifacts or duplicated content in high-res generations.
  • Quantitative metrics (summarized for $4096^2$):
    • FID: HiWave 64.7, Pixelsmith 62.6, HiDiffusion 93.4
    • IS: HiWave 18.8, Pixelsmith 19.4, HiDiffusion 14.7
    • CLIP and LPIPS are competitive with, or better than, strong baselines, though standard benchmarks may underestimate visual quality due to downsampling for evaluation.
  • User study: HiWave was preferred in 81.2% of 548 A/B test responses, showing clear subjective superiority over Pixelsmith at 4K resolution.

4. Implications and Significance

HiWave’s approach directly addresses two central limitations of prior training-free high-resolution generation:

  • Global structure preservation: By anchoring sampling in DDIM-inverted noise for all patches and strictly preserving low-frequency wavelet coefficients, HiWave ensures semantic layout continuity.
  • Local detail enhancement: High-frequency wavelet guidance via classifier-free interpolation enables photorealistic detail without sacrificing global arrangement or inducing seams, even across overlapping patches.

Distinct from methods such as dispersed convolutions or dilated U-Net modifications (e.g., in ScaleCrafter, HiDiffusion), HiWave requires no modification or retraining of the backbone model, applying exclusively at the sampling/inference stage.

5. Broader Applications and Future Prospects

Use-cases enabled by HiWave’s training-free paradigm include:

  • Generation of printable, billboard-scale, or large-format media for film, design, and advertising.
  • Photorealistic digital imagery in fine arts, scientific visualization, or satellite imaging, where both intricate texture and structural layout are necessary.
  • Upgrading or enhancing real photographs, as HiWave's upscaling pipeline produces detailed, artifact-free results beyond what typical super-resolution models achieve.

A plausible implication is that this frequency-aware patchwise synthesis—potentially with tailored frequency partitioning or extended guidance mechanisms—opens further research directions for training-free, high-resolution image or even video synthesis in resource-constrained environments, removing the bottleneck of retraining for every new scale or application.

6. Comparison Table: HiWave vs. Prior Methods

Method            Artifact-Free Structure  Boundary Artifacts  Detail Fidelity  Retraining Needed
HiWave            Yes                      No                  High             No
DemoFusion        Sometimes                Yes                 Moderate         No
HiDiffusion       Sometimes                No                  Lower            No
Pixelsmith        No (duplication)         Yes                 Moderate         No
Direct Inference  No (severe artifacts)    No                  Low              No

7. Limitations and Considerations

  • While HiWave’s patch-based approach with wavelet guidance suppresses patch seams and maintains global composition, extremely weak or erroneous base images (Stage 1 output) may still limit synthesis fidelity.
  • HiWave’s user-study preference may be influenced by the prompt and resolution distribution; while the reported 81.2% preference is consistent across the 32 test prompts, performance may vary with different subject matter or composition.

8. Conclusion

HiWave establishes a new standard for training-free, ultra-high-resolution image synthesis with diffusion models. Its wavelet-guided, patchwise DDIM inversion pipeline enables the synthesis of images with globally consistent structure and photorealistic details at resolutions previously unattainable without retraining. HiWave's approach demonstrates the effectiveness and generalizability of frequency-domain conditional guidance in scalable generative modeling, and its training-free, modular design offers immediate practical benefit for researchers and industry practitioners seeking high fidelity image generation at extreme resolutions.

References

  1. HiWave: Training-Free High-Resolution Image Generation (arXiv:2506.20452).