HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
(2506.20452v1)
Published 25 Jun 2025 in cs.CV and cs.LG
Abstract: Diffusion models have emerged as the leading approach for image synthesis, demonstrating exceptional photorealism and diversity. However, training diffusion models at high resolutions remains computationally prohibitive, and existing zero-shot generation techniques for synthesizing images beyond training resolutions often produce artifacts, including object duplication and spatial incoherence. In this paper, we introduce HiWave, a training-free, zero-shot approach that substantially enhances visual fidelity and structural coherence in ultra-high-resolution image synthesis using pretrained diffusion models. Our method employs a two-stage pipeline: generating a base image from the pretrained model followed by a patch-wise DDIM inversion step and a novel wavelet-based detail enhancer module. Specifically, we first utilize inversion methods to derive initial noise vectors that preserve global coherence from the base image. Subsequently, during sampling, our wavelet-domain detail enhancer retains low-frequency components from the base image to ensure structural consistency, while selectively guiding high-frequency components to enrich fine details and textures. Extensive evaluations using Stable Diffusion XL demonstrate that HiWave effectively mitigates common visual artifacts seen in prior methods, achieving superior perceptual quality. A user study confirmed HiWave's performance, where it was preferred over the state-of-the-art alternative in more than 80% of comparisons, highlighting its effectiveness for high-quality, ultra-high-resolution image synthesis without requiring retraining or architectural modifications.
The paper introduces HiWave, a training-free method that combines patch-wise DDIM inversion with wavelet-based detail enhancement to generate ultra-high-resolution images.
It overcomes challenges in maintaining global structure and avoiding artifacts, demonstrating superior coherence and refined textures over previous methods.
Empirical evaluations show HiWave extends SDXL outputs from 1024×1024 to resolutions up to 8192×8192 on consumer GPUs while delivering high visual quality.
The paper introduces HiWave, a training-free, zero-shot pipeline for ultra-high-resolution image synthesis using pretrained diffusion models, with a particular focus on extending the capabilities of models such as Stable Diffusion XL (SDXL) from their native 1024×1024 resolution to 4096×4096 and beyond (Vontobel et al., 25 Jun 2025). The method addresses two persistent challenges in high-resolution image generation: (1) maintaining global structural coherence and (2) avoiding artifacts such as object duplication, which are prevalent in prior patch-based and direct inference approaches.
Methodological Contributions
HiWave is structured as a two-stage pipeline:
Base Image Generation and Upscaling: A base image is generated at the model’s native resolution (e.g., 1024×1024) using a pretrained diffusion model. This image is then upscaled in the image domain (using Lanczos interpolation) to the target high resolution. Upscaling in image space rather than latent space is empirically justified: standard VAEs are not scale-equivariant, so upscaling latents directly introduces spatial artifacts.
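The first stage can be sketched as follows, a minimal example assuming Pillow for Lanczos resampling; the resolutions are those named in the paper, but the function name and scale parameter are illustrative:

```python
import numpy as np
from PIL import Image

def upscale_image_space(img: Image.Image, scale: int = 2) -> Image.Image:
    """Upscale in image (pixel) space with Lanczos resampling.

    Upscaling here, rather than in VAE latent space, avoids the spatial
    artifacts that arise because standard VAEs are not scale-equivariant.
    """
    w, h = img.size
    return img.resize((w * scale, h * scale), Image.LANCZOS)

# Example: a base image at the model's native 1024x1024 resolution,
# upscaled to 2048x2048 before re-encoding and patch-wise inversion.
base = Image.fromarray(np.zeros((1024, 1024, 3), dtype=np.uint8))
hi = upscale_image_space(base, scale=2)
```

The upscaled image is then re-encoded by the VAE before the patch-wise inversion stage described next.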
Patch-wise DDIM Inversion and Wavelet-Based Detail Enhancement: The upscaled image is encoded back into the latent space, and a patch-wise DDIM inversion is performed to recover the noise vectors corresponding to each patch. This deterministic inversion ensures that the initial noise for each patch is structurally consistent with the base image, mitigating boundary artifacts and patch seams.
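The deterministic inversion can be illustrated with a toy numpy sketch on a single latent "patch". The alpha schedule and the constant stand-in predictor are illustrative only; with a real denoiser the roundtrip is approximate rather than exact, because inversion evaluates the noise prediction at the previous state:

```python
import numpy as np

def ddim_invert(x0, eps_model, alphas):
    """Run DDIM updates backwards (x_0 -> x_T) to recover the noise
    latent that deterministically reproduces x0 under DDIM sampling.
    `alphas` is a decreasing cumulative-alpha schedule, t=0 ... t=T."""
    x = x0
    for a_cur, a_next in zip(alphas[:-1], alphas[1:]):
        eps = eps_model(x)
        x0_pred = (x - np.sqrt(1 - a_cur) * eps) / np.sqrt(a_cur)
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1 - a_next) * eps
    return x

def ddim_sample(xT, eps_model, alphas):
    """Deterministic DDIM sampling (x_T -> x_0), mirroring the inversion."""
    x = xT
    rev = alphas[::-1]
    for a_cur, a_next in zip(rev[:-1], rev[1:]):
        eps = eps_model(x)
        x0_pred = (x - np.sqrt(1 - a_cur) * eps) / np.sqrt(a_cur)
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1 - a_next) * eps
    return x

alphas = np.linspace(0.999, 0.05, 10)          # toy schedule, not SDXL's
eps_model = lambda x: np.full_like(x, 0.1)     # constant stand-in denoiser:
                                               # inversion is exactly reversible
patch = np.random.default_rng(0).normal(size=(4, 8, 8))
noise = ddim_invert(patch, eps_model, alphas)
recon = ddim_sample(noise, eps_model, alphas)  # recovers `patch`
```

Running this inversion per overlapping patch yields initial noise that is structurally consistent with the upscaled base image, which is what suppresses seams between patches.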
During the subsequent denoising process, HiWave introduces a novel wavelet-based detail enhancer. The key insight is to decompose the denoiser’s output into low- and high-frequency components via the discrete wavelet transform (DWT). The low-frequency bands, which encode global structure, are preserved from the base image, while the high-frequency bands are adaptively guided using a modified classifier-free guidance (CFG) strategy. This frequency-selective guidance enables the synthesis of fine details and textures without sacrificing global coherence.
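A minimal numpy sketch of this frequency-selective guidance is below. The paper uses the sym4 wavelet (typically via a library such as PyWavelets); a single-level Haar transform is substituted here only to keep the sketch dependency-free, and the function names are illustrative:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT -> (LL, (LH, HL, HH)), orthonormal."""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2
    return a, (h, v, d)

def haar_idwt2(a, bands):
    """Inverse of haar_dwt2."""
    h, v, d = bands
    H, W = a.shape
    x = np.empty((2 * H, 2 * W))
    x[0::2, 0::2] = (a + h + v + d) / 2
    x[0::2, 1::2] = (a - h + v - d) / 2
    x[1::2, 0::2] = (a + h - v - d) / 2
    x[1::2, 1::2] = (a - h - v + d) / 2
    return x

def detail_enhance(pred_cond, pred_uncond, base_pred, w_hf=7.5):
    """Keep the base prediction's low-frequency band (global structure);
    apply CFG only to the high-frequency bands (fine detail).
    w_hf=7.5 is the guidance strength reported in the paper."""
    _, hf_c = haar_dwt2(pred_cond)
    _, hf_u = haar_dwt2(pred_uncond)
    ll_base, _ = haar_dwt2(base_pred)
    hf = tuple(u + w_hf * (c - u) for c, u in zip(hf_c, hf_u))
    return haar_idwt2(ll_base, hf)

rng = np.random.default_rng(0)
cond, uncond, base = (rng.normal(size=(16, 16)) for _ in range(3))
out = detail_enhance(cond, uncond, base)
```

Because the DWT and its inverse are exact, the output's LL band equals the base prediction's LL band by construction, while its high-frequency bands carry the guided detail.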
Additionally, skip residuals are incorporated during early denoising steps, blending the DDIM-inverted latents with the current latents to further anchor the generation to the base image’s structure. The skip residuals are applied only in the initial steps, allowing the model to diverge and synthesize novel details in later steps.
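The skip-residual schedule reduces to a simple blend gated on the step index. The fraction of anchored steps and the blend weight below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def apply_skip_residual(z_cur, z_inv, step, n_steps,
                        skip_frac=0.3, c=0.5):
    """Blend the DDIM-inverted latent into the current latent during the
    first `skip_frac` of denoising steps, anchoring generation to the
    base image's structure; later steps run unmodified so the model can
    synthesize novel detail. `skip_frac` and `c` are illustrative."""
    if step < int(skip_frac * n_steps):
        return c * z_inv + (1.0 - c) * z_cur
    return z_cur

z_cur = np.ones((4, 8, 8))
z_inv = np.zeros((4, 8, 8))
early = apply_skip_residual(z_cur, z_inv, step=2, n_steps=50)   # blended
late = apply_skip_residual(z_cur, z_inv, step=40, n_steps=50)   # untouched
```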
Empirical Evaluation
The paper provides extensive qualitative and quantitative comparisons against state-of-the-art high-resolution generation methods, including Pixelsmith and HiDiffusion. The evaluation protocol is rigorous, employing 1000 prompts from the LAION2B-en-aesthetic dataset and conducting all experiments on a single RTX 4090 GPU.
Key empirical findings:
Artifact Reduction: HiWave effectively eliminates object duplication and spatial incoherence, which are prominent in patch-based methods such as Pixelsmith and DemoFusion.
Detail Enhancement: The method consistently produces images with sharper textures and enhanced fine details compared to both the base SDXL outputs and prior upscaling methods.
Human Preference: In a blind A/B user study with 548 evaluations across 32 image pairs, HiWave was preferred in 81.2% of cases over Pixelsmith, with several cases achieving unanimous preference.
Scalability: HiWave demonstrates the ability to scale to 8192×8192 resolution, maintaining structural coherence and detail, and can be applied to real (non-synthetic) images for high-quality upscaling.
Quantitative metrics (FID, KID, IS, CLIP, LPIPS, HPS-v2) are reported, but the authors note the limitations of these metrics at high resolutions due to the necessity of downsampling, which can obscure improvements in fine detail and perceptual quality.
Implementation Considerations
Patch Processing: Patches are processed with 50% overlap and in a streaming fashion, enabling 4096×4096 generation on consumer GPUs with 24GB VRAM.
Wavelet Choice: The sym4 wavelet is used for DWT, balancing spatial and frequency localization.
Guidance Strength: The high-frequency guidance strength is set to 7.5, empirically tuned for detail enhancement without introducing artifacts.
Progressive Upscaling: Multistep upscaling (1024→2048→4096) is favored over one-shot upscaling, as it yields sharper details without increasing duplication artifacts.
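The 50%-overlap patch layout can be sketched as follows; a simple count-normalized average is used here to merge overlapping patch outputs, which is one common choice rather than necessarily the paper's exact blending scheme, and the sizes are illustrative latent-space dimensions:

```python
import numpy as np

def patch_coords(size, patch, overlap=0.5):
    """Top-left offsets for square patches tiling `size` with the given
    fractional overlap, always covering the far edge."""
    stride = int(patch * (1 - overlap))
    coords = list(range(0, size - patch + 1, stride))
    if coords[-1] != size - patch:
        coords.append(size - patch)
    return coords

def merge_patches(patches, coords, size, patch):
    """Average overlapping per-patch outputs back into one canvas."""
    canvas = np.zeros((size, size))
    counts = np.zeros((size, size))
    for (y, x), p in zip(coords, patches):
        canvas[y:y + patch, x:x + patch] += p
        counts[y:y + patch, x:x + patch] += 1
    return canvas / counts

size, patch = 256, 128                     # illustrative latent sizes
offs = patch_coords(size, patch)           # [0, 64, 128] at 50% overlap
coords = [(y, x) for y in offs for x in offs]
img = np.random.default_rng(0).normal(size=(size, size))
patches = [img[y:y + patch, x:x + patch] for y, x in coords]
out = merge_patches(patches, coords, size, patch)
```

Processing patches one at a time in this streaming fashion is what keeps peak memory within a 24GB consumer GPU at 4096×4096.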
Theoretical and Practical Implications
HiWave’s approach demonstrates that pretrained diffusion models can be extended to ultra-high resolutions without retraining or architectural modifications, provided that frequency-aware guidance and structurally consistent initialization are employed. The explicit separation of low- and high-frequency guidance in the wavelet domain is a principled solution to the trade-off between global coherence and local detail, and the method is robust to a wide range of content types.
Practical implications include:
Production-Ready 4K+ Synthesis: HiWave enables the use of existing diffusion models in domains requiring ultra-high-resolution outputs (e.g., advertising, film, digital art) without the prohibitive cost of retraining.
Real-Image Enhancement: The method can be applied to real photographs, not just synthetic images, for high-quality upscaling and detail enhancement.
Resource Efficiency: The pipeline is compatible with consumer hardware, making high-resolution synthesis accessible to a broader user base.
Limitations and Future Directions
Inference Time: HiWave’s runtime is higher than some direct inference methods due to the patch-based and progressive upscaling strategy, though this is justified by the superior visual quality.
Metric Limitations: The inadequacy of standard quantitative metrics at high resolutions highlights the need for new evaluation protocols that better capture perceptual quality at scale.
Potential for Video and 3D: The frequency-aware guidance framework could be extended to video synthesis and 3D generative modeling, where temporal and spatial coherence are critical.
Conclusion
HiWave represents a significant advance in training-free, high-resolution image generation with diffusion models. By integrating patch-wise DDIM inversion and wavelet-domain frequency guidance, it achieves a strong balance between global structure and fine detail, outperforming prior methods in both qualitative and human preference evaluations. The approach is practical, scalable, and broadly applicable, with clear implications for both research and industry applications in high-fidelity generative modeling.