
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling (2506.20452v1)

Published 25 Jun 2025 in cs.CV and cs.LG

Abstract: Diffusion models have emerged as the leading approach for image synthesis, demonstrating exceptional photorealism and diversity. However, training diffusion models at high resolutions remains computationally prohibitive, and existing zero-shot generation techniques for synthesizing images beyond training resolutions often produce artifacts, including object duplication and spatial incoherence. In this paper, we introduce HiWave, a training-free, zero-shot approach that substantially enhances visual fidelity and structural coherence in ultra-high-resolution image synthesis using pretrained diffusion models. Our method employs a two-stage pipeline: generating a base image from the pretrained model followed by a patch-wise DDIM inversion step and a novel wavelet-based detail enhancer module. Specifically, we first utilize inversion methods to derive initial noise vectors that preserve global coherence from the base image. Subsequently, during sampling, our wavelet-domain detail enhancer retains low-frequency components from the base image to ensure structural consistency, while selectively guiding high-frequency components to enrich fine details and textures. Extensive evaluations using Stable Diffusion XL demonstrate that HiWave effectively mitigates common visual artifacts seen in prior methods, achieving superior perceptual quality. A user study confirmed HiWave's performance, where it was preferred over the state-of-the-art alternative in more than 80% of comparisons, highlighting its effectiveness for high-quality, ultra-high-resolution image synthesis without requiring retraining or architectural modifications.

The paper introduces HiWave, a training-free, zero-shot pipeline for ultra-high-resolution image synthesis using pretrained diffusion models, with a particular focus on extending the capabilities of models such as Stable Diffusion XL (SDXL) from their native 1024×1024 resolution to 4096×4096 and beyond. The method addresses the persistent challenges in high-resolution generation: computational infeasibility of direct high-res training, duplication artifacts in patch-based methods, and global incoherence in direct inference approaches.

Methodological Contributions

HiWave is structured as a two-stage pipeline:

  1. Base Image Generation and Upscaling: A base image is generated at the model’s native resolution (e.g., 1024×1024) using a pretrained diffusion model and then upscaled in the image domain (via Lanczos interpolation) to the target high resolution. Image-domain upscaling is preferred over latent-space upscaling because standard VAEs are not scaling-equivariant, so interpolating latents directly introduces spatial artifacts.
  2. Patch-wise DDIM Inversion and Wavelet-Based Detail Enhancement: The upscaled image is encoded back into the latent space, and patch-wise DDIM inversion recovers the noise vectors corresponding to each patch. This deterministic inversion keeps each patch’s initialization consistent with the global structure, mitigating boundary artifacts and duplication. During the subsequent denoising process, a novel wavelet-based detail enhancer is applied: the discrete wavelet transform (DWT) decomposes each patch’s latent into low- and high-frequency components. The low-frequency bands, which encode global structure, are preserved from the base image, while the high-frequency bands are adaptively guided with a modified classifier-free guidance (CFG) strategy to synthesize new details. This frequency-aware guidance is what balances global coherence against local detail.
  3. Skip Residuals: To further preserve global structure, skip residuals are incorporated during the early denoising steps, blending the DDIM-inverted latents with those from the ongoing sampling process. The blending is phased out after a threshold step, allowing the model to diverge and synthesize novel details later in sampling.
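
The wavelet-domain guidance and skip-residual blending described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it substitutes a single-level Haar transform for the paper's sym4 wavelet (to keep the example dependency-free), and the skip-residual schedule shown is a hypothetical linear ramp.

```python
import numpy as np

def haar_dwt2(x):
    # Single-level 2D Haar transform: split x into a low band (LL)
    # and three high-frequency detail bands (LH, HL, HH).
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)

def haar_idwt2(ll, highs):
    # Exact inverse of haar_dwt2 (the transform is orthonormal).
    lh, hl, hh = highs
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def wavelet_detail_guidance(patch_uncond, patch_cond, base_patch, w_high=7.5):
    # Frequency-split CFG: keep the base image's low band so global
    # structure is untouched, and apply strong guidance only to the
    # high-frequency bands to synthesize new detail. w_high = 7.5
    # matches the guidance strength reported in the paper.
    ll_base, _ = haar_dwt2(base_patch)
    _, (lh_u, hl_u, hh_u) = haar_dwt2(patch_uncond)
    _, (lh_c, hl_c, hh_c) = haar_dwt2(patch_cond)
    guide = lambda u, c: u + w_high * (c - u)
    return haar_idwt2(ll_base,
                      (guide(lh_u, lh_c), guide(hl_u, hl_c), guide(hh_u, hh_c)))

def skip_residual(z_sampled, z_inverted, t, t_threshold=0.5):
    # Hypothetical linear schedule: early in denoising (t near 1) blend
    # in the DDIM-inverted latent to anchor global structure; past the
    # threshold, let sampling diverge to add novel detail.
    if t > t_threshold:
        w = (t - t_threshold) / (1.0 - t_threshold)
        return w * z_inverted + (1.0 - w) * z_sampled
    return z_sampled
```

Because the Haar pair is an exact inverse, the low band of the guided output equals the low band of the base patch by construction, which is precisely the structure-preservation property the method relies on.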

Implementation and Practical Considerations

  • Patch Processing:

Patches are processed with 50% overlap and in batches, enabling 4096×4096 generation on consumer GPUs (24GB VRAM). This streaming approach is essential for memory efficiency.
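
The tiling arithmetic behind the 50% overlap can be illustrated with a small sketch (an assumption about the scheme, not the paper's code): patch offsets along one axis, with the last patch forced flush to the edge so the axis is fully covered.

```python
def axis_positions(size, patch, overlap=0.5):
    """Top-left offsets of overlapping patches along one axis.

    Guarantees the last patch is flush with the edge, so the axis is
    covered even when (size - patch) is not a multiple of the stride.
    """
    stride = max(1, int(patch * (1 - overlap)))
    pos = list(range(0, size - patch + 1, stride))
    if pos[-1] != size - patch:
        pos.append(size - patch)
    return pos

# Example: at SDXL's 1/8 latent scale, a 4096x4096 image is a 512x512
# latent and a 1024-pixel native patch spans 128 latent units.
offsets = axis_positions(512, 128)  # [0, 64, 128, 192, 256, 320, 384]
```

With 50% overlap this gives a 7×7 grid of patches for a 4096×4096 image, which is what makes batched, streaming patch processing fit within 24GB of VRAM.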

  • Wavelet Choice and Guidance Strength:

The sym4 wavelet is used for DWT, balancing spatial and frequency localization. The high-frequency guidance strength is set to 7.5, empirically found to enhance detail without destabilizing structure.

  • Progressive Upscaling:

HiWave supports both single-step and progressive upscaling (e.g., 1024→2048→4096). The latter yields sharper details without introducing the duplication that progressive upscaling has been reported to cause in other methods.

  • Inference Time:

HiWave’s runtime is comparable to other patch-based methods (e.g., Pixelsmith), with 4096×4096 generation taking ~26 minutes on an RTX 3090. This is slower than direct inference methods but justified by the superior visual quality.

Empirical Results

  • Qualitative Evaluation:

HiWave consistently outperforms both patch-based (Pixelsmith, DemoFusion) and direct inference (HiDiffusion, FouriScale) baselines. It eliminates object duplication and boundary artifacts, and produces globally coherent, detailed images at 4K and even 8K resolutions.

  • User Study:

In a blind A/B test with 32 image pairs at 4096×4096, HiWave was preferred over Pixelsmith in 81.2% of cases (548 total evaluations), with several prompts achieving unanimous preference.

  • Quantitative Metrics:

Standard metrics (FID, KID, IS, CLIP, LPIPS, HPS-v2) are reported, but the paper notes their limited reliability at high resolutions due to downsampling. HiWave achieves comparable or better scores than baselines, but the authors emphasize the necessity of human evaluation for perceptual quality at ultra-high resolutions.

  • Ablation Studies:

The importance of DWT-based frequency guidance and DDIM inversion is demonstrated. Removing DWT guidance or DDIM inversion leads to duplication artifacts and patch inconsistencies, respectively. Progressive upscaling is shown to improve detail without introducing artifacts.

  • Real Image Upscaling:

HiWave can enhance real photographs (not just generated images) by leveraging prompt conditioning and DDIM inversion, demonstrating generalization beyond synthetic data.

Theoretical and Practical Implications

HiWave’s approach demonstrates that pretrained diffusion models can be extended to ultra-high resolutions without retraining or architectural modification, provided that frequency-aware guidance and structure-preserving inversion are employed. The method leverages the inherent high-frequency priors of diffusion models, while explicitly controlling the synthesis of new details in a manner that avoids the pitfalls of prior patch-based and direct inference methods.

The use of wavelet-domain guidance is a notable innovation, enabling selective enhancement of details while maintaining global structure. This paradigm could be extended to other generative tasks where frequency separation is beneficial, such as video synthesis, super-resolution, and image editing.

Limitations and Future Directions

  • Computational Cost:

While feasible on high-end consumer GPUs, the method is slower than direct inference approaches. Further optimization, possibly via model distillation or more efficient patch scheduling, could improve practicality for production use.

  • Prompt Dependence for Real Images:

For real-image upscaling, prompt engineering is required to align the conditioning with the image content, which may limit automation.

  • Metric Limitations:

The inadequacy of current quantitative metrics for high-resolution evaluation is highlighted. There is a need for new metrics that better capture perceptual quality at large scales.

  • Scalability:

HiWave is shown to scale to 8K, but further work is needed to assess limits in terms of both quality and computational resources.

  • Extension to Video and 3D:

The frequency-aware, patch-based approach could be adapted for temporally consistent video generation or 3D content synthesis, where global coherence and local detail are equally critical.

Conclusion

HiWave provides a robust, training-free solution for high-resolution image synthesis with pretrained diffusion models. Its combination of patch-wise DDIM inversion and wavelet-based frequency guidance sets a new standard for artifact-free, globally coherent, and richly detailed image generation at 4K and beyond. The method’s practical impact is significant for domains requiring ultra-high-resolution outputs, such as digital content creation, advertising, and film production. Future research may focus on runtime optimization, improved evaluation metrics, and generalization to other modalities.

Authors (4)
  1. Tobias Vontobel (3 papers)
  2. Seyedmorteza Sadat (9 papers)
  3. Farnood Salehi (10 papers)
  4. Romann M. Weber (12 papers)