- The paper introduces NoiseShift, a training-free method that recalibrates noise levels to mitigate resolution-dependent exposure bias in diffusion models.
- It employs a coarse-to-fine grid search to align the denoiser's conditioning with the perceptual noise distribution at the target resolution, yielding significant FID and CLIP score improvements.
- Empirical evaluations on models like Stable Diffusion and Flux-Dev demonstrate reduced artifacts and enhanced texture fidelity, enabling efficient low-resolution synthesis.
NoiseShift: Resolution-Aware Noise Recalibration for Low-Resolution Diffusion Image Synthesis
Motivation and Problem Statement
Diffusion models have established state-of-the-art performance in high-fidelity image synthesis, particularly in text-to-image generation. However, these models exhibit a pronounced degradation in image quality when tasked with generating images at resolutions lower than those used during their final training stages. This is a critical limitation for practical deployment scenarios that require computational efficiency, such as mobile inference, rapid prototyping, or large-scale sampling, where low-resolution outputs are desirable.
The core technical insight addressed in this work is that the standard noise schedulers used in diffusion models have non-uniform perceptual effects across resolutions. Specifically, the same absolute noise level $\sigma_t$ erases a disproportionately larger fraction of the perceptual signal in low-resolution images compared to high-resolution ones. This leads to a train-test mismatch: the model, when sampled at a lower resolution, is exposed to a noise distribution that is perceptually more destructive than during training, resulting in artifacts and loss of semantic fidelity.

Figure 1: At the same sampling noise level $\sigma_t$, lower-resolution images experience more severe visual and structural corruption than high-resolution counterparts.
Analysis of Resolution-Dependent Exposure Bias
The paper provides a detailed empirical and theoretical analysis of the forward-reverse process misalignment in diffusion models when sampling at resolutions different from those seen during training. While the forward (noise addition) and reverse (denoising) processes are theoretically symmetric, in practice, this symmetry breaks at test time, especially at lower resolutions.

Figure 2: The forward and reverse processes diverge during test-time sampling, with the mean squared error between predicted and actual noisy images increasing as resolution decreases and at early sampling steps.
The authors demonstrate that, for a fixed $\sigma_t$, the structural similarity index (SSIM) between the clean and noisy images degrades more rapidly at lower resolutions. This is attributed to the fact that each pixel in a low-resolution image encodes a larger region of semantic content, making the image more susceptible to perceptual corruption by noise. Consequently, the denoising model is forced to operate on inputs that are statistically and perceptually out-of-distribution relative to its training regime.
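This resolution dependence is straightforward to reproduce. The snippet below is a minimal sketch, not the authors' code: it assumes the rectified-flow forward process $x_t = (1 - t)\,x_0 + t\,\epsilon$ used by SD3-style flow matching, noises a standard test image at several resolutions with the same interpolation level $t$, and reports SSIM against the clean image, which per the analysis above should fall faster at lower resolutions.

```python
# Minimal sketch (not from the paper): noise a test image at several
# resolutions with the same interpolation level t, using the rectified-flow
# forward process x_t = (1 - t) * x0 + t * eps, then compare SSIM against
# the clean image at each resolution.
import numpy as np
from skimage import data
from skimage.metrics import structural_similarity as ssim
from skimage.transform import resize

rng = np.random.default_rng(0)

def ssim_after_noising(x0: np.ndarray, t: float) -> float:
    """SSIM between a clean image x0 (float in [0, 1]) and its noised version."""
    eps = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * eps
    return ssim(x0, x_t, data_range=x_t.max() - x_t.min())

x_full = data.camera() / 255.0  # 512x512 grayscale test image
for res in (512, 256, 128, 64):
    x = resize(x_full, (res, res), anti_aliasing=True)
    print(f"{res}x{res}: SSIM at t=0.5 = {ssim_after_noising(x, 0.5):.3f}")
```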
The NoiseShift Method
To address this resolution-dependent exposure bias, the paper introduces NoiseShift, a training-free, test-time calibration method that recalibrates the noise level conditioning of the denoiser based on the target resolution. The method does not require any modification to the model architecture or the sampling schedule and is compatible with existing pretrained diffusion models.
The key idea is to replace the nominal timestep $t$ (and its associated noise level $\sigma_t$) with a surrogate timestep $\tilde{t}$, or equivalently, a calibrated noise level $\hat{\sigma}_t$, such that the denoiser's conditioning better matches the perceptual noise distribution at the target resolution. This is achieved via a coarse-to-fine grid search that minimizes the one-step denoising error between the model's prediction and the ground-truth forward sample, performed once per resolution and cached for inference.
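The sketch below illustrates such a calibration loop under assumed interfaces: a hypothetical `denoiser(x_t, sigma_cond)` that predicts the rectified-flow velocity, a small calibration batch `x0_batch`, and illustrative grid parameters. It is not the paper's exact implementation. A coarse grid of candidate conditioning values around each nominal $\sigma_t$ is scored by one-step denoising error, then refined on a finer grid around the best coarse candidate.

```python
import torch

@torch.no_grad()
def calibrate_sigmas(denoiser, x0_batch, sigmas, coarse=11, fine=11, span=0.1):
    """Search a calibrated conditioning value for each scheduled step.

    Hypothetical interface: denoiser(x_t, sigma_cond) predicts the
    rectified-flow velocity v = eps - x0; x0_batch is a small calibration
    set at the target resolution; sigmas is the default decreasing
    sampling schedule in [0, 1] (sigmas[0] = 1 is pure noise).
    """
    eps = torch.randn_like(x0_batch)
    calibrated = []
    for i in range(len(sigmas) - 1):
        s, s_next = float(sigmas[i]), float(sigmas[i + 1])
        x_t = (1 - s) * x0_batch + s * eps               # ground-truth forward sample
        x_next_true = (1 - s_next) * x0_batch + s_next * eps

        def one_step_error(cond):
            v = denoiser(x_t, cond)                      # velocity prediction
            x_next_pred = x_t + (s_next - s) * v         # one Euler step
            return torch.mean((x_next_pred - x_next_true) ** 2).item()

        # Coarse pass over a window around the nominal sigma ...
        lo, hi = max(s - span, 0.0), min(s + span, 1.0)
        best = min(torch.linspace(lo, hi, coarse).tolist(), key=one_step_error)
        # ... then a fine pass around the best coarse candidate.
        step = (hi - lo) / (coarse - 1)
        fine_grid = torch.linspace(max(best - step, 0.0), best + step, fine)
        calibrated.append(min(fine_grid.tolist(), key=one_step_error))
    return calibrated  # one value per step; cached and reused per resolution
```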

Figure 3: The default sampling noise schedule (gray) and the resolution-specific calibrated conditioning $\hat{\sigma}_t$ for SD3 and SD3.5. At lower resolutions, the optimal $\hat{\sigma}_t$ curves deviate upward, indicating a need for stronger conditioning.
At inference, the model samples using the original noise schedule but conditions the denoiser on the precomputed, resolution-specific $\hat{\sigma}_t^*$ values. This simple intervention realigns the reverse process with the true noise distribution, mitigating exposure bias and improving image quality at low resolutions.
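Under the same assumed interface as the calibration sketch, sampling then looks like an ordinary Euler flow-matching loop in which only the conditioning argument changes; the schedule and step sizes themselves are untouched:

```python
import torch

@torch.no_grad()
def sample_with_noiseshift(denoiser, shape, sigmas, calibrated, device="cuda"):
    """Euler sampler: the trajectory follows the ORIGINAL schedule `sigmas`;
    only the conditioning fed to the denoiser is replaced with the cached
    calibrated values (one per step, from calibrate_sigmas above)."""
    x = torch.randn(shape, device=device)                     # start from pure noise
    for i in range(len(sigmas) - 1):
        v = denoiser(x, calibrated[i])                        # shifted conditioning
        x = x + (float(sigmas[i + 1]) - float(sigmas[i])) * v  # unchanged step size
    return x
```

Because only the scalar conditioning value changes, this adds no per-step cost beyond a table lookup, consistent with the paper's claim of negligible overhead.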
Empirical Evaluation
NoiseShift is evaluated on multiple state-of-the-art flow-matching diffusion models, including Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, across a range of resolutions and datasets (LAION-COCO, CelebA). The method consistently yields significant improvements in FID and CLIP scores at low and intermediate resolutions, with negligible or no degradation at the default (training) resolution.
- On LAION-COCO, NoiseShift improves SD3.5 by 15.89% and SD3 by 8.56% in FID on average at low resolutions.
- On CelebA, improvements are 10.36% for SD3.5 and 5.19% for SD3 in FID.
- For Flux-Dev, which already incorporates a resolution-aware scheduler, NoiseShift still provides measurable gains, demonstrating that the two approaches are complementary.
Qualitative results further corroborate the quantitative findings, showing that NoiseShift reduces structural artifacts and enhances texture fidelity in low-resolution generations.

Figure 4: Qualitative comparison of Flux-Dev before and after applying NoiseShift on CelebA (left) and LAION-COCO (right).
Figure 5: Qualitative comparison of SD3.5 before and after applying NoiseShift on CelebA (top) and LAION-COCO (bottom).
Ablation and Robustness
The method's calibration is robust to the number of samples used during the grid search, with the optimal $\hat{\sigma}_t$ values converging rapidly even with a small calibration set. This ensures that the computational overhead of calibration is minimal and can be amortized across many generations.

Figure 6: Ablation studies on the number of samples used during calibration and the resulting calibrated $\hat{\sigma}_t$ values at 128×128 and 256×256.
Implications and Future Directions
NoiseShift exposes a fundamental limitation in the standard design of diffusion models: the implicit assumption that a fixed noise schedule has the same perceptual effect at every resolution does not hold. The method demonstrates that simple, training-free test-time interventions can substantially improve cross-resolution generalization without retraining or architectural changes.
Practically, this enables the deployment of high-capacity diffusion models in resource-constrained environments, broadening their applicability. Theoretically, the work suggests that future diffusion model architectures and training regimes should explicitly account for the resolution-dependent perceptual effects of noise, potentially via joint training across resolutions or by learning resolution-conditional noise embeddings.
Further research could explore integrating NoiseShift with learned adapters, dynamic token routing, or resolution-specific fine-tuning, as well as extending the approach to other modalities (e.g., video, audio) where similar scale-dependent noise effects may arise.
Conclusion
NoiseShift provides a principled, efficient, and effective solution to the problem of resolution-dependent exposure bias in diffusion models. By recalibrating the denoising conditioning at test time, it achieves substantial improvements in low-resolution image generation quality, is compatible with existing models, and incurs negligible computational overhead. This work highlights the importance of perceptual alignment in generative modeling and opens new avenues for robust, scalable, and efficient diffusion-based synthesis across diverse deployment scenarios.