- The paper introduces InfGen, a novel framework that replaces the VAE decoder with a transformer-based generator capable of producing output at any resolution.
- The framework leverages cross-attention and implicit neural positional embeddings to convert fixed-size latents into high-quality images, delivering up to 10× faster 4K synthesis.
- The paper demonstrates plug-and-play integration with latent diffusion models, achieving state-of-the-art FID and recall across diverse resolutions.
InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
Introduction and Motivation
The paper introduces InfGen, a novel framework for arbitrary-resolution image synthesis that fundamentally rethinks the decoder stage in latent diffusion models (LDMs). While prior work has focused on improving the generative modeling and sampling efficiency of the first stage (e.g., LDMs, DiT), the decoder—typically a VAE—has remained a bottleneck for high-resolution image generation due to its fixed output size and limited generative capacity. InfGen replaces the VAE decoder with a transformer-based generator capable of reconstructing images at any resolution from a fixed-size latent, enabling rapid, high-quality synthesis without retraining the underlying diffusion model.
Methodology
Two-Stage Generation Paradigm
InfGen operates within the standard two-stage paradigm of LDMs: (1) a generative model produces a compact latent representation, and (2) a decoder reconstructs the image from this latent. The key innovation is in the second stage, where InfGen decodes the fixed-size latent into images of arbitrary resolution via a single forward pass, leveraging a transformer-based architecture with cross-attention mechanisms.
Figure 1: The InfGen generator is trained in latent space to reconstruct images at any resolution and aspect ratio, and can be applied to various diffusion models during inference.
Architecture
- Latent Conditioning: The fixed-size latent z from the VAE encoder serves as the content prompt. Mask tokens, shaped according to the target resolution, act as queries in the transformer decoder.
- Cross-Attention: The mask tokens interact with the latent via multi-head cross-attention, enabling the generator to synthesize images at arbitrary resolutions.
- Implicit Neural Positional Embedding (INPE): To align spatial information between mask and latent tokens, INPE maps coordinates to high-frequency Fourier features and then to dynamic positional encodings via an implicit neural network. This enables seamless interaction between fixed-size latents and variable-size outputs (a code sketch of the cross-attention and INPE mechanisms follows below).
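The following is a minimal sketch of how mask tokens laid out at a target resolution can query a fixed-size latent through cross-attention, with INPE-style positional encodings built from Fourier features. The module names, layer sizes, and number of frequencies are illustrative assumptions rather than the paper's exact configuration.

```python
# Illustrative sketch only; layer sizes and names are assumptions, not the paper's spec.
import torch
import torch.nn as nn


def fourier_features(coords: torch.Tensor, num_freqs: int = 16) -> torch.Tensor:
    """Map normalized (x, y) coordinates to high-frequency Fourier features."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=coords.dtype, device=coords.device)) * torch.pi
    angles = coords[..., None] * freqs                                   # (N, 2, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)   # (N, 4*num_freqs)


class INPE(nn.Module):
    """Implicit neural positional embedding: Fourier features -> small MLP -> dynamic encoding."""
    def __init__(self, dim: int, num_freqs: int = 16):
        super().__init__()
        self.num_freqs = num_freqs
        self.mlp = nn.Sequential(nn.Linear(4 * num_freqs, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.mlp(fourier_features(coords, self.num_freqs))


class InfGenBlock(nn.Module):
    """Mask tokens (queries) attend to the fixed-size latent tokens (keys/values)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, mask_tokens: torch.Tensor, latent_tokens: torch.Tensor) -> torch.Tensor:
        x = mask_tokens + self.attn(mask_tokens, latent_tokens, latent_tokens)[0]
        return x + self.ffn(self.norm(x))


def coordinate_grid(h: int, w: int) -> torch.Tensor:
    """Normalized coordinate grid for the target resolution, flattened to (h*w, 2)."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    return torch.stack([xs, ys], dim=-1).reshape(-1, 2)


# Usage: decode a 32x32 latent (1024 tokens) toward a 96x128 grid of mask tokens.
dim = 512
inpe, block = INPE(dim), InfGenBlock(dim)
latent_tokens = torch.randn(1, 32 * 32, dim)               # fixed-size latent from the frozen encoder
mask_tokens = inpe(coordinate_grid(96, 128)).unsqueeze(0)  # queries placed at target-resolution positions
features = block(mask_tokens, latent_tokens)               # (1, 96*128, dim), mapped to pixels by an output head
```

In the full model, such blocks would presumably be stacked and followed by a lightweight head that maps each mask token to a pixel patch; this sketch only illustrates the query-to-latent interaction.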
Training and Optimization
- Loss Functions: The training objective combines an L1 reconstruction loss, an LPIPS perceptual loss, and a PatchGAN adversarial loss (a sketch of the combined objective follows this list).
- Data: Training utilizes 10M high-resolution images from LAION-Aesthetic, with dynamic cropping and resizing to support arbitrary output sizes.
- Frozen Encoder: The VAE encoder is kept frozen, ensuring compatibility with existing LDMs and minimizing retraining costs.
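Below is a minimal sketch of such a combined objective, assuming the `lpips` package and a PatchGAN-style discriminator `disc` trained with hinge losses; the loss weights are illustrative and not the paper's reported values.

```python
# Illustrative combined objective; the weights and the hinge formulation are assumptions.
import torch
import torch.nn.functional as F
import lpips

perceptual = lpips.LPIPS(net="vgg")  # LPIPS perceptual distance

def generator_loss(fake: torch.Tensor, real: torch.Tensor, disc,
                   w_l1: float = 1.0, w_lpips: float = 1.0, w_adv: float = 0.1) -> torch.Tensor:
    """L1 reconstruction + LPIPS perceptual + adversarial term against a PatchGAN discriminator."""
    l1 = F.l1_loss(fake, real)
    lp = perceptual(fake, real).mean()
    adv = -disc(fake).mean()          # generator pushes the discriminator's patch logits upward
    return w_l1 * l1 + w_lpips * lp + w_adv * adv

def discriminator_loss(fake: torch.Tensor, real: torch.Tensor, disc) -> torch.Tensor:
    """Hinge loss over PatchGAN logits for real versus generated images."""
    real_logits, fake_logits = disc(real), disc(fake.detach())
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()
```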
InfGen supports iterative extrapolation for ultra-high-resolution synthesis. Starting from a low-resolution latent, the output image is re-encoded and further upscaled in subsequent iterations, enabling robust generation at resolutions far beyond the training regime.
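A minimal sketch of this extrapolation loop, assuming a frozen `encoder` and an InfGen-style `decoder(latent, h, w)` that renders at the requested size; the function names and the doubling schedule are illustrative assumptions.

```python
# Illustrative extrapolation loop; function names and the 2x-per-step schedule are assumptions.
import torch

@torch.no_grad()
def extrapolate(latent: torch.Tensor, encoder, decoder,
                base_hw: tuple, target_hw: tuple, scale: int = 2) -> torch.Tensor:
    """Decode, re-encode the output, and decode again at a larger size until the target is reached."""
    h, w = base_hw
    image = decoder(latent, h, w)
    while h < target_hw[0] or w < target_hw[1]:
        h, w = min(h * scale, target_hw[0]), min(w * scale, target_hw[1])
        latent = encoder(image)        # re-encode the current output into the shared latent space
        image = decoder(latent, h, w)  # decode at the next, larger resolution
    return image
```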
Experimental Results
Efficiency and Latency
InfGen dramatically reduces inference time for high-resolution synthesis. For 4K images, generation time drops from over 100 seconds (standard diffusion) to under 10 seconds, representing a 10× speedup over the previous fastest method, UltraPixel.
Figure 2: Inference time (seconds per image) for high-resolution image generation methods. The vertical axis is a logarithmic scale.
- Tokenization Quality: InfGen achieves competitive reconstruction metrics (FID, PSNR, SSIM) compared to VQGAN, SD-VAE, and SDXL-VAE, despite handling the harder task of decoding to arbitrary resolutions rather than a fixed output size.
- Resolution Scaling: Across multiple LDMs (DiT, SiT, FiT, SD1.5), replacing the VAE decoder with InfGen yields consistent improvements in FID, sFID, precision, and recall at all tested resolutions, with especially pronounced gains at ultra-high resolutions (e.g., 44% FID improvement at 3072×3072).
- Benchmarking: InfGen+SDXL-B-1 achieves state-of-the-art FID and recall at 1024×1024 and 2048×2048, with inference latency far below competing methods.
Qualitative Analysis
InfGen produces visually coherent, detail-rich images at arbitrary resolutions, outperforming baseline LDMs in semantic consistency and texture fidelity.
Figure 3: Visualizations of arbitrary image generation. InfGen improves the generation ability for LDMs across various resolutions.
Practical Implications
Plug-and-Play Integration
InfGen is designed as a drop-in replacement for VAE decoders in any LDM sharing the same latent space, requiring no retraining of the generative model. This enables rapid upgrades of existing diffusion pipelines to support arbitrary-resolution synthesis.
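As a rough sketch of this drop-in usage, assuming a sampler `sample_latent` from an existing LDM pipeline and a pretrained InfGen generator `infgen` that shares its latent space (both names are placeholders, not a published API):

```python
# Illustrative drop-in usage; `sample_latent` and `infgen` are placeholder names.
import torch

@torch.no_grad()
def generate(prompt: str, height: int, width: int, sample_latent, infgen) -> torch.Tensor:
    # Stage 1: the unchanged diffusion model samples a fixed-size latent.
    latent = sample_latent(prompt)                  # e.g. a (1, 4, 32, 32) latent
    # Stage 2: InfGen takes the place of vae.decode(latent) and renders the requested resolution.
    return infgen(latent, height=height, width=width)
```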
Resource Efficiency
By decoupling resolution from latent size and leveraging transformer-based decoding, InfGen achieves high-quality synthesis with minimal computational overhead, making it suitable for deployment in resource-constrained environments and real-time applications.
Generalization
The architecture supports diverse generative tasks, including class-guided generation, text-conditional generation, and inpainting, and is robust to both object-centric and scene-centric images.
Theoretical Implications and Future Directions
InfGen demonstrates that the decoder stage in LDMs can be reimagined as a generative process, not merely a reconstruction task. The use of cross-attention and INPE for arbitrary-resolution synthesis opens avenues for further research in scalable generative modeling, including:
- Continuous Latent Spaces: Extending InfGen to support continuous latent representations and variable aspect ratios.
- Multi-modal Generation: Integrating InfGen with multi-modal diffusion models for scalable video and 3D synthesis.
- Adaptive Tokenization: Investigating dynamic latent allocation strategies for even greater efficiency and fidelity.
Conclusion
InfGen establishes a new paradigm for scalable, resolution-agnostic image synthesis by transforming the decoder stage of LDMs into a generative process. It delivers substantial improvements in both quality and efficiency, enabling arbitrary-resolution generation with minimal latency and broad compatibility. The framework's plug-and-play nature and strong empirical results suggest significant potential for advancing practical and theoretical research in high-resolution generative modeling.