Next Visual Granularity Generation (2508.12811v1)

Published 18 Aug 2025 in cs.CV, cs.AI, and cs.LG

Abstract: We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. NVG consistently outperforms the VAR series in terms of FID scores (3.30 -> 3.03, 2.57 -> 2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.

Summary

The paper introduces a novel framework that decouples visual structure from content, enabling stage-wise, fine-grained image generation.
The paper details a multi-stage quantized autoencoder with iterative clustering to build hierarchical structure maps that enhance control and fidelity.
The paper demonstrates competitive ImageNet results with fewer training steps, highlighting its practical advantages for scalable and controllable image synthesis.

Structured Visual Granularity for Image Generation: The NVG Framework

The "Next Visual Granularity Generation" paper introduces a novel framework for image generation that explicitly models hierarchical visual structure by decomposing images into sequences of increasing granularity. The NVG approach leverages a multi-stage quantized autoencoder and a structure-aware generation pipeline, enabling fine-grained control over both content and structure at each stage. This essay provides a technical summary of the NVG framework, its implementation, empirical results, and implications for future research.

Hierarchical Visual Granularity: Representation and Construction

NVG represents images as structured sequences, where each stage corresponds to a specific level of visual granularity. At each stage, the image is encoded with a fixed spatial resolution but a varying number of unique tokens, capturing progressively finer details. The construction of this sequence is fully data-driven, employing iterative clustering in the latent space to merge visually similar tokens and form hierarchical structure maps.

The multi-granularity quantized autoencoder encodes an image into a latent tensor $\bm{Z} \in \mathbb{R}^{h \times w \times e}$, which is then quantized into content tokens $\bm{c}_i$ and structure maps $\bm{s}_i$ for each stage $i$. Each clustering step halves the number of unique tokens, and the resulting hierarchy spans coarse foreground/background separation through fine object details.

Figure 1: Construction of the visual granularity sequence on a $256^2$ image and the next visual granularity generation in the $16^2$ latent space. Top-to-bottom: Number of unique tokens, structure map, generated image.
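The halving step can be illustrated with a minimal sketch. It assumes greedy nearest-pair merging of cluster centroids, which is only one plausible way to halve the number of unique tokens; this summary does not state the paper's clustering criterion. The names `halve_clusters` and `sq_dist` are hypothetical.

```python
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def halve_clusters(centroids, labels):
    """Merge clusters pairwise, closest pair first (greedy; an assumption).

    centroids: dict mapping cluster id -> latent vector (list of floats)
    labels:    flattened structure map, one cluster id per latent position
    Returns (merged_centroids, new_labels, parent), where parent maps each
    old cluster id to the id of the coarser cluster that absorbed it.
    """
    ids = sorted(centroids)
    pairs = sorted(combinations(ids, 2),
                   key=lambda p: sq_dist(centroids[p[0]], centroids[p[1]]))
    parent, next_id = {}, 0
    for a, b in pairs:
        if a in parent or b in parent:
            continue  # each cluster is merged at most once per step
        parent[a] = parent[b] = next_id
        next_id += 1
    for i in ids:  # with an odd count, the leftover cluster stays alone
        if i not in parent:
            parent[i] = next_id
            next_id += 1
    merged = {}
    for new_id in range(next_id):
        members = [centroids[i] for i in ids if parent[i] == new_id]
        # unweighted mean of the merged centroids (a simplification)
        merged[new_id] = [sum(dim) / len(members) for dim in zip(*members)]
    new_labels = [parent[l] for l in labels]
    return merged, new_labels, parent

centroids = {0: [0.0], 1: [0.1], 2: [5.0], 3: [5.2]}
labels = [0, 1, 2, 3, 0, 2]
merged, coarse_labels, parent = halve_clusters(centroids, labels)
```

Applying the function repeatedly would produce the sequence of structure maps from fine to coarse, and the `parent` mappings record the hierarchy. Greedy pairing does not guarantee a globally optimal matching; it is used here only to keep the sketch short.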

A compact hierarchical structure embedding is introduced, encoding parent-child relationships and stage information in a $K$-dimensional bit-style vector. This embedding is RoPE-compatible and efficiently constructed from class and stage IDs.

Figure 2: $K$-dimensional structure embedding encodes hierarchical relationships, with padding and bit patterns distinguishing clusters and stages.
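One way to picture such a bit-style vector is sketched below. The layout is an assumption made for illustration: cluster IDs are taken to index a binary tree, each split contributes one entry of +1 or -1, and splits that have not yet happened at the current stage are padded with 0. The paper's exact encoding may differ, and `structure_embedding` and `child_ids` are hypothetical names.

```python
def structure_embedding(cluster_id, stage, K):
    """Encode a cluster's path through a binary hierarchy (assumed layout).

    cluster_id: integer in [0, 2**stage), read as the path from the root
    stage:      number of binary splits performed so far
    K:          embedding length, i.e. the maximum number of splits
    """
    if not 0 <= stage <= K:
        raise ValueError("stage must lie in [0, K]")
    if not 0 <= cluster_id < 2 ** stage:
        raise ValueError("cluster_id out of range for this stage")
    # Most significant bit first, so children share their parent's prefix.
    bits = [(cluster_id >> (stage - 1 - k)) & 1 for k in range(stage)]
    return [1 if b else -1 for b in bits] + [0] * (K - stage)

def child_ids(cluster_id):
    """IDs of the two children of a cluster at the next stage."""
    return 2 * cluster_id, 2 * cluster_id + 1

parent_vec = structure_embedding(2, stage=2, K=4)
left, right = child_ids(2)
right_vec = structure_embedding(right, stage=3, K=4)
```

Under this layout, a child's vector equals its parent's vector with the first padding entry replaced by +1 or -1, so both the parent-child relation and the stage (the number of non-zero entries) can be read off directly.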

NVG Generation Pipeline: Structure and Content Decoupling

The NVG generation pipeline operates in a coarse-to-fine manner, iteratively generating structure maps and content tokens at each stage. Structure generation is treated as a conditional inpainting task, leveraging a lightweight rectified flow model to produce binary cluster maps. Content generation refines the canvas by predicting the quantization error between the current and final latent representations, using a structure-aware RoPE to encode hierarchical relationships.

Figure 3: NVG generation pipeline: at each stage, structure is generated first, followed by content, both conditioned on text, current canvas, and hierarchical structure.
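Sampling from a rectified flow model amounts to integrating a learned velocity field from noise at $t=0$ to data at $t=1$. The sketch below uses fixed-step Euler integration and then thresholds the result into a binary map. The trained network is replaced by a toy velocity field that points straight at a known target, and the conditioning inputs are omitted, so `make_toy_velocity`, `euler_sample`, and `binarize` are illustrative assumptions rather than the paper's implementation.

```python
def euler_sample(velocity, x0, n_steps=25):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_toy_velocity(target):
    """Stand-in for a trained model: the straight-line field toward target."""
    def velocity(x, t):
        # t never reaches 1 inside euler_sample, so the division is safe
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity

def binarize(x, threshold=0.5):
    """Threshold continuous samples into a binary cluster map."""
    return [1 if xi > threshold else 0 for xi in x]

target = [1.0, 0.0, 0.0, 1.0]   # toy binary structure map
noise = [0.3, -0.7, 1.2, 0.1]   # stand-in for a Gaussian sample
sample = euler_sample(make_toy_velocity(target), noise, n_steps=25)
structure_map = binarize(sample)
```

Because the toy field is exactly straight, Euler integration lands on the target for any step count; a learned field is only approximately straight, which is why the number of integration steps matters in practice.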

The content generator is trained to predict the final canvas, with the difference serving as the quantization target for the current stage. This approach unifies training objectives and mitigates overfitting. Sampling strategies balance diversity and fidelity by adjusting the candidate pool and CFG scale across stages.
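The relationship between the final canvas, the current canvas, and the per-stage quantization target can be sketched on a toy one-dimensional latent. The sketch assumes one scalar content token per cluster, chosen as the codebook entry nearest to the mean residual in that cluster; the codebook values, `nearest_code`, and `refine_canvas` are hypothetical, and `final` stands in for either the ground-truth latent during training or the model's prediction at inference.

```python
def nearest_code(value, codebook):
    """Return the codebook entry closest to value."""
    return min(codebook, key=lambda c: abs(c - value))

def refine_canvas(canvas, final, structure, codebook):
    """One stage of refinement: quantize the residual per cluster, add it.

    canvas:    current latent values, one per position
    final:     final-canvas values (ground truth or model prediction)
    structure: cluster id per position for this stage
    """
    residual = [f - c for f, c in zip(final, canvas)]
    sums, counts = {}, {}
    for r, s in zip(residual, structure):
        sums[s] = sums.get(s, 0.0) + r
        counts[s] = counts.get(s, 0) + 1
    # every position in a cluster shares the same content token
    tokens = {s: nearest_code(sums[s] / counts[s], codebook) for s in sums}
    new_canvas = [c + tokens[s] for c, s in zip(canvas, structure)]
    return new_canvas, tokens

codebook = [-1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
final = [1.0, 1.0, 3.0, 3.0]
canvas = [0.0, 0.0, 0.0, 0.0]                 # generation starts from empty
canvas, tokens_1 = refine_canvas(canvas, final, [0, 0, 0, 0], codebook)
canvas, tokens_2 = refine_canvas(canvas, final, [0, 0, 1, 1], codebook)
```

In this toy run the first stage, with a single cluster, places the global mean on the canvas, and the second stage, with two clusters, corrects each region separately, mirroring the coarse-to-fine refinement described above.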

Empirical Results: Quantitative and Qualitative Analysis

NVG models are evaluated on class-conditional ImageNet generation, demonstrating clear scaling behavior and competitive performance against state-of-the-art GAN, diffusion, autoregressive, and cascaded models. NVG consistently outperforms VAR in FID, Inception Score, and recall, with fewer training steps and parameters.

Figure 4: Visualization of generated images. Top: Iterative generation process. Middle: Binary structure maps align with final images. Bottom: NVG-$d24$ generates diverse, high-quality images.

Ablation studies confirm the effectiveness of autoregressive content modeling, partial noise in structure inputs, and structure-aware RoPE. Direct prediction of next content leads to overfitting, while final canvas prediction provides richer supervision.

NVG supports explicit structure-guided generation, enabling control via geometric or semantic structure maps without additional training.

Figure 5: Structure-guided generation: images generated based on geometric binary structure maps and reference structure maps.

Stage-wise controlled generation demonstrates that fixing structure and content at early stages constrains layout and semantics, while later stages refine appearance details. NVG exhibits strong error-correction ability, generating plausible outputs even under out-of-domain class guidance.

Figure 6: Stage-wise controlled generation: top row reconstructs reference image at each stage; middle and bottom rows show preservation of content and structure, with in-domain and out-of-domain class guidance.

Implementation Details and Scaling Considerations

The NVG framework employs a self-attention backbone with parallel linear layers, scaling model width and depth according to stage. The structure generator is significantly smaller than the content generator, reflecting the lower complexity of structure embeddings. Training leverages warmup-stable-decay (WSD) learning rate scheduling, a DINO-based discriminator, and IBQ codebook initialization. The autoencoder and generator are trained on ImageNet, with the number of generation steps for the structure generator set to $n=25$.
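As a minimal sketch of the learning rate schedule, assuming WSD refers to the common warmup-stable-decay scheme: the rate ramps up linearly, holds at its peak, then decays linearly. The step counts, the peak rate, and the linear decay shape below are illustrative assumptions, not values from the paper, and `wsd_lr` is a hypothetical helper.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay learning rate (requires decay_steps > 0).

    Linear warmup to peak_lr, constant plateau, then linear decay to
    min_lr over the last decay_steps steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

schedule = [wsd_lr(s, total_steps=1000, peak_lr=1e-3,
                   warmup_steps=100, decay_steps=200)
            for s in (0, 50, 500, 900, 1000)]
```

A practical property of this schedule is that the plateau does not depend on the total step count, so a run can be extended and the decay phase applied only at the end.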

NVG achieves superior reconstruction quality with fewer unique tokens and a smaller codebook compared to VAR and other tokenizers, indicating efficient quantization and balanced codebook utilization.

Theoretical and Practical Implications

NVG addresses a key limitation of existing generative models by explicitly modeling hierarchical structure, rather than treating images as flat or unstructured data. The separation of structure and content enables direct, interpretable control over generation, facilitating applications in design, scientific visualization, and scenarios requiring hierarchical reasoning.

The framework's scalability and empirical performance suggest its suitability for large-scale, controllable generative systems. NVG's structure-aware approach integrates control mechanisms into pretraining, obviating the need for post-hoc modules.

Future Directions

NVG opens several avenues for research:

Region-Aware Generation: Direct generation using domain-specific granularity sequences, enabling fine-grained regional control.
Physical-Aware Video Generation: Tracking structured regions over time for coherent, physically realistic video synthesis.
Hierarchical Spatial Reasoning: Structured divide-and-conquer reasoning chains for spatial tasks, extending beyond patch-wise approaches.

Conclusion

The NVG framework advances image generation by modeling hierarchical visual granularity and decoupling structure from content. Its data-driven construction, structure-aware generation pipeline, and empirical results demonstrate improved fidelity, scalability, and controllability. NVG provides a foundation for future research in structured, interpretable, and region-aware generative modeling, with broad implications for both theoretical development and practical deployment in AI systems.

Authors (6)