Distilling Semantically Aware Orders for Autoregressive Image Generation
The paper "Distilling Semantically Aware Orders for Autoregressive Image Generation" presents a framework for a core challenge in autoregressive image generation: the order in which patches are generated. Autoregressive models excel at text generation because language imposes a natural sequential order, but images, represented in a two-dimensional space, have no comparable intrinsic ordering. The conventional choice is a raster-scan order (top-left to bottom-right); the paper argues that this order is suboptimal because it ignores the semantic dependencies within the image content.
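To make the critiqued baseline concrete, a raster-scan order simply enumerates patch indices row by row, independent of image content (a minimal sketch; the grid dimensions are arbitrary):

```python
# Raster-scan order for an image split into an h x w grid of patches:
# indices run top-left to bottom-right, row by row, regardless of
# what the image actually depicts.
def raster_scan_order(h, w):
    return [r * w + c for r in range(h) for c in range(w)]

print(raster_scan_order(2, 3))  # [0, 1, 2, 3, 4, 5]
```

The paper's point is that this fixed enumeration forces the model to predict patches in an order unrelated to their semantic dependencies.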
Methodology and Contributions
The paper proposes a method for semantically ordering patches during autoregressive image generation. The approach comprises three stages:
- Training in Arbitrary Orders: First, a model is trained to generate image patches without a fixed order, learning the joint distribution over patch sequences under randomly sampled orders. Exposure to many different orders gives the model a probabilistic understanding of the spatial relationships among patches.
- Order Extraction: Next, by observing the trained model's generative process, an image-specific generation order is inferred that respects the semantic content of the image. The extraction identifies sequences in which patches are generated according to semantic dependencies rather than a fixed structural scan.
- Fine-tuning with Inferred Orders: Finally, these distilled orders are used to fine-tune the autoregressive model, improving generation quality by adhering to semantically aware orders. This fine-tuning reinforces the model's ability to generate images in a more perceptually coherent manner.
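The three stages above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `confidence_fn` stands in for whatever per-patch likelihood or confidence signal the trained any-order model exposes.

```python
import random

# Stage 1 (any-order training): sample a random permutation of patch
# indices for each training example, so the model learns to generate
# patches in arbitrary orders.
def sample_training_order(num_patches, rng=random):
    order = list(range(num_patches))
    rng.shuffle(order)
    return order

# Stage 2 (order extraction): greedily build an order by repeatedly
# picking the not-yet-generated patch the model is most confident
# about, conditioned on the patches generated so far.
def extract_order(confidence_fn, num_patches):
    remaining = set(range(num_patches))
    order = []
    while remaining:
        best = max(remaining, key=lambda i: confidence_fn(order, i))
        order.append(best)
        remaining.remove(best)
    return order

# Stage 3 (fine-tuning) would then train the model with teacher
# forcing on the extracted orders instead of random permutations.
```

With a dummy confidence function that simply prefers low indices, `extract_order` degenerates to an ascending scan; in the paper's setting, the confidence signal instead reflects semantic dependencies learned in stage 1.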
Experiments on the Fashion Product dataset and the Multimodal CelebA-HQ dataset demonstrate improved performance over raster-scan generation. The method consistently achieves lower Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) and higher Inception Score (IS) across datasets, indicating enhanced image quality and representational accuracy.
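As a reminder of what the headline metric measures: FID is the Fréchet distance between Gaussians fitted to Inception features of real and generated images, and lower is better. A minimal numpy sketch on precomputed feature statistics (the Inception feature extraction itself is omitted):

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    # tr((S1 S2)^(1/2)) equals tr((S2^(1/2) S1 S2^(1/2))^(1/2)),
    # which lets us stay with symmetric eigendecompositions.
    vals2, vecs2 = np.linalg.eigh(sigma2)
    s2_sqrt = (vecs2 * np.sqrt(np.clip(vals2, 0, None))) @ vecs2.T
    inner = s2_sqrt @ sigma1 @ s2_sqrt
    tr_covmean = np.sum(np.sqrt(np.clip(np.linalg.eigvalsh(inner), 0, None)))
    return float(diff @ diff + np.trace(sigma1 + sigma2) - 2.0 * tr_covmean)
```

For identical statistics the distance is zero; shifting one mean by 1 in each of d dimensions with identity covariances yields a distance of d, matching the squared-mean-difference term.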
Technological Impact and Future Directions
This research represents a significant advance in autoregressive image generation, providing insights into the impact of generation order on model performance. By distilling semantically guided generation orders, the model captures contextual dependencies more effectively, leading to image synthesis closer to human expectations of semantic coherence. Such improvements could carry over to broader vision tasks requiring sequential data interpretation and manipulation, such as video frame prediction or conditional image synthesis.
The practical implications extend to multimodal models, where image and text generation must integrate seamlessly. By harmonizing semantic dependencies across modalities, models could better align visual content with text descriptions, refining applications such as content creation and personalized advertising.
Future research might explore iterative refinement of generation orders, including dynamic adaptation of generation sequences during training. Applying fully learned semantic orderings to larger datasets and higher-resolution images would test the scalability of the approach. Additionally, integrating these findings with emerging architectures, such as masked generation models or diffusion-based models, could further enhance contextual understanding and prediction capability.
In summary, this paper contributes substantively to the field of autoregressive image generation by challenging assumptions about patch order. It opens new pathways for research and implementation, emphasizing the semantic coherence of image sequences for improved generative performance.