- The paper presents a unified autoregressive framework that consolidates diverse image editing and translation tasks into one model.
- It leverages a tokenized representation, combining a VQ-Autoencoder, a text encoder, and an autoregressive transformer to perform conditional image generation.
- The integration of a distillation loss and classifier-free guidance significantly enhances semantic alignment and image quality.
The paper "EditAR: Unified Conditional Generation with Autoregressive Models" (2501.04699) presents a novel autoregressive framework for unifying diverse conditional image generation tasks, including image editing and image-to-image translation (like depth-to-image, edge-to-image, and segmentation-to-image). The core idea is to leverage the inherent unified tokenized representation of autoregressive models to create a single model capable of handling various input modalities and generation objectives, contrasting with the prevalent diffusion-based methods which often require task-specific architectures or extensive fine-tuning for each task.
EditAR builds upon existing large-scale text-to-image autoregressive models, specifically generalizing the LlamaGen architecture. The model operates in a tokenized space: images are first converted into sequences of discrete tokens using a VQ-Autoencoder (composed of an encoder $E_I$ and decoder $D_I$). Text instructions are encoded into latent embeddings $c_T$ via a text encoder $E_T$.
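A minimal sketch of this tokenization step, assuming a generic `image_tokenizer` with an `encode` method and a frozen `text_encoder` (all names are illustrative, not the paper's API):

```python
import torch

@torch.no_grad()
def tokenize_inputs(image_tokenizer, text_encoder, image, instruction_ids):
    """Map an image to discrete VQ token indices (E_I) and an instruction to
    latent embeddings c_T (E_T); D_I would map token indices back to pixels."""
    token_ids = image_tokenizer.encode(image)   # (B, h*w) discrete codebook indices
    text_emb = text_encoder(instruction_ids)    # (B, T, D) instruction embeddings c_T
    return token_ids, text_emb
```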
The central component is an autoregressive transformer $F$ that predicts the sequence of output image tokens $s$ based on the input image tokens $c_{I_c}$ and text embeddings $c_T$. This is formulated as modeling the conditional probability $p(s_i \mid s_{<i}, c_T, c_{I_c})$. Both the condition image tokens and the target image tokens are fed into the transformer's input sequence, differentiated by distinct positional embeddings.
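To make the sequence construction concrete, here is a hedged sketch of how the condition-image tokens, target tokens, and text embeddings could be assembled for next-token training; the function and embedding names are assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def conditional_next_token_loss(transformer, tok_emb, pos_emb_cond, pos_emb_tgt,
                                cond_img_tokens, tgt_img_tokens, text_emb):
    """Model p(s_i | s_<i, c_T, c_Ic): condition and target image tokens share the
    token embedding table but receive distinct positional embeddings."""
    cond = tok_emb(cond_img_tokens) + pos_emb_cond   # (B, N, D) condition image c_Ic
    tgt = tok_emb(tgt_img_tokens) + pos_emb_tgt      # (B, N, D) target image s
    seq = torch.cat([text_emb, cond, tgt], dim=1)    # text prefix + condition + target
    logits = transformer(seq)                        # causal transformer F
    n_tgt = tgt_img_tokens.size(1)
    pred = logits[:, -n_tgt - 1:-1, :]               # positions that predict each target token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           tgt_img_tokens.reshape(-1))
```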
To support diverse image modalities (natural images, depth maps, canny edges, segmentation masks), the model utilizes specific text prompts that indicate the type of conditioning. For instance, a depth-to-image task might use the prompt "Given the depth, generate the image following the instruction: <INSTRUCTION>". For natural image editing, only the core instruction prompt is used.
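As an illustration, the task-specific prompting could be organized with simple templates like the following; only the depth-to-image wording is quoted from the paper, the other entries are hypothetical:

```python
# Hypothetical prompt templates; only the "depth" wording is quoted from the paper.
PROMPT_TEMPLATES = {
    "edit": "{instruction}",  # natural-image editing: instruction only
    "depth": "Given the depth, generate the image following the instruction: {instruction}",
    "canny": "Given the canny edge, generate the image following the instruction: {instruction}",
    "seg": "Given the segmentation, generate the image following the instruction: {instruction}",
}

def build_prompt(task: str, instruction: str) -> str:
    """Return the conditioning-aware text prompt for a given task."""
    return PROMPT_TEMPLATES[task].format(instruction=instruction)
```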
A key enhancement introduced is a distillation loss $\mathcal{L}_{\text{distill}}$. This loss aligns the feature space of the autoregressive transformer with that of a frozen vision foundation model, specifically DINOv2 ($E_{\text{distill}}$). An alignment network $A$ is used to match the feature dimensions. The distillation loss, $\mathrm{MSE}(A(F(\cdot)), E_{\text{distill}}(\cdot))$, is added to the standard cross-entropy loss $\mathcal{L}_{CE}$ for token prediction: $\mathcal{L} = \mathcal{L}_{CE} + \lambda_{\text{distill}} \cdot \mathcal{L}_{\text{distill}}$. This distillation helps the autoregressive model learn more general and semantically meaningful visual features beyond just predicting token indices.
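A minimal sketch of the combined objective, assuming `hidden` holds intermediate transformer features, `align` is the alignment network $A$, and `dino_features` are precomputed frozen DINOv2 features (names, shapes, and the default weight are illustrative):

```python
import torch.nn.functional as F

def editar_loss(logits, target_tokens, hidden, align, dino_features, lambda_distill=1.0):
    """L = L_CE + lambda_distill * L_distill (lambda value here is a placeholder)."""
    loss_ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              target_tokens.reshape(-1))
    # MSE(A(F(.)), E_distill(.)): align transformer features to frozen DINOv2 features.
    loss_distill = F.mse_loss(align(hidden), dino_features)
    return loss_ce + lambda_distill * loss_distill
```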
EditAR is trained in a fully supervised manner on a mixed dataset comprising large-scale image editing data (SEED-Data-Edit-Unsplash, PIPE) and image translation data (COCOStuff, MultiGen-20M). Images are resized to 512×512; with a VQ-Autoencoder downsampling ratio of 16, each image corresponds to 32×32 = 1024 tokens. The text encoder and distillation model are frozen. During training, dropout is applied to the conditional inputs (text and/or image) to preserve unconditional generation capabilities.
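The condition dropout could look roughly like this (dropout probabilities and null-embedding handling are assumptions):

```python
import torch

def drop_conditions(text_emb, cond_tokens, null_text_emb, null_cond_tokens,
                    p_drop_text=0.1, p_drop_image=0.1):
    """Randomly replace conditions with null placeholders so the model retains
    an unconditional mode for classifier-free guidance at inference."""
    if torch.rand(1).item() < p_drop_text:
        text_emb = null_text_emb
    if torch.rand(1).item() < p_drop_image:
        cond_tokens = null_cond_tokens
    return text_emb, cond_tokens
```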
At inference, the model generates output tokens sequentially using the standard next-token prediction paradigm. Classifier-Free Guidance (CFG) is applied during sampling to improve image quality and alignment with both the image and text conditions, using a guidance strength hyperparameter $\eta$.
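A hedged sketch of one CFG sampling step, where logits from the fully conditioned sequence and a null-conditioned sequence are combined with strength $\eta$ (function names, the default $\eta$, and the temperature handling are assumptions):

```python
import torch

@torch.no_grad()
def cfg_sample_next_token(model, seq_cond, seq_uncond, eta=1.5, temperature=1.0):
    """Combine conditional and unconditional logits and sample the next token."""
    logits_cond = model(seq_cond)[:, -1, :]      # conditioned on image tokens + text
    logits_uncond = model(seq_uncond)[:, -1, :]  # null (dropped) conditions
    logits = logits_uncond + eta * (logits_cond - logits_uncond)  # eta is a placeholder value
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```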
The practical applications of EditAR span a wide range, including:
- Image Editing: Texture manipulation, object replacement/removal, local editing, modifying styles, colors, or materials based on text instructions and an input image.
- Image Translation: Generating realistic images from structural inputs like Canny edges, depth maps, or semantic segmentation masks, guided by text.
Experiments demonstrate that a single EditAR model achieves strong performance across these diverse tasks, often outperforming or remaining competitive with state-of-the-art task-specific diffusion models, particularly on image-translation FID scores, while striking a balance between editing quality and reconstruction fidelity in image editing. Ablation studies confirm that the distillation loss improves semantic alignment and that CFG is crucial for balancing reconstruction against text-image alignment.
Implementation considerations include:
- The need for large-scale paired training data covering diverse tasks.
- Utilizing a pre-trained VQ-Autoencoder and text encoder.
- Initialization from a pre-trained text-to-image autoregressive model (LlamaGen).
- Balancing data sources and hyperparameters ($\lambda_{\text{distill}}$, $\eta$) across different tasks.
- Computational requirements are significant due to the large autoregressive transformer (based on LlamaGen GPT-XL with 36 layers); training requires multiple high-end GPUs (e.g., 8 A100s).
- The current framework is designed for single image conditioning, although the authors suggest it could be extended to multiple conditions.
- Performance on non-rigid or 3D editing might be limited by current training data.
In summary, EditAR represents a promising step towards unified conditional image generation using autoregressive models, offering a single framework for tasks traditionally handled by specialized diffusion models and demonstrating competitive performance across diverse benchmarks.