- The paper presents a unified autoregressive framework that consolidates diverse image editing and translation tasks into one model.
- It leverages a tokenized representation, combining a VQ-Autoencoder, a text encoder, and an autoregressive transformer to perform conditional image generation.
- The integration of a distillation loss and classifier-free guidance significantly enhances semantic alignment and image quality.
The paper "EditAR: Unified Conditional Generation with Autoregressive Models" (2501.04699) presents a novel autoregressive framework for unifying diverse conditional image generation tasks, including image editing and image-to-image translation (like depth-to-image, edge-to-image, and segmentation-to-image). The core idea is to leverage the inherent unified tokenized representation of autoregressive models to create a single model capable of handling various input modalities and generation objectives, contrasting with the prevalent diffusion-based methods which often require task-specific architectures or extensive fine-tuning for each task.
EditAR builds upon existing large-scale text-to-image autoregressive models, specifically generalizing the LlamaGen architecture. The model operates in a tokenized space: images are first converted into sequences of discrete tokens using a VQ-Autoencoder (composed of an encoder $E_I$ and decoder $D_I$). Text instructions are encoded into latent embeddings $c_T$ via a text encoder $E_T$.
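A minimal sketch of this tokenization step, assuming a generic `image_tokenizer` with an `encode` method and a frozen `text_encoder` (all names are illustrative, not the paper's API):

```python
import torch

@torch.no_grad()
def tokenize_inputs(image_tokenizer, text_encoder, image, instruction_ids):
    """Map an image to discrete VQ token indices (E_I) and an instruction to
    latent embeddings c_T (E_T); D_I would map token indices back to pixels."""
    token_ids = image_tokenizer.encode(image)   # (B, h*w) discrete codebook indices
    text_emb = text_encoder(instruction_ids)    # (B, T, D) instruction embeddings c_T
    return token_ids, text_emb
```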
The central component is an autoregressive transformer $F$ that predicts the sequence of output image tokens $s$ based on the input image tokens $c_{I_c}$ and text embeddings $c_T$. This is formulated as modeling the conditional probability $p(s_i \mid s_{<i}, c_T, c_{I_c})$. Both the condition image tokens and the target image tokens are fed into the transformer's input sequence, differentiated by distinct positional embeddings.
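To make the sequence construction concrete, here is a hedged sketch of how the condition-image tokens, target tokens, and text embeddings could be assembled for next-token training; the function and embedding names are assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def conditional_next_token_loss(transformer, tok_emb, pos_emb_cond, pos_emb_tgt,
                                cond_img_tokens, tgt_img_tokens, text_emb):
    """Model p(s_i | s_<i, c_T, c_Ic): condition and target image tokens share the
    token embedding table but receive distinct positional embeddings."""
    cond = tok_emb(cond_img_tokens) + pos_emb_cond   # (B, N, D) condition image c_Ic
    tgt = tok_emb(tgt_img_tokens) + pos_emb_tgt      # (B, N, D) target image s
    seq = torch.cat([text_emb, cond, tgt], dim=1)    # text prefix + condition + target
    logits = transformer(seq)                        # causal transformer F
    n_tgt = tgt_img_tokens.size(1)
    pred = logits[:, -n_tgt - 1:-1, :]               # positions that predict each target token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           tgt_img_tokens.reshape(-1))
```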
To support diverse image modalities (natural images, depth maps, canny edges, segmentation masks), the model utilizes specific text prompts that indicate the type of conditioning. For instance, a depth-to-image task might use the prompt "Given the depth, generate the image following the instruction: <INSTRUCTION>". For natural image editing, only the core instruction prompt is used.
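As an illustration, the task-specific prompting could be organized with simple templates like the following; only the depth-to-image wording is quoted from the paper, the other entries are hypothetical:

```python
# Hypothetical prompt templates; only the "depth" wording is quoted from the paper.
PROMPT_TEMPLATES = {
    "edit": "{instruction}",  # natural-image editing: instruction only
    "depth": "Given the depth, generate the image following the instruction: {instruction}",
    "canny": "Given the canny edge, generate the image following the instruction: {instruction}",
    "seg": "Given the segmentation, generate the image following the instruction: {instruction}",
}

def build_prompt(task: str, instruction: str) -> str:
    """Return the conditioning-aware text prompt for a given task."""
    return PROMPT_TEMPLATES[task].format(instruction=instruction)
```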
A key enhancement introduced is a distillation loss $\mathcal{L}_{\text{distill}}$. This loss aligns the feature space of the autoregressive transformer with that of a frozen vision foundation model, specifically DINOv2 ($E_{\text{distill}}$). An alignment network $A$ is used to match the feature dimensions. The distillation loss, $\mathrm{MSE}(A(F(\cdot)), E_{\text{distill}}(\cdot))$, is added to the standard cross-entropy loss $\mathcal{L}_{CE}$ for token prediction: $\mathcal{L} = \mathcal{L}_{CE} + \lambda_{\text{distill}} \cdot \mathcal{L}_{\text{distill}}$. This distillation helps the autoregressive model learn more general and semantically meaningful visual features beyond just predicting token indices.
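A minimal sketch of the combined objective, assuming `hidden` holds intermediate transformer features, `align` is the alignment network $A$, and `dino_features` are precomputed frozen DINOv2 features (names, shapes, and the default weight are illustrative):

```python
import torch.nn.functional as F

def editar_loss(logits, target_tokens, hidden, align, dino_features, lambda_distill=1.0):
    """L = L_CE + lambda_distill * L_distill (lambda value here is a placeholder)."""
    loss_ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              target_tokens.reshape(-1))
    # MSE(A(F(.)), E_distill(.)): align transformer features to frozen DINOv2 features.
    loss_distill = F.mse_loss(align(hidden), dino_features)
    return loss_ce + lambda_distill * loss_distill
```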
EditAR is trained in a fully supervised manner on a mixed dataset comprising large-scale image editing data (SEED-Data-Edit-Unsplash, PIPE) and image translation data (COCOStuff, MultiGen-20M). Images are resized to 512×512; with a VQ-Autoencoder downsampling ratio of 16, each image corresponds to 32×32 = 1024 tokens. The text encoder and distillation model are frozen. During training, dropout is applied to the conditional inputs (text and/or image) to preserve unconditional generation capabilities.
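The condition dropout could look roughly like this (dropout probabilities and null-embedding handling are assumptions):

```python
import torch

def drop_conditions(text_emb, cond_tokens, null_text_emb, null_cond_tokens,
                    p_drop_text=0.1, p_drop_image=0.1):
    """Randomly replace conditions with null placeholders so the model retains
    an unconditional mode for classifier-free guidance at inference."""
    if torch.rand(1).item() < p_drop_text:
        text_emb = null_text_emb
    if torch.rand(1).item() < p_drop_image:
        cond_tokens = null_cond_tokens
    return text_emb, cond_tokens
```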
At inference, the model generates output tokens sequentially using the standard next-token prediction paradigm. Classifier-Free Guidance (CFG) is applied during sampling to improve image quality and alignment with both the image and text conditions, using a guidance strength hyperparameter $\eta$.
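A hedged sketch of one CFG sampling step, where logits from the fully conditioned sequence and a null-conditioned sequence are combined with strength $\eta$ (function names, the default $\eta$, and the temperature handling are assumptions):

```python
import torch

@torch.no_grad()
def cfg_sample_next_token(model, seq_cond, seq_uncond, eta=1.5, temperature=1.0):
    """Combine conditional and unconditional logits and sample the next token."""
    logits_cond = model(seq_cond)[:, -1, :]      # conditioned on image tokens + text
    logits_uncond = model(seq_uncond)[:, -1, :]  # null (dropped) conditions
    logits = logits_uncond + eta * (logits_cond - logits_uncond)  # eta is a placeholder value
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```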
The practical applications of EditAR span a wide range, including:
- Image Editing: Texture manipulation, object replacement/removal, local editing, modifying styles, colors, or materials based on text instructions and an input image.
- Image Translation: Generating realistic images from structural inputs like Canny edges, depth maps, or semantic segmentation masks, guided by text.
Experiments demonstrate that a single EditAR model achieves strong performance across these diverse tasks, often outperforming or remaining competitive with state-of-the-art task-specific diffusion models, particularly on image-translation FID scores, while striking a balance between editing quality and reconstruction fidelity in image editing. Ablation studies confirm that the distillation loss improves semantic alignment and that CFG is crucial for balancing reconstruction against text-image alignment.
Implementation considerations include:
- The need for large-scale paired training data covering diverse tasks.
- Utilizing a pre-trained VQ-Autoencoder and text encoder.
- Initialization from a pre-trained text-to-image autoregressive model (LlamaGen).
- Balancing data sources and hyperparameters ($\lambda_{\text{distill}}$, $\eta$) across different tasks.
- Computational requirements are significant due to the large autoregressive transformer (based on LlamaGen GPT-XL with 36 layers); training requires multiple high-end GPUs (e.g., 8 A100s).
- The current framework is designed for single image conditioning, although the authors suggest it could be extended to multiple conditions.
- Performance on non-rigid or 3D editing might be limited by current training data.
In summary, EditAR represents a promising step towards unified conditional image generation using autoregressive models, offering a single framework for tasks traditionally handled by specialized diffusion models and demonstrating competitive performance across diverse benchmarks.