VAREdit: Visual Autoregressive Image Editing

Updated 4 July 2026

VAREdit is a visual autoregressive framework for instruction-guided editing that reformulates the process as next-scale prediction over discrete visual tokens.
It employs multi-scale residual prediction using a scale-aligned reference module to dynamically condition edits and preserve unedited regions.
Compared to diffusion-based methods, VAREdit achieves faster inference and improved adherence to editing instructions with fewer unintended modifications.

VAREdit is a visual autoregressive (VAR) framework for instruction-guided image editing that reformulates editing as a next-scale prediction problem over discrete visual tokens. Conditioned on source image features and text instructions, it generates multi-scale target features to achieve precise edits. The framework is motivated by the observation that diffusion-based editing relies on a global denoising process that entangles the edited region with the entire image context, which leads to spurious modifications and weaker adherence to editing instructions. Autoregressive modeling, by contrast, provides a causal and compositional mechanism that allows untouched regions to be preserved and edited regions to be updated precisely (Mao et al., 21 Aug 2025).

1. Position within instruction-guided image editing

VAREdit is situated within the transition from diffusion-based image editing to visual autoregressive editing. Diffusion systems such as InstructPix2Pix, UltraEdit, and ICEdit achieve high visual fidelity via iterative denoising, but their global denoising step couples local edits to the full image, and sampling is slow, taking several seconds per $512\times512$ edit. By contrast, autoregressive models decompose an image into discrete tokens and generate them token-by-token; VAR modeling further organizes this process as coarse-to-fine residual prediction across scales, with a single forward pass over multiple scales yielding orders-of-magnitude faster inference (Mao et al., 21 Aug 2025).

Within that setting, VAREdit adopts a next-scale prediction view of editing. Rather than treating editing as latent denoising or inversion, it conditions target generation directly on source-image features and textual instructions. The resulting design targets two properties simultaneously: editing adherence, meaning compliance with the instruction, and preservation of unedited content, meaning avoidance of spurious changes outside the intended edit region. The paper’s central claim is that a visual autoregressive formulation is structurally better aligned with this objective than diffusion’s globally entangled denoising dynamics (Mao et al., 21 Aug 2025).

2. Formalization as multi-scale residual prediction

VAREdit starts from a pre-trained VAR model, with Infinity given as an example. Let $I^{(src)}$ denote the source image, $t$ the text instruction, $\mathcal{E}, \mathcal{D}$ the VAR encoder and decoder, and $\mathcal{Q}$ the multi-scale quantizer. The source image is encoded as

$F^{(src)} = \mathcal{E}(I^{(src)}) \in \mathbb{R}^{H \times W \times d},$

and then quantized into multi-scale residual maps and features:

$R^{(src)}_{1:K}, F^{(src)}_{1:K} = \mathcal{Q}(F^{(src)}),$

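with $R_k \in \mathbb{Z}^{h_k \times w_k}$ and

$F_k \triangleq \mathrm{Up}(\mathrm{Lookup}(R_k, C), (H,W)),$

where $C$ is the quantizer codebook, $\mathrm{Lookup}$ maps each index in $R_k$ to its code vector, and $\mathrm{Up}$ resizes the result to the full $H \times W$ resolution.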
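As a rough illustration of the quantizer notation, the following NumPy sketch builds residual index maps $R_k$ and upsampled features $F_k$ for a toy feature map. The scale schedule, the random codebook `C`, the `down`/`up` helpers, and the greedy nearest-code fitting are all assumptions made for illustration; the actual Infinity tokenizer is learned and differs in its details.

```python
import numpy as np

def down(x, size):
    """Block-average an (H, W, d) map down to (h, w, d); assumes h | H and w | W."""
    H, W, d = x.shape
    h, w = size
    return x.reshape(h, H // h, w, W // w, d).mean(axis=(1, 3))

def up(x, size):
    """Nearest-neighbour upsample of an (h, w, d) map to (H, W, d)."""
    H, W = size
    return x.repeat(H // x.shape[0], axis=0).repeat(W // x.shape[1], axis=1)

def quantize_multiscale(F, C, scales):
    """Greedy residual quantization: each scale encodes what earlier scales left over."""
    H, W, _ = F.shape
    residual = F.copy()
    R_list, F_list = [], []
    for h, w in scales:
        target = down(residual, (h, w))                              # (h, w, d)
        # squared distance from every position to every code vector
        dist = ((target[:, :, None, :] - C[None, None]) ** 2).sum(-1)
        R_k = dist.argmin(-1)                                        # (h, w) integer indices
        F_k = up(C[R_k], (H, W))                                     # Up(Lookup(R_k, C), (H, W))
        residual = residual - F_k
        R_list.append(R_k)
        F_list.append(F_k)
    return R_list, F_list

rng = np.random.default_rng(0)
F_src = rng.normal(size=(8, 8, 4))          # toy encoder output, H = W = 8, d = 4
C = rng.normal(size=(16, 4))                # hypothetical codebook with 16 entries
R, F_scales = quantize_multiscale(F_src, C, scales=[(1, 1), (2, 2), (4, 4), (8, 8)])
```

Each `R[k]` is an integer map at its own resolution, while every `F_scales[k]` lives at the full $H \times W$ resolution so that the per-scale features can be summed.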

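The instruction is encoded by a text transformer into token embeddings $\{t_i\}$ and a pooled embedding. The editing task is then posed as conditional generation of target residual maps $R^{(tgt)}_{1:K}$ given the source image and instruction:

$p_\theta(R^{(tgt)}_{1:K} \mid I^{(src)}, t) = \prod_{k=1}^{K} p_\theta(R^{(tgt)}_k \mid R^{(tgt)}_{1:k-1}, I^{(src)}, t).$

Training minimizes the summed negative log-likelihood across scales:

$\mathcal{L}(\theta) = -\sum_{k=1}^{K} \log p_\theta(R^{(tgt)}_k \mid R^{(tgt)}_{1:k-1}, I^{(src)}, t).$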
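The summed negative log-likelihood can be made concrete with a tiny numerical sketch. The probabilities below are made-up stand-ins for model outputs, and the helper names `scale_nll` and `multiscale_nll` are hypothetical; the only point is that the loss adds per-token log-probabilities within a scale and then sums over scales.

```python
import math

def scale_nll(probs, targets):
    """NLL of one scale: probs[i] is a distribution over token ids for position i."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets))

def multiscale_nll(per_scale_probs, per_scale_targets):
    """Sum the per-scale NLL terms over all scales k = 1..K."""
    return sum(scale_nll(p, t) for p, t in zip(per_scale_probs, per_scale_targets))

# Two scales with a vocabulary of size 2: one token at scale 1, two tokens at scale 2.
probs = [
    [[0.5, 0.5]],
    [[0.25, 0.75], [0.5, 0.5]],
]
targets = [[0], [1, 0]]
loss = multiscale_nll(probs, targets)   # equals -log(0.5 * 0.75 * 0.5)
```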

This formulation makes the edit explicitly multi-scale. Coarse residuals establish global structure and later residuals refine details, so the edited image is constructed incrementally rather than globally perturbed in a single denoising trajectory. In the authors’ presentation, that decomposition is the core mechanism behind both efficiency and adherence (Mao et al., 21 Aug 2025).

3. Scale-Aligned Reference module

A central technical difficulty is how to condition target prediction on source-image tokens. The paper reports that conditioning only on the finest-scale source tokens is efficient but mismatched to coarse target prediction, while conditioning on all source scales is computationally costly because self-attention grows quadratically. The specific observation is that finest-scale source features cannot effectively guide the prediction of coarser target features (Mao et al., 21 Aug 2025).
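To give a feel for the cost trade-off, the sketch below counts how many source tokens a target scale would see under three conditioning schemes: all source scales, the finest source scale only, and a source reference resized to the current target scale. The scale schedule is hypothetical and the function name is illustrative; real costs also depend on the target tokens and on how many layers use each scheme.

```python
def source_context_sizes(scales):
    """Per target scale, count source tokens under three conditioning schemes."""
    tokens = [h * w for h, w in scales]
    full = sum(tokens)        # every source scale is appended
    finest = tokens[-1]       # only the finest source scale is appended
    rows = []
    for k, n_k in enumerate(tokens, start=1):
        rows.append({"scale": k, "full": full, "finest": finest, "scale_matched": n_k})
    return rows

scales = [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]   # hypothetical schedule
rows = source_context_sizes(scales)
for row in rows:
    print(row)
```

Because attention cost grows with the product of query and key counts, the gap between 341 and 256 source tokens in this toy schedule widens quickly at realistic resolutions.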

VAREdit addresses this with the Scale-Aligned Reference (SAR) module. The design is based on a scale-dependency analysis of the first transformer self-attention layer: the model naturally attends to coarser source scales to establish global layout, while deeper layers focus locally. SAR therefore operates only in the first self-attention block. It dynamically downsamples the finest-scale source feature to match the current target scale $k$:

$F^{(ref)}_k = \mathrm{Down}(F^{(src)}_K, (h_k, w_k)).$

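For target-token prediction at scale $k$, the first self-attention layer forms queries, keys, and values as

$Q_k = \tilde{F}^{(tgt)}_k W_Q,$

$K_k = [F^{(ref)}_k ; \tilde{F}^{(tgt)}_{1:k}] W_K,$

$V_k = [F^{(ref)}_k ; \tilde{F}^{(tgt)}_{1:k}] W_V,$

and computes

$\mathrm{Attn}_k = \mathrm{softmax}(Q_k K_k^\top / \sqrt{d}) V_k.$

Here $\tilde{F}^{(tgt)}_k$ denotes the target-side input tokens at scale $k$, $F^{(ref)}_k$ the finest-scale source feature downsampled to $h_k \times w_k$, $[\cdot\,;\cdot]$ concatenation along the token axis, and $W_Q, W_K, W_V$ the layer's projection matrices.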
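The scale-aligned attention step can be sketched in a few lines of NumPy. This is a single-head toy under stated assumptions: block averaging stands in for the downsampling operator, the projection matrices are random, and the context holds only the scale-matched source reference plus the current-scale target tokens (earlier target scales, multi-head structure, and masking are omitted). None of the function names come from the paper's code.

```python
import numpy as np

def down(x, size):
    """Block-average an (H, W, d) map down to (h, w, d); assumes h | H and w | W."""
    H, W, d = x.shape
    h, w = size
    return x.reshape(h, H // h, w, W // w, d).mean(axis=(1, 3))

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sar_first_layer(F_src_finest, tgt_tokens, size_k, W_q, W_k, W_v):
    """Attend from target tokens at scale k to a source reference resized to scale k."""
    h_k, w_k = size_k
    d = F_src_finest.shape[-1]
    ref = down(F_src_finest, (h_k, w_k)).reshape(h_k * w_k, d)   # scale-matched reference
    context = np.concatenate([ref, tgt_tokens], axis=0)          # source tokens first
    Q = tgt_tokens @ W_q
    K = context @ W_k
    V = context @ W_v
    A = softmax(Q @ K.T / np.sqrt(d))                            # (n_tgt, n_ref + n_tgt)
    return A @ V, A

rng = np.random.default_rng(0)
d = 4
F_src_finest = rng.normal(size=(8, 8, d))    # finest-scale source feature
tgt_tokens = rng.normal(size=(2 * 2, d))     # target inputs at a 2x2 scale
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = sar_first_layer(F_src_finest, tgt_tokens, (2, 2), W_q, W_k, W_v)
```

With a $2 \times 2$ target scale the context holds 4 reference tokens and 4 target tokens, instead of the 64 finest-scale source tokens that would otherwise be attended to.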

Subsequent self-attention layers revert to conditioning only on the finest-scale source, together with textual cross-attention. The intended effect is to preserve efficiency while restoring correct multi-scale alignment: the first layer receives scale-matched source context for global layout, but the rest of the network retains the lighter conditioning path used for fast inference (Mao et al., 21 Aug 2025).

4. Autoregressive generation and training regime

Inference proceeds scale by scale. At each scale $k$, VAREdit forms a cumulative feature from previously generated residuals,

$F^{(tgt)}_{1:k} = \sum_{i=1}^{k} \mathrm{Up}(\mathrm{Lookup}(R^{(tgt)}_i, C), (H, W)),$

then downsamples it to initialize the next scale,

$\tilde{F}^{(tgt)}_{k+1} = \mathrm{Down}(F^{(tgt)}_{1:k}, (h_{k+1}, w_{k+1})).$

The appended token sequence includes the finest-scale source tokens, $F^{(src)}_K$, the previously generated target features $\tilde{F}^{(tgt)}_{1:k}$, and a start-of-scale token. In the first self-attention layer SAR is applied; deeper layers use causal masking so that each target token may attend only to target tokens at or before its own position in the generation order, plus all source tokens. The model then decodes the next-scale residual tokens $R^{(tgt)}_{k+1}$ (Mao et al., 21 Aug 2025).
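The scale-by-scale loop can be sketched as follows. The transformer is replaced by a hypothetical `predict_residual` callback, a zero map stands in for the start-of-scale input at the first scale, and the codebook and scale schedule are invented; the sketch shows only the accumulate-then-downsample bookkeeping, not the model.

```python
import numpy as np

def down(x, size):
    """Block-average an (H, W, d) map down to (h, w, d); assumes h | H and w | W."""
    H, W, d = x.shape
    h, w = size
    return x.reshape(h, H // h, w, W // w, d).mean(axis=(1, 3))

def up(x, size):
    """Nearest-neighbour upsample of an (h, w, d) map to (H, W, d)."""
    H, W = size
    return x.repeat(H // x.shape[0], axis=0).repeat(W // x.shape[1], axis=1)

def generate(predict_residual, C, scales, full_size):
    """Coarse-to-fine generation: predict a residual map, accumulate it, move on."""
    H, W = full_size
    cumulative = np.zeros((H, W, C.shape[1]))      # sum of decoded residuals so far
    residuals = []
    for k, (h, w) in enumerate(scales):
        x_in = down(cumulative, (h, w))            # input for this scale (zeros at k = 0)
        R_k = predict_residual(k, x_in)            # (h, w) integer map from the model
        cumulative = cumulative + up(C[R_k], (H, W))
        residuals.append(R_k)
    return residuals, cumulative

def stub_predictor(k, x_in):
    """Stand-in for the transformer: always emits code index k at every position."""
    return np.full(x_in.shape[:2], k, dtype=int)

rng = np.random.default_rng(0)
C = rng.normal(size=(16, 4))                       # hypothetical codebook
scales = [(1, 1), (2, 2), (4, 4), (8, 8)]
residuals, feature = generate(stub_predictor, C, scales, full_size=(8, 8))
```

In a full pipeline the final cumulative feature would be passed to the decoder $\mathcal{D}$ to produce the edited image.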

Training uses 3.92 M edit pairs from SEED-Data-Edit and ImgEdit, filtered by a VLM, Kimi-VL. Two model sizes are reported, 2.2B and 8.4B parameters, both initialized from Infinity. The 2.2B model is trained in two stages, 8 K iterations followed by 7 K iterations, each stage with its own resolution, batch size, and learning rate. The 8.4B model is trained for 26 K iterations at a single setting of resolution, batch size, and learning rate. Optimization uses AdamW with weight decay. At inference time the model uses classifier-free guidance and temperature-scaled sampling (Mao et al., 21 Aug 2025).
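For readers unfamiliar with the inference-time controls, the sketch below shows one common form of classifier-free guidance on logits followed by temperature sampling. Whether VAREdit combines conditional and unconditional predictions in exactly this way is an assumption, and the `strength` and `temperature` values are placeholders rather than the paper's settings.

```python
import math
import random

def cfg_logits(cond, uncond, strength):
    """Push logits away from the unconditional prediction: u + s * (c - u)."""
    return [u + strength * (c - u) for c, u in zip(cond, uncond)]

def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                   # subtract max for stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

cond = [2.0, 0.0, -1.0]      # toy logits given source image and instruction
uncond = [1.0, 0.0, 0.0]     # toy logits with the conditioning dropped
guided = cfg_logits(cond, uncond, strength=3.0)       # placeholder strength
token = sample_token(guided, temperature=0.5, rng=random.Random(0))
```

A strength of 1 returns the conditional logits unchanged, larger values exaggerate the difference between the two predictions, and a lower temperature concentrates sampling on the highest-scoring tokens.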

5. Empirical performance, ablations, and qualitative behavior

The primary evaluation metric is GPT-Balance, defined as the harmonic mean of GPT-Success and GPT-Overedit scores from GPT-4o. Secondary metrics are CLIP-Out., CLIP-Dir., CLIP-Whole, and CLIP-Edit. Evaluation is reported on EMU-Edit with 3,589 samples and PIE-Bench with 700 samples. On EMU-Edit, InstructPix2Pix scores 2.92 GPT-Balance, UltraEdit 4.54, ICEdit 4.79, VAREdit-2.2B 5.66, and VAREdit-8.4B 6.77. On PIE-Bench, the corresponding scores are 4.03, 5.58, 4.93, 7.00, and 7.30. For $512\times512$ editing time, the reported values are 3.5 s for InstructPix2Pix, 2.6 s for UltraEdit, 8.4 s for ICEdit, 0.7 s for VAREdit-2.2B, and 1.2 s for VAREdit-8.4B. From these figures, VAREdit-8.4B improves GPT-Balance over the strongest diffusion baseline by roughly 41% on EMU-Edit (6.77 against ICEdit's 4.79) and roughly 31% on PIE-Bench (7.30 against UltraEdit's 5.58), and is about $2.2\times$ faster than UltraEdit (1.2 s against 2.6 s) (Mao et al., 21 Aug 2025).
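The arithmetic behind these comparisons is simple enough to check directly. The sketch below defines a harmonic mean, as used for GPT-Balance, and recomputes the relative gains and speed-up from the scores quoted above. Whether the paper averages per sample or over aggregate scores is not specified here, so the `harmonic_mean` call uses made-up inputs purely to show the formula.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0 if both are 0."""
    return 2 * a * b / (a + b) if (a + b) else 0.0

def relative_gain(new, base):
    """Fractional improvement of `new` over `base`."""
    return (new - base) / base

emu = {"InstructPix2Pix": 2.92, "UltraEdit": 4.54, "ICEdit": 4.79,
       "VAREdit-2.2B": 5.66, "VAREdit-8.4B": 6.77}
pie = {"InstructPix2Pix": 4.03, "UltraEdit": 5.58, "ICEdit": 4.93,
       "VAREdit-2.2B": 7.00, "VAREdit-8.4B": 7.30}
seconds = {"InstructPix2Pix": 3.5, "UltraEdit": 2.6, "ICEdit": 8.4,
           "VAREdit-2.2B": 0.7, "VAREdit-8.4B": 1.2}
diffusion = ["InstructPix2Pix", "UltraEdit", "ICEdit"]

best_emu = max(diffusion, key=emu.get)            # strongest diffusion baseline on EMU-Edit
best_pie = max(diffusion, key=pie.get)            # strongest diffusion baseline on PIE-Bench
gain_emu = relative_gain(emu["VAREdit-8.4B"], emu[best_emu])
gain_pie = relative_gain(pie["VAREdit-8.4B"], pie[best_pie])
speedup = seconds["UltraEdit"] / seconds["VAREdit-8.4B"]

example_balance = harmonic_mean(8.0, 4.0)         # illustrative inputs, not paper data
print(best_emu, round(gain_emu, 3), best_pie, round(gain_pie, 3), round(speedup, 2))
```

The harmonic mean penalizes imbalance: a method that edits successfully but over-edits heavily scores closer to its weaker component than an arithmetic mean would suggest.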

On CLIP-based metrics, VAREdit is reported as competitive or superior across image-level and region-level scores. The category-wise breakdown is described as state-of-the-art across object addition and removal, attribute change, and style or color transfer, with the best scaling properties in the paper’s radar plots. Qualitative examples show diffusion methods bleeding edits into the background or dropping requested changes, while vanilla AR editing with EditAR often fails to edit; VAREdit is described as applying precise, instruction-aligned changes while preserving unedited regions (Mao et al., 21 Aug 2025).

The SAR ablation isolates the contribution of scale-aligned conditioning. For the 2.2B model at 256 px, full-scale conditioning obtains EMU-Edit GPT-Balance 4.97 and PIE-Bench GPT-Balance 6.42; finest-scale-only conditioning obtains 5.25 and 6.59; SAR obtains 5.57 and 6.68. Relative inference time is reported alongside each variant. The accompanying interpretation is explicit: full-scale conditioning is slow and over-edits, finest-scale-only conditioning is fast but mismatches coarse scales, and SAR recovers multi-scale alignment while boosting GPT-Overedit and overall balance (Mao et al., 21 Aug 2025).

The name “VAREdit” is not unique in the literature. In natural-language processing, “Variational Inference for Learning Representations of Natural Language Edits” presents a variational model for document-edit representations and is summarized as VAREdit, also referred to as EVE; its subject is latent edit representation learning for parallel document pairs rather than image editing (Marrese-Taylor et al., 2020). In visual generation, a later paper titled “Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models” also presents a VAREdit framework, based on a Next-Scale-Prediction VAR backbone, coarse-to-fine token localization, feature injection from source intermediate representations, and reinforcement-learning-based adaptive feature injection to optimize CLIP and SSIM jointly (Xia et al., 30 Mar 2026).

Subsequent VAR editing work also broadens the design space beyond VAREdit’s scale-aligned conditioning. “Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing” introduces VARIN, described as the first noise inversion-based editing technique designed explicitly for VAR models; it uses Location-aware Argmax Inversion to recover inverse Gumbel noises and then interpolates those noises during prompt-guided re-sampling (Dao et al., 2 Sep 2025). “Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models” introduces BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity that couples source-negative per-bit guidance in a Bernoulli-KL trust region with mask-gated residual re-injection in the native sum-of-scales code field (Zhang et al., 11 Jun 2026).

Taken together, these works indicate that visual autoregressive editing is not a single method family but a cluster of mechanisms operating at different representational levels: next-scale conditional feature generation in VAREdit, token localization and feature injection for structure preservation, inversion of discrete sampling noise, and bitwise or residual code manipulation. This suggests that the distinctive contribution of VAREdit in the 2025 sense is its reframing of instruction-guided editing as next-scale prediction with scale-aligned conditioning, rather than inversion-based control or mask-gated code arithmetic (Xia et al., 30 Mar 2026).