Location-aware Argmax Inversion (LAI)
- Location-aware Argmax Inversion (LAI) is a pseudo-inverse method that computes inverse noise for discrete token generation in visual autoregressive models.
- It employs label and truncated Gumbel sampling to ensure that the perturbed logits recover the original token assignment under a controlled margin (τ).
- LAI underpins VARIN by enabling precise source reconstruction and smooth prompt-guided image editing without additional model training.
Location-aware Argmax Inversion (LAI) is a pseudo-inverse function for argmax sampling introduced within VARIN, a noise inversion-based method for prompt-guided image editing in visual autoregressive models. It is designed for discrete token generation settings in which a model samples codebook indices by adding Gumbel noise to logits and then taking an argmax. Because argmax is many-to-one, the forward sampling step has no true inverse; LAI addresses this by constructing inverse noises that both recover the original token assignment under argmax and remain compatible with Gumbel-distributed sampling. In the formulation presented for next-scale autoregressive image generators such as VAR and HART, LAI is the mechanism that enables precise source reconstruction and controllable editing without additional training (Dao et al., 2 Sep 2025).
1. Problem setting and design objectives
Autoregressive visual models such as VAR or HART predict discrete token maps via a sequence of multinomial draws. At scale $k$, for a map with $N$ token positions and a codebook of size $V$, the model produces unnormalized log-probabilities (logits)
$$\ell_k \in \mathbb{R}^{N \times V},$$
draws independent Gumbel noises $g_k \sim \mathrm{Gumbel}(0, 1)$ of the same shape, forms perturbed logits
$$\tilde{\ell}_k = \ell_k + g_k,$$
and selects tokens by
$$y_{k,i} = \arg\max_{j \in \{1, \dots, V\}} \tilde{\ell}_{k,i,j} \quad \text{for each position } i.$$
The inversion problem arises when one wishes to recover a noise-like latent description from a source image so that the autoregressive generator can later be steered by a new prompt while preserving source content. Since the argmax map is many-to-one, exact recovery of the original Gumbel noise $g_k$ is impossible from the logits $\ell_k$ and the tokens $y_k$ alone. The objective is therefore to compute inverse noises $\hat{g}_k$ such that $\ell_k + \hat{g}_k$ has argmax equal to the original $y_k$, while also retaining compatibility with genuine Gumbel noise. The paper formulates two requirements for such a pseudo-inverse: first, the extracted noises should resemble draws from $\mathrm{Gumbel}(0, 1)$; second, they should retain meaningful location-specific bias derived from the original logits $\ell_k$, so that editing can preserve unedited content (Dao et al., 2 Sep 2025).
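The many-to-one nature of the forward step can be seen in a small simulation. The following is a minimal sketch using only the Python standard library; the logit values, the helper names `sample_gumbel` and `gumbel_max_sample`, and the 0-based class indices are illustrative assumptions, not taken from the paper.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    """Draw one standard Gumbel(0, 1) sample via the inverse CDF."""
    # Keep u strictly inside (0, 1) so both logarithms are finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, rng):
    """Return (token, noise) for one position using Gumbel noise plus argmax."""
    noise = [sample_gumbel(rng) for _ in logits]
    perturbed = [l + g for l, g in zip(logits, noise)]
    token = max(range(len(logits)), key=perturbed.__getitem__)
    return token, noise


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]  # illustrative values, not from the paper

# Group the sampled noise vectors by the token they produce.
by_token = {}
for _ in range(200):
    token, noise = gumbel_max_sample(logits, rng)
    by_token.setdefault(token, []).append(noise)

# Many distinct noise vectors map to the same token, so the token alone
# cannot identify which noise vector was drawn.
most_common = max(by_token, key=lambda t: len(by_token[t]))
distinct = {tuple(n) for n in by_token[most_common]}
```

Here `distinct` holds many different noise vectors that all yield the same token, which is why an inversion method can only choose one representative noise rather than recover the original.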
2. Formal definition and notation
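LAI is described at a single scale $k$, with the scale index omitted for brevity. The basic objects are:
- $N$: number of token positions in the map.
- $V$: size of the discrete codebook.
- $\ell \in \mathbb{R}^{N \times V}$: model logit tensor, where $\ell_{i,j}$ is the unnormalized log-probability of class $j$ at position $i$.
- $y \in \{1, \dots, V\}^{N}$: ground-truth discrete labels extracted by the VAR-VAE encoder from the source image.
- $Y \in \{0, 1\}^{N \times V}$: one-hot encoding of $y$, with $Y_{i,j} = 1$ if and only if $j = y_i$.
- $\hat{g} \in \mathbb{R}^{N \times V}$: inverse-Gumbel noise to be computed, satisfying $\arg\max_j (\ell_{i,j} + \hat{g}_{i,j}) = y_i$ for every position $i$.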
A scalar hyperparameter $\tau$ controls the enforced gap between the sampled label logit and all other logits. Larger $\tau$ imposes a stronger bias toward the source label. This parameter is central to the trade-off between preservation and randomness in the inversion procedure (Dao et al., 2 Sep 2025).
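The defining property of the perturbed logits, namely that the label entry exceeds every other entry by at least the gap, can be expressed as a small check. This is a sketch only: the function name `satisfies_margin`, the nested-list layout, and the 0-based labels are assumptions made for illustration.

```python
def satisfies_margin(perturbed, labels, tau):
    """Check that each label logit beats every other logit by at least tau.

    perturbed: list of N rows, each a list of V perturbed logits.
    labels:    list of N integer class indices (0-based in this sketch).
    tau:       enforced gap.
    """
    for row, y in zip(perturbed, labels):
        label_value = row[y]
        for j, value in enumerate(row):
            if j != y and value > label_value - tau:
                return False
    return True


# Illustrative values only (N = 2 positions, V = 3 classes).
perturbed = [[1.3, 0.9, 0.2], [0.1, 2.4, 1.0]]
labels = [0, 1]
ok_small_gap = satisfies_margin(perturbed, labels, 0.3)   # gap of 0.4 at row 0 suffices
ok_large_gap = satisfies_margin(perturbed, labels, 0.5)   # row 0 violates a 0.5 gap
```

A larger gap is harder to satisfy, which mirrors the stated trade-off: raising the gap strengthens the bias toward the source label.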
3. Construction of the pseudo-inverse
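LAI proceeds position-wise in two stages. For a fixed position $i$, it first isolates the logit corresponding to the source label: $\ell_{i,y_i} = \sum_{j} Y_{i,j}\, \ell_{i,j}$. It then samples a “label Gumbel” by
$$\tilde{\ell}_{i,y_i} = \ell_{i,y_i} + g, \qquad g \sim \mathrm{Gumbel}(0, 1).$$
For each non-label class $j \neq y_i$, LAI samples from a truncated Gumbel distribution: $\tilde{\ell}_{i,j} \sim \mathrm{TruncGumbel}(\ell_{i,j},\ \tilde{\ell}_{i,y_i} - \tau)$. The truncation enforces
$$\tilde{\ell}_{i,j} \le \tilde{\ell}_{i,y_i} - \tau \quad \text{for all } j \neq y_i,$$
which guarantees that the argmax remains the source label. The full vector $\tilde{\ell}_i$ is then assembled by setting the label entry to $\tilde{\ell}_{i,y_i}$ and the remaining entries to the truncated samples, and the inverse noise is defined as
$$\hat{g}_i = \tilde{\ell}_i - \ell_i.$$
The operator $\mathrm{TruncGumbel}(\phi, b)$, a Gumbel with location $\phi$ truncated to values at most $b$, may be implemented by sampling $G \sim \mathrm{Gumbel}(\phi, 1)$ and returning
$$-\log\left(e^{-b} + e^{-G}\right).$$
By construction, $\arg\max_j (\ell_{i,j} + \hat{g}_{i,j}) = y_i$, and the resulting inverse noise consists of Gumbel or truncated-Gumbel draws centered around the original logits. This is the sense in which LAI satisfies both the distributional objective and the location-specific bias objective (Dao et al., 2 Sep 2025).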
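The two-stage construction for a single position can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the truncated sampler uses the standard identity that $-\log(e^{-b} + e^{-G})$ with $G$ a Gumbel draw is a Gumbel draw conditioned to be at most $b$, and the names `sample_gumbel`, `trunc_gumbel`, and `lai_position` as well as the 0-based labels are hypothetical.

```python
import math
import random


def sample_gumbel(loc, rng):
    """Draw one Gumbel(loc, 1) sample via the inverse CDF."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return loc - math.log(-math.log(u))


def trunc_gumbel(loc, bound, rng):
    """Draw a Gumbel(loc, 1) sample conditioned to be at most `bound`."""
    g = sample_gumbel(loc, rng)
    # -log(exp(-bound) + exp(-g)), computed by factoring out the larger exponent.
    m = max(-bound, -g)
    return -(m + math.log(math.exp(-bound - m) + math.exp(-g - m)))


def lai_position(logits, label, tau, rng):
    """Return (perturbed_logits, inverse_noise) for one token position."""
    # Stage 1: perturb the source-label logit with an ordinary Gumbel draw.
    perturbed_label = sample_gumbel(logits[label], rng)
    # Stage 2: every other class is truncated at least tau below the label.
    bound = perturbed_label - tau
    perturbed = [
        perturbed_label if j == label else trunc_gumbel(l, bound, rng)
        for j, l in enumerate(logits)
    ]
    inverse_noise = [p - l for p, l in zip(perturbed, logits)]
    return perturbed, inverse_noise


rng = random.Random(1)
logits = [2.0, 0.5, -1.0]  # illustrative values
perturbed, inverse_noise = lai_position(logits, label=2, tau=0.1, rng=rng)
```

Even when the source label is the least likely class, as with `label=2` here, the returned perturbed logits have the label as their argmax, because the other entries are capped below it.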
4. Algorithmic form, hyperparameters, and computational profile
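In vectorized form, LAI takes as input the ground-truth labels $y$, model logits $\ell$, and truncation gap $\tau$. The algorithm forms the one-hot mask $Y$, computes the label logits $\ell_y = \sum_j Y_{\cdot,j}\, \ell_{\cdot,j}$, samples the label Gumbels $\tilde{\ell}_y \sim \mathrm{Gumbel}(\ell_y, 1)$, samples truncated Gumbels for all classes using truncation level $\tilde{\ell}_y - \tau$, overwrites the label entry with $\tilde{\ell}_y$, and returns the perturbed logits $\tilde{\ell}$. The inverse noise is then $\hat{g} = \tilde{\ell} - \ell$.
Two implementation details are explicitly noted. First, numerical stability in $\mathrm{TruncGumbel}$ may require clamping the argument of $\log$. Second, the choice of $\tau$ trades off strict preservation against increased randomness; the paper reports a setting of $\tau$ that works well empirically in HART. For $N$ positions and a codebook of size $V$, the computational cost is $O(NV)$, since LAI samples one label Gumbel and $V - 1$ non-label Gumbels per position, and the entire procedure is fully parallelizable on GPU (Dao et al., 2 Sep 2025).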
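The vectorized procedure can be sketched with NumPy as a stand-in for a GPU tensor library. This is an assumption-laden illustration, not the paper's code: the function name `lai_vectorized`, the clamp value `eps`, the 0-based labels, and the particular truncated-Gumbel formula are choices made here.

```python
import numpy as np


def lai_vectorized(labels, logits, tau, rng, eps=1e-20):
    """Sketch of LAI for one scale.

    labels: (N,) int array of source token indices (0-based).
    logits: (N, V) float array of model logits.
    Returns (perturbed, inverse_noise), both of shape (N, V).
    """
    n, v = logits.shape
    mask = np.zeros((n, v), dtype=bool)
    mask[np.arange(n), labels] = True                 # one-hot mask

    label_logit = (logits * mask).sum(axis=1)         # logit of the source label
    label_pert = label_logit + rng.gumbel(size=n)     # label Gumbel
    bound = (label_pert - tau)[:, None]               # truncation level per position

    g = logits + rng.gumbel(size=(n, v))              # Gumbel(logits, 1), all classes
    # Truncated Gumbel: -log(exp(-bound) + exp(-g)), with the log argument clamped.
    arg = np.exp(-bound) + np.exp(-g)
    trunc = -np.log(np.clip(arg, eps, None))

    # Overwrite the label entry with the untruncated label Gumbel.
    perturbed = np.where(mask, label_pert[:, None], trunc)
    return perturbed, perturbed - logits


rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))          # illustrative N = 5, V = 8
labels = rng.integers(0, 8, size=5)
perturbed, inverse_noise = lai_vectorized(labels, logits, tau=0.1, rng=rng)
```

Every operation touches each of the $N \times V$ entries a constant number of times, consistent with the stated cost. An alternative to clamping is to compute the same quantity with `np.logaddexp(-bound, -g)`, which avoids forming the exponentials directly.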
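A toy example in the paper illustrates the mechanism for a three-class codebook. Given the logits, a source token, and a gap $\tau$, one samples a label Gumbel and adds it to the source-token logit to obtain the perturbed label logit. The truncated draws for the two remaining classes each lie at least $\tau$ below that value, and the inverse noise is the entrywise difference between the perturbed logits and the original logits. The argmax remains the source class, while the inverse noise remains a plausible Gumbel-like perturbation.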
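A numeric instance of such a toy example can be written out step by step. All numbers below are illustrative and are not the values used in the paper; the "draws" are fixed by hand so the arithmetic is easy to follow, and class indices are 0-based.

```python
# Illustrative numbers only; they are not the values used in the paper.
logits = [1.0, 0.5, -0.2]    # V = 3 classes
label = 0                    # source token (0-based index)
tau = 0.1

label_gumbel = 0.3                                # pretend draw from Gumbel(0, 1)
perturbed_label = logits[label] + label_gumbel    # about 1.3
bound = perturbed_label - tau                     # about 1.2

# Pretend truncated draws for the two non-label classes; both must be <= bound.
truncated = {1: 0.9, 2: 0.2}
assert all(value <= bound for value in truncated.values())

perturbed = [perturbed_label if j == label else truncated[j] for j in range(3)]
inverse_noise = [p - l for p, l in zip(perturbed, logits)]   # about [0.3, 0.4, 0.4]

# Adding the inverse noise back to the logits recovers the source token.
recovered = max(range(3), key=lambda j: logits[j] + inverse_noise[j])
```

The inverse noise here is of ordinary magnitude for Gumbel samples, yet it is guaranteed to select class 0 when added back to the logits.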
5. Role within VARIN for text-based image editing
LAI is the inversion component of VARIN, where it is applied during the source-image inversion phase. Given a source image $x$, the VAR-VAE encoder produces token maps $\{y_k\}_{k=1}^{K}$. For each scale $k$, the autoregressive model produces logits $\ell_k$, and $\mathrm{LAI}(y_k, \ell_k, \tau)$ generates perturbed logits $\tilde{\ell}_k$, from which the inverse noise
$$\hat{g}_k = \tilde{\ell}_k - \ell_k$$
is obtained. Collecting $\{\hat{g}_k\}_{k=1}^{K}$ yields a latent “noise fingerprint” that perfectly reconstructs the source when reused with the source logits $\ell_k$.
During editing, for each scale $k$, fresh random Gumbel noise $g_k \sim \mathrm{Gumbel}(0, 1)$ is interpolated with the inverse noise $\hat{g}_k$: $\tilde{\ell}_k^{\mathrm{edit}} = \ell_k^{\mathrm{edit}} + \lambda_k\, \hat{g}_k + (1 - \lambda_k)\, g_k$, where $\ell_k^{\mathrm{edit}}$ denotes the logits under the edited prompt and $\lambda_k$ is a scale-dependent mixing coefficient. The reported schedule usually fixes $\lambda_k$ at its initial value at the start scale and then decreases it linearly over the following scales. Taking the argmax produces the edited token maps $y_k^{\mathrm{edit}}$, and the VAR-VAE decoder reconstructs the edited image. In this formulation, LAI-extracted inverse noises keep sampling close to the source in unedited regions while permitting controlled deviations where the edited prompt exerts pressure (Dao et al., 2 Sep 2025).
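The editing step can be sketched as a schedule plus a per-scale mix. This sketch assumes a linear (convex) interpolation between inverse and fresh noise, with the coefficient weighting the inverse noise; the function names, the schedule endpoints, and the choice to hold the start value before the start scale are all hypothetical and are not taken from the paper.

```python
def mixing_schedule(num_scales, start_scale, end_scale, lam_start, lam_end):
    """Piecewise-linear mixing coefficients, one per scale (0-based scales)."""
    lams = []
    for k in range(num_scales):
        if k <= start_scale:
            lams.append(lam_start)          # assumption: hold the start value early on
        elif k >= end_scale:
            lams.append(lam_end)
        else:
            t = (k - start_scale) / (end_scale - start_scale)
            lams.append(lam_start + t * (lam_end - lam_start))
    return lams


def edit_tokens(edit_logits, inverse_noise, fresh_noise, lam):
    """Mix the noises, add them to the edited-prompt logits, take the argmax per position."""
    tokens = []
    for l_row, inv_row, g_row in zip(edit_logits, inverse_noise, fresh_noise):
        perturbed = [
            l + lam * gi + (1.0 - lam) * g
            for l, gi, g in zip(l_row, inv_row, g_row)
        ]
        tokens.append(max(range(len(perturbed)), key=perturbed.__getitem__))
    return tokens


lams = mixing_schedule(num_scales=5, start_scale=1, end_scale=3,
                       lam_start=1.0, lam_end=0.0)

# One position, three classes; values are illustrative.
logits = [[1.0, 0.5, -0.2]]
inverse_noise = [[0.3, 0.4, 0.4]]      # recovers class 0 on these logits
fresh_noise = [[-5.0, 10.0, 10.0]]     # would favor class 1 on its own

kept = edit_tokens(logits, inverse_noise, fresh_noise, lam=1.0)
changed = edit_tokens(logits, inverse_noise, fresh_noise, lam=0.0)
```

With the coefficient at 1 the inverse noise alone decides the token, so the source token is kept when the logits are unchanged; with the coefficient at 0 the draw is an ordinary fresh sample.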
6. Relation to alternative inversions, reported behavior, and limitations
LAI is contrasted directly with One-Hot Argmax Inversion (OAI), which sets
$$\tilde{\ell} = Y,$$
where $Y$ is the one-hot encoding of the source labels. OAI guarantees that the source label is the argmax and yields perfect reconstruction, but the implied noise $\hat{g} = Y - \ell$ is not Gumbel distributed; the paper characterizes it as highly biased and degenerate, violating the requirement that inverse noise remain compatible with Gumbel sampling. LAI differs in two respects: it centers its sampling on the original logits $\ell$, and it enforces a controllable margin $\tau$ rather than a hard one-hot collapse.
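The degeneracy of the one-hot alternative is easy to see in code. The sketch below assumes that OAI replaces the perturbed logits with the one-hot label encoding, so the implied noise is the one-hot vector minus the logits; the function name `oai_noise` and the example values are hypothetical.

```python
def oai_noise(logits, label):
    """Implied noise when the perturbed logits are set to the one-hot label vector."""
    one_hot = [1.0 if j == label else 0.0 for j in range(len(logits))]
    return [y - l for y, l in zip(one_hot, logits)]


logits = [2.0, 0.5, -1.0]   # illustrative values
noise_a = oai_noise(logits, label=2)
noise_b = oai_noise(logits, label=2)

# The source label is recovered as the argmax.
recovered = max(range(3), key=lambda j: logits[j] + noise_a[j])
```

The noise is a deterministic function of the logits and the label, with no sampling involved: repeated calls return identical values, and the non-label entries are simply the negated logits. That is the opposite of the randomness expected from Gumbel draws.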
The reported empirical behavior is that LAI yields inverse noises that both perfectly reconstruct the source, with zero structure distance when the inverse noise is reused unmixed, and allow smooth editing when it is mixed with fresh noise, thereby preserving background. In the quantitative comparisons reported in Table 1 of the paper, VARIN + LAI outperforms OAI-based editing and discrete diffusion inversion baselines (DICE), while matching or exceeding continuous diffusion inversion methods in speed and background preservation. The paper also notes several limitations and open directions: LAI depends on choosing $\tau$ per model; extreme $\tau$ values can under-constrain or over-constrain inversion; truncated Gumbel sampling is more expensive than a single argmax; and a possible extension is to generalize the construction to top-$k$ or other structured sampling schemes via the Gumbel-Top-$k$ trick (Dao et al., 2 Sep 2025).
A common misconception is to regard LAI as an exact inverse of discrete sampling. The formulation does not support that interpretation: argmax remains non-invertible, and LAI is explicitly a pseudo-inverse that constructs surrogate perturbed logits $\tilde{\ell}$ consistent with the observed label assignment. Its significance lies not in recovering the original noise realization, but in producing an invertible noise proxy that remains aligned with the model’s pre-softmax geometry and is therefore useful for training-free, prompt-guided editing in next-scale autoregressive image generators.