
Location-aware Argmax Inversion (LAI)

Updated 4 July 2026
  • Location-aware Argmax Inversion (LAI) is a pseudo-inverse method that computes inverse noise for discrete token generation in visual autoregressive models.
  • It employs label and truncated Gumbel sampling to ensure that the perturbed logits recover the original token assignment under a controlled margin (τ).
  • LAI underpins VARIN by enabling precise source reconstruction and smooth prompt-guided image editing without additional model training.

Location-aware Argmax Inversion (LAI) is a pseudo-inverse function for argmax sampling introduced within VARIN, a noise inversion-based method for prompt-guided image editing in visual autoregressive models. It is designed for discrete token generation settings in which a model samples codebook indices by adding Gumbel noise to logits and then taking an argmax. Because argmax is many-to-one, the forward sampling step has no true inverse; LAI addresses this by constructing inverse noises that both recover the original token assignment under argmax and remain compatible with Gumbel-distributed sampling. In the formulation presented for next-scale autoregressive image generators such as VAR and HART, LAI is the mechanism that enables precise source reconstruction and controllable editing without additional training (Dao et al., 2 Sep 2025).

1. Problem setting and design objectives

Autoregressive visual models such as VAR or HART predict discrete token maps $r_1,\dots,r_K$ via a sequence of multinomial draws. At scale $t$, the model produces unnormalized log-probabilities

$$p_t = \log p_\theta(\cdot \mid r_{<t}) \in \mathbb{R}^{\ell \times C},$$

draws independent Gumbel noises $g_t \sim \mathrm{Gumbel}(0,1)$, forms perturbed logits

$$q_t = p_t + g_t,$$

and selects tokens by

$$r_t[i] = \arg\max_{c \in \{1,\ldots,C\}} q_t[i,c].$$
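The forward sampling step above is the standard Gumbel-max trick. A minimal NumPy sketch follows, assuming a plain array of logits with one row per token position; the function name `gumbel_max_sample` and the example probabilities are illustrative only, and classes are indexed from 0 rather than 1.

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    """Sample one token per position via argmax of Gumbel-perturbed logits.

    logits: float array of shape (num_positions, num_classes), playing the role of p_t.
    Returns (tokens, noise) so the noise g_t can be inspected afterwards.
    """
    noise = rng.gumbel(loc=0.0, scale=1.0, size=logits.shape)  # g_t ~ Gumbel(0, 1)
    perturbed = logits + noise                                  # q_t = p_t + g_t
    tokens = perturbed.argmax(axis=-1)                          # r_t[i]
    return tokens, noise

rng = np.random.default_rng(0)
probs = np.array([0.7, 0.2, 0.1])                # hypothetical class probabilities
logits = np.log(np.tile(probs, (20000, 1)))      # 20000 independent positions
tokens, _ = gumbel_max_sample(logits, rng)
freqs = np.bincount(tokens, minlength=3) / tokens.size
print(freqs)  # empirical frequencies should be close to probs
```

Because many different noise arrays lead to the same `tokens`, the noise cannot be recovered from the tokens and logits alone, which is the non-invertibility that motivates LAI.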

The inversion problem arises when one wishes to recover a noise-like latent description from a source image so that the autoregressive generator can later be steered by a new prompt while preserving source content. Since the argmax map is many-to-one, exact recovery of the original Gumbel noise is impossible from $r_t$ and $p_t$ alone. The objective is therefore to compute inverse noises $n_t$ such that $p_t + n_t$ has argmax equal to the original $r_t$, while also retaining compatibility with genuine Gumbel noise. The paper formulates two requirements for such a pseudo-inverse: first, the extracted noises should resemble draws from $\mathrm{Gumbel}(0,1)$; second, they should retain meaningful location-specific bias derived from the original logits $p_t$, so that editing can preserve unedited content (Dao et al., 2 Sep 2025).

2. Formal definition and notation

LAI is described at a single scale $t$, with the scale index omitted for brevity. The basic objects are:

  • $\ell$: number of token positions in the map.
  • $C$: size of the discrete codebook.
  • $p \in \mathbb{R}^{\ell \times C}$: model logit tensor, where $p[i,c]$ is the log-unnormalized probability of class $c$ at position $i$.
  • $r \in \{1,\ldots,C\}^{\ell}$: ground-truth discrete labels extracted by the VAR-VAE encoder from the source image.
  • $\mathrm{onehot}(r) \in \{0,1\}^{\ell \times C}$: one-hot encoding of $r$, with entry $[i,c]$ equal to 1 exactly when $c = r[i]$.
  • $n \in \mathbb{R}^{\ell \times C}$: inverse-Gumbel noise to be computed, satisfying $\arg\max_{c} \big(p[i,c] + n[i,c]\big) = r[i]$ for every position $i$.

A scalar hyperparameter $\tau$ controls the enforced gap between the sampled label logit and all other logits. Larger $\tau$ imposes a stronger bias toward the source label. This parameter is central to the trade-off between preservation and randomness in the inversion procedure (Dao et al., 2 Sep 2025).

3. Construction of the pseudo-inverse

LAI proceeds position-wise in two stages. For a fixed position $i$, it first isolates the logit corresponding to the source label, $p[i, r[i]]$. It then samples a “label Gumbel” by

$$q[i, r[i]] = p[i, r[i]] + g, \qquad g \sim \mathrm{Gumbel}(0,1).$$

For each non-label class $c \neq r[i]$, LAI samples from a truncated Gumbel distribution located at $p[i,c]$ with upper bound $q[i, r[i]] - \tau$: $q[i,c] \sim \mathrm{TruncatedGumbel}\big(p[i,c],\ q[i, r[i]] - \tau\big)$. The truncation enforces

$$q[i,c] \le q[i, r[i]] - \tau \quad \text{for all } c \neq r[i],$$

which guarantees that the argmax remains the source label. The full vector $q[i,\cdot]$ is then assembled by setting the label entry to the label Gumbel and the remaining entries to the truncated samples, and the inverse noise is defined as

$$n[i,c] = q[i,c] - p[i,c].$$

The operator $\mathrm{TruncatedGumbel}(\mu, b)$, with location $\mu$ and upper bound $b$, may be implemented by sampling $g \sim \mathrm{Gumbel}(\mu, 1)$ and returning

$$-\log\!\big(e^{-g} + e^{-b}\big).$$
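One standard way to draw from a Gumbel distribution truncated above at $b$ is to draw an ordinary Gumbel sample $g$ at the same location and return $-\log(e^{-g} + e^{-b})$. The sketch below assumes that construction, which is not necessarily the paper's exact code, and evaluates it with `np.logaddexp` so that neither exponential can overflow or underflow. The function name `truncated_gumbel` is illustrative.

```python
import numpy as np

def truncated_gumbel(loc, bound, rng):
    """Draw Gumbel(loc, 1) samples conditioned to be <= bound (elementwise).

    Computes -log(exp(-g) + exp(-bound)) with g ~ Gumbel(loc, 1),
    using logaddexp for numerical stability.
    """
    loc = np.asarray(loc, dtype=np.float64)
    bound = np.asarray(bound, dtype=np.float64)
    shape = np.broadcast_shapes(loc.shape, bound.shape)
    g = rng.gumbel(loc=loc, scale=1.0, size=shape)
    return -np.logaddexp(-g, -bound)

rng = np.random.default_rng(0)
samples = truncated_gumbel(np.zeros(50000), 0.5, rng)

def gumbel_cdf(x):
    return np.exp(-np.exp(-x))

# Under truncation at 0.5, P(Y <= 0) should equal F(0) / F(0.5).
expected = gumbel_cdf(0.0) / gumbel_cdf(0.5)
empirical = np.mean(samples <= 0.0)
print(samples.max(), empirical, expected)
```

The check compares one point of the empirical distribution with the truncated Gumbel CDF $F(y)/F(b)$; it is a sanity test, not a full goodness-of-fit test.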

By construction, $\arg\max_{c} q[i,c] = r[i]$, and the resulting inverse noise consists of Gumbel or truncated-Gumbel draws centered around the original logits. This is the sense in which LAI satisfies both the distributional objective and the location-specific bias objective (Dao et al., 2 Sep 2025).

4. Algorithmic form, hyperparameters, and computational profile

In vectorized form, LAI takes as input the ground-truth labels $r$, model logits $p$, and truncation gap $\tau$. The algorithm forms the one-hot mask $\mathrm{onehot}(r)$, computes the label logits $p[i, r[i]]$, samples the label Gumbels $q[i, r[i]] = p[i, r[i]] + g$ with $g \sim \mathrm{Gumbel}(0,1)$, samples truncated Gumbels for all classes using truncation level $q[i, r[i]] - \tau$, overwrites the label entry with the label Gumbel, and returns $q$. The inverse noise is then $n = q - p$.
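The vectorized procedure can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the truncated draws use the $-\log(e^{-g} + e^{-b})$ construction, labels are 0-based, and the name `lai` and the test shapes are hypothetical.

```python
import numpy as np

def lai(labels, logits, tau, rng):
    """Location-aware argmax inversion sketch for a single scale.

    labels: int array (L,), source token index per position (0-based here).
    logits: float array (L, C), model logits p.
    tau:    non-negative margin between the label entry and every other entry.
    Returns (q, n): perturbed logits q and inverse noise n = q - p.
    """
    num_pos, num_classes = logits.shape
    mask = np.eye(num_classes, dtype=bool)[labels]        # one-hot mask, shape (L, C)
    label_logit = logits[np.arange(num_pos), labels]      # p[i, r[i]]
    label_q = label_logit + rng.gumbel(size=num_pos)      # label Gumbel
    bound = (label_q - tau)[:, None]                      # truncation level per position
    g = rng.gumbel(loc=logits, scale=1.0)                 # Gumbel located at each logit
    trunc = -np.logaddexp(-g, -bound)                     # truncated so that trunc <= bound
    q = np.where(mask, label_q[:, None], trunc)           # overwrite the label entry
    return q, q - logits

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 5))
labels = rng.integers(0, 5, size=6)
q, n = lai(labels, logits, tau=0.1, rng=rng)
recovered = (logits + n).argmax(axis=-1)
print(np.array_equal(recovered, labels))
```

Truncated samples are drawn for every class and the label column is then overwritten, which keeps the code free of per-position loops at the cost of one discarded draw per position.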

Two implementation details are explicitly noted. First, numerical stability in the truncated Gumbel sampler may require clamping the argument of the logarithm. Second, the choice of $\tau$ trades off strict preservation against increased randomness; empirically, the paper reports a value of $\tau$ that works well in HART. The computational cost is $O(\ell C)$, since LAI samples one label Gumbel and $C-1$ non-label Gumbels per position, and the entire procedure is fully parallelizable on GPU (Dao et al., 2 Sep 2025).

A toy example in the paper illustrates the mechanism for a single position with a small codebook: a Gumbel draw is added to the source token's logit to give the label entry, truncated draws are taken for the two remaining classes, and the values are assembled into the perturbed logit vector, from which the inverse noise follows by subtracting the logits. The argmax remains the source class, while the inverse noise remains a plausible Gumbel-like perturbation.

5. Role within VARIN for text-based image editing

LAI is the inversion component of VARIN, where it is applied during the source-image inversion phase. Given a source image, the VAR-VAE encoder produces token maps $r_1,\dots,r_K$. For each scale $t$, the autoregressive model produces logits $p_t$, and $\mathrm{LAI}(r_t, p_t, \tau)$ generates $q_t$, from which the inverse noise

$$n_t = q_t - p_t$$

is obtained. Collecting $\{n_t\}_{t=1}^{K}$ yields a latent “noise fingerprint” that perfectly reconstructs the source when reused with $p_t$.

During editing, for each scale $t$, fresh random Gumbel noise $g_t$ is interpolated with the inverse noise $n_t$, and the mixture is added to the logits produced under the edited prompt; a scale-dependent mixing coefficient controls the blend. The reported schedule usually fixes the coefficient at a start scale and then decreases it linearly to a final value by a later scale. Taking the argmax produces the edited token map, and the VAR-VAE decoder reconstructs the edited image. In this formulation, LAI-extracted inverse noises keep sampling close to the source in unedited regions while permitting controlled deviations where the edited prompt exerts pressure (Dao et al., 2 Sep 2025).
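The editing step at one scale can be sketched as below. Several choices here are assumptions made only for illustration: the blend is taken to be linear with a coefficient `lam` weighting the inverse noise, the edited-prompt logits are random stand-ins, and the inverse noise is the noise of a forward Gumbel-max draw rather than actual LAI output. The name `edit_tokens` is hypothetical.

```python
import numpy as np

def edit_tokens(edit_logits, inverse_noise, lam, rng):
    """Blend stored inverse noise with fresh Gumbel noise, then take the argmax.

    lam = 1 reuses the inverse noise unchanged; lam = 0 is ordinary sampling.
    The linear blend is an assumption of this sketch.
    """
    fresh = rng.gumbel(size=edit_logits.shape)
    mixed = lam * inverse_noise + (1.0 - lam) * fresh
    return (edit_logits + mixed).argmax(axis=-1)

rng = np.random.default_rng(0)
source_logits = rng.normal(size=(8, 16))

# Stand-in for LAI output: the noise of a forward draw is itself a valid inverse noise.
inverse_noise = rng.gumbel(size=source_logits.shape)
source_tokens = (source_logits + inverse_noise).argmax(axis=-1)

# With lam = 1 and unchanged logits, the source tokens are recovered exactly.
reconstructed = edit_tokens(source_logits, inverse_noise, lam=1.0, rng=rng)

# Hypothetical edited-prompt logits: the source logits plus a perturbation.
edit_logits = source_logits + rng.normal(scale=2.0, size=source_logits.shape)
edited = edit_tokens(edit_logits, inverse_noise, lam=0.5, rng=rng)
print(np.array_equal(reconstructed, source_tokens), np.mean(edited == source_tokens))
```

Positions where the edited logits stay close to the source logits tend to keep their source token, while positions with a strong logit change are free to switch.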

6. Relation to alternative inversions, reported behavior, and limitations

LAI is contrasted directly with One-Hot Argmax Inversion (OAI), which sets

$$q_t = \mathrm{onehot}(r_t).$$

OAI guarantees that the source label is the argmax and recovers perfect reconstruction, but the implied noise $n_t = \mathrm{onehot}(r_t) - p_t$ is not Gumbel distributed; the paper characterizes it as highly biased and degenerate, violating the requirement that inverse noise remain compatible with Gumbel sampling. LAI differs in two respects: it centers its sampling on the original logits $p_t$, and it enforces a controllable margin $\tau$ rather than a hard one-hot collapse.
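For contrast, a one-hot inversion can be sketched in a few lines. The sketch assumes that OAI sets the perturbed logits to the one-hot code of the labels, so the implied noise is the one-hot code minus the logits; the name `oai_noise` and the test shapes are illustrative.

```python
import numpy as np

def oai_noise(labels, logits):
    """One-hot inversion sketch: perturbed logits are the one-hot code, so n = onehot - p."""
    onehot = np.eye(logits.shape[1])[labels]
    return onehot - logits

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6))
labels = rng.integers(0, 6, size=4)

n_oai = oai_noise(labels, logits)
recovered = (logits + n_oai).argmax(axis=-1)
same_every_call = np.array_equal(n_oai, oai_noise(labels, logits))
print(np.array_equal(recovered, labels), same_every_call)
```

The noise reconstructs the labels, but it is a deterministic function of the logits and labels with no random component, which illustrates why it cannot pass for a Gumbel draw.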

The reported empirical behavior is that LAI yields inverse noises that both perfectly reconstruct the source, with zero structure distance when the inverse noise is reused without mixing, and allow smooth editing when it is blended with fresh Gumbel noise, thereby preserving background. In quantitative comparisons described for Table 1, VARIN + LAI outperforms OAI-based editing and discrete diffusion inversion baselines (DICE), while matching or exceeding continuous diffusion inversion methods in speed and background preservation. The paper also notes several limitations and open directions: LAI depends on choosing $\tau$ per model; extreme $\tau$ values can under-constrain or over-constrain inversion; truncated Gumbel sampling is more expensive than a single argmax; and a possible extension is to generalize the construction to top-$k$ or other structured sampling schemes via the Gumbel-Top-$k$ trick (Dao et al., 2 Sep 2025).

A common misconception is to regard LAI as an exact inverse of discrete sampling. The formulation does not support that interpretation: argmax remains non-invertible, and LAI is explicitly a pseudo-inverse that constructs surrogate perturbed logits $p_t + n_t$ consistent with the observed label assignment. Its significance lies not in recovering the original noise realization, but in producing an invertible noise proxy that remains aligned with the model’s pre-softmax geometry and is therefore useful for training-free, prompt-guided editing in next-scale autoregressive image generators.
