Latent Scale Rejection Sampling (LSRS)
- Latent Scale Rejection Sampling (LSRS) is a method that uses test-time rejection sampling in latent or hierarchical spaces to improve the quality and alignment of generative model outputs.
- It employs a lightweight scoring network to rank and select candidate latent tokens based on global structure and class consistency, reducing error accumulation.
- Empirical results show that LSRS significantly lowers FID scores and enhances image fidelity in both VAR and GAN models with only a marginal increase in computational cost.
Latent Scale Rejection Sampling (LSRS) is a family of test-time refinement methods designed to improve the quality and distributional alignment of samples from modern deep generative models. Two independent lines of LSRS have been developed: one targets Visual Autoregressive (VAR) models for hierarchical image generation (Zheng et al., 3 Dec 2025), while another addresses deficiencies in GAN sampling by leveraging importance-weight-based latent rejection (Issenhuth et al., 2021). Both instantiations of LSRS apply rigorous statistical selection in latent or hierarchical spaces at generation time, yielding samples with superior structure or higher fidelity while requiring minimal additional computation.
1. Hierarchical Visual Autoregressive Generation and LSRS for VAR
Visual Autoregressive (VAR) models decompose images into a sequence of latent “scales” $r_1, \dots, r_K$, where each scale $r_s$ is a 2D token map at increasing resolution. The likelihood factorizes hierarchically as
$$p(r_1, \dots, r_K) = \prod_{s=1}^{K} p(r_s \mid r_{<s}),$$
and at inference, tokens in each $r_s$ are sampled independently and in parallel from $p(r_s \mid r_{<s})$. This factorization neglects intra-scale spatial dependencies, which is especially problematic on early (low-resolution) scales, where errors in global structure are propagated and compounded through subsequent refinement. Empirically, randomizing the token maps in early stages of VAR generation destroys object and scene coherence (Zheng et al., 3 Dec 2025).
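The parallel per-scale sampling step can be sketched as follows; the toy logits, scale resolutions, and vocabulary size are illustrative stand-ins for a real VAR transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scale(logits):
    """Sample every token of one scale independently and in parallel.

    logits: (H, W, V) array of per-token categorical logits over a
    vocabulary of V codebook entries, conditioned on the prefix r_<s.
    Returns an (H, W) integer token map r_s.
    """
    # Gumbel-max trick: adding Gumbel noise and taking argmax over the
    # vocabulary axis is an exact categorical draw per spatial position.
    g = rng.gumbel(size=logits.shape)
    return (logits + g).argmax(axis=-1)

# Toy hierarchy: token maps of increasing resolution, sampled scale by
# scale. `fake_logits` stands in for the VAR transformer's output.
prefix = []
for side in (1, 2, 4):
    fake_logits = rng.normal(size=(side, side, 16))
    prefix.append(sample_scale(fake_logits))   # r_<s+1> = r_<s> ∪ {r_s}
```

Because each position is drawn independently given the prefix, nothing in this step enforces intra-scale consistency, which is the gap LSRS targets.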
LSRS for VAR introduces a progressive, test-time rejection sampling mechanism in the latent scale domain. At every selected scale $s$, $m_s$ candidate token maps $\{r_s^{(i)}\}_{i=1}^{m_s}$ are sampled in parallel. Each candidate is fused deterministically with the prefix $r_{<s}$ by a multiscale VQ-VAE upsampler $F$, producing feature maps $e_s^{(i)} = F(r_{<s}, r_s^{(i)})$. A lightweight scoring network $S$ computes a scalar score for each triple $(c, e_s^{(i)}, s)$, assessing both global structure and compatibility with the target class label $c$ (if applicable). The candidate with the highest score is selected to advance the generative chain. This best-of-$m_s$ selection at each scale, particularly at the earliest informative levels, has been shown to drastically reduce autoregressive error accumulation and produce sharper, more coherent images at minimal computational cost (Zheng et al., 3 Dec 2025).
2. Algorithmic Structure of LSRS for VAR
For VAR-based image generation, the formal LSRS procedure is as follows:
- For each scale $s = 1, \dots, K$, compute the conditional distribution $p(r_s \mid r_{<s})$ with the pretrained VAR model.
- Sample $m_s$ latent candidate maps $r_s^{(1)}, \dots, r_s^{(m_s)} \sim p(r_s \mid r_{<s})$.
- For each candidate, construct a feature map $e_s^{(i)} = F(r_{<s}, r_s^{(i)})$.
- Score each candidate as $\text{Score}_i = S(c, e_s^{(i)}, s)$; select $r_s = \arg\max_i \text{Score}_i$.
- Build the prefix iteratively by appending the chosen $r_s$ to $r_{<s}$, giving $r_{<s+1}$.
Pseudocode:
```
Input:  class c, pretrained VAR model, scoring net S, scale count K,
        sample counts {m_1, …, m_K}.
Initialize: empty prefix r_<1> = ∅.
For s = 1 to K do
  1. Compute p(r_s | r_<s) via VAR.
  2. Draw candidates {r_s^(i)}_{i=1}^{m_s} ~ p(r_s | r_<s).
  3. For each i, compute e_s^(i) = F(r_<s, r_s^(i)).
  4. Score_i ← S(c, e_s^(i), s).
  5. Select r_s ← argmax_i Score_i.
  6. Append r_s to prefix: r_<s+1> ← r_<s> ∪ {r_s}.
End for
Return complete token maps {r_1, …, r_K}.
```
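The same loop as executable Python, with hypothetical stand-ins (`sample_candidates`, `fuse`, `score`) in place of the pretrained VAR model, the VQ-VAE upsampler F, and the scoring net S:

```python
import numpy as np

rng = np.random.default_rng(7)
VOCAB = 16

def sample_candidates(prefix, side, m):
    """Stand-in for drawing m token maps from p(r_s | r_<s)."""
    return rng.integers(0, VOCAB, size=(m, side, side))

def fuse(prefix, candidate):
    """Stand-in for the VQ-VAE upsampler F(r_<s, r_s^(i))."""
    return candidate.astype(float)   # a real F would fuse prefix features

def score(class_id, feature, s):
    """Stand-in for the scoring net S(c, e_s^(i), s).
    Toy heuristic: prefer token maps whose values sit near the class id."""
    return -float(np.abs(feature - class_id).mean())

def lsrs_generate(class_id, sides, m_per_scale):
    prefix = []
    for s, (side, m) in enumerate(zip(sides, m_per_scale), start=1):
        cands = sample_candidates(prefix, side, m)
        feats = [fuse(prefix, c) for c in cands]
        scores = [score(class_id, e, s) for e in feats]
        best = int(np.argmax(scores))   # greedy best-of-m selection
        prefix.append(cands[best])      # r_<s+1> = r_<s> ∪ {r_s}
    return prefix

maps = lsrs_generate(class_id=3, sides=(1, 2, 4), m_per_scale=(8, 8, 1))
```

Setting `m_per_scale` to 1 at later scales mirrors the paper's emphasis on spending the candidate budget at the earliest informative levels.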
An acceptance probability-based variant can be defined by drawing the winning candidate from a softmax over scores,
$$P\big(\text{select } r_s^{(i)}\big) = \frac{\exp\!\big(S(c, e_s^{(i)}, s)/\tau\big)}{\sum_{j=1}^{m_s} \exp\!\big(S(c, e_s^{(j)}, s)/\tau\big)},$$
with temperature $\tau$, but greedy top-1 is generally sufficient (Zheng et al., 3 Dec 2025).
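A sketch of such stochastic, score-weighted selection; the temperature value is an assumed knob, not specified in the source:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_select(scores, tau=1.0):
    """Draw one candidate index with probability softmax(scores / tau).

    tau -> 0 recovers greedy top-1 selection; larger tau keeps more
    diversity among candidates at some cost in expected score.
    """
    s = np.asarray(scores, dtype=float) / tau
    p = np.exp(s - s.max())          # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

scores = [0.1, 2.5, 0.3, 2.4]
# With a tiny temperature the draw is effectively greedy (index of 2.5).
greedy_idx = stochastic_select(scores, tau=1e-6)
```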
3. Scoring Model: Architecture and Optimization
The scoring model $S$ is a compact convolutional neural network, ingesting the fused feature $e_s^{(i)}$, a class embedding of $c$, and a scale embedding of $s$. The backbone consists of several (4–6) residual blocks, each with two 3×3 conv–LeakyReLU–LayerNorm layers and a 1×1 conv skip connection. The resulting visual feature is pooled to a fixed spatial size and flattened. The concatenated vector of visual, class, and scale features passes through a two-layer MLP to output the scalar score.
The model is supervised on pairs of feature maps and binary labels $y$, where $y = 1$ for “real” (VQ-VAE codebook) maps and $y = 0$ for “generated” VAR samples. Binary cross-entropy and pairwise ranking losses are both supported; the pairwise approach slightly improves FID on held-out scales. Optimization uses Adam with a batch size of 128 and a cosine decay learning-rate schedule (Zheng et al., 3 Dec 2025).
4. Computational Trade-Offs and Efficiency
Let $T_{\text{VAR}}$ denote vanilla VAR generation time. The additional cost of LSRS when applied on a set of scales $\mathcal{S}$ with $m_s$ candidates per scale is
$$\Delta T = \sum_{s \in \mathcal{S}} m_s \big(t_F(s) + t_S(s)\big),$$
where the per-candidate fusion and scoring cost $t_F(s) + t_S(s)$ is negligible compared to $T_{\text{VAR}}$, and moderate empirical choices of $m_s$ on early scales keep the overall overhead minor. On ImageNet 256×256 with VAR-d30, vanilla FID is 1.95 at 1.0× time; a light LSRS configuration achieves FID 1.78 at 1.01× time, while a heavier candidate budget yields FID 1.66 at 1.15× time. Similar trade-offs hold for other VAR and FlexVAR backbones (Zheng et al., 3 Dec 2025).
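A toy instance of this accounting; all candidate counts and per-candidate timings below are hypothetical illustration values, not measurements from the paper:

```python
# Toy overhead accounting for LSRS on top of vanilla VAR generation.
T_VAR = 1.0                        # vanilla generation time (normalized)
m = {2: 64, 3: 64}                 # hypothetical candidates at two early scales
t_fuse_score = {2: 1e-4, 3: 2e-4}  # hypothetical per-candidate fuse+score cost

delta_T = sum(m[s] * t_fuse_score[s] for s in m)
relative = (T_VAR + delta_T) / T_VAR
print(f"overhead: {delta_T:.4f} -> {relative:.4f}x total time")
```

Because early scales have tiny token maps, even large $m_s$ there adds little wall-clock time, which is why the reported 1.01× configurations are possible.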
5. Empirical Results and Ablation Studies
Key quantitative findings for class-conditional ImageNet 256×256 generation:
| Model | FID ↓ | Time × |
|---|---|---|
| VAR-d30 | 1.95 | 1.00 |
| +LSRS (small candidate budget) | 1.78 | 1.01 |
| +LSRS (large candidate budget) | 1.66 | 1.15 |
Gains plateau once the candidate count reaches roughly $128$; very large budgets ($m_s \ge 256$) may decrease diversity. LSRS is most effective when applied from the earliest informative scales onward; restricting it to only the first scale reduces diversity, while deferring it to later scales deteriorates FID due to uncorrected structural errors (Zheng et al., 3 Dec 2025). Qualitatively, LSRS corrects structural failures (e.g., malformed objects) that manifest in early scales and sharpens local texture even where baseline VAR outputs are reasonable.
6. LSRS in GANs: Latent Importance Reweighting
A distinct LSRS instance is described as “latent rejection sampling” in GANs (Issenhuth et al., 2021). For a pre-trained generator $G$, an MLP $w$ learns to reweight the prior $p(z)$ by importance. After adversarial training of $w$ to match the pushforward of the reweighted prior under $G$ to the empirical data distribution in Wasserstein-1 distance, rejection sampling draws $z \sim p(z)$ and accepts it with probability proportional to $w(z)$. Accepted outputs $G(z)$ are then more likely to match true data, and the method shrinks both FID and earth mover's distance on synthetic and high-dimensional tasks. This approach operates entirely in latent space and is computationally cheaper than post-generator reranking or score-based sampling methods (Issenhuth et al., 2021).
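A minimal sketch of this accept/reject loop in latent space; the Gaussian prior, the toy weight function standing in for the trained MLP, and the bound `w_max` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def w(z):
    """Toy learned importance weight: upweights latents near the origin.
    Stands in for the trained MLP reweighter; illustrative only."""
    return np.exp(-0.5 * np.sum(z**2))

def latent_rejection_sample(dim=2, w_max=1.0, max_tries=10_000):
    """Draw z ~ N(0, I) and accept with probability w(z) / w_max,
    where w_max bounds w so the acceptance probability is valid."""
    for _ in range(max_tries):
        z = rng.normal(size=dim)
        if rng.random() < w(z) / w_max:
            return z          # accepted latent; would be fed to G(z)
    raise RuntimeError("no sample accepted")

zs = np.stack([latent_rejection_sample() for _ in range(500)])
# Accepted latents concentrate in high-weight regions (the origin here):
# their mean squared norm drops below the prior's value of 2.
print("mean squared norm:", np.mean(np.sum(zs**2, axis=1)))
```

No generator forward passes are wasted on rejected draws, which is the source of the method's cheapness relative to post-generator reranking.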
7. Limitations and Future Directions
LSRS for VAR models is fundamentally limited by the discriminative accuracy of the scoring network, especially for large backbone models where the real-vs-generated gap narrows. The method's aggressiveness (greedy top-1 selection) risks diminished sample diversity, motivating future work on temperature-based or stochastic rank-based selection. Adaptive allocation of the candidate budget to scales or classes with higher generation difficulty is another plausible extension, and universal, unconditional, or text-to-image scoring may broaden the method's generality. In the GAN setting, the reweighting network's expressiveness is bounded by soft-clipping to prevent degenerate mode collapse; in both lines of work, stochastic selection and diversity-aware modifications remain open research problems (Zheng et al., 3 Dec 2025, Issenhuth et al., 2021).