
Latent Scale Rejection Sampling (LSRS)

Updated 10 December 2025
  • Latent Scale Rejection Sampling (LSRS) is a method that uses test-time rejection sampling in latent or hierarchical spaces to improve the quality and alignment of generative model outputs.
  • It employs a lightweight scoring network to rank and select candidate latent tokens based on global structure and class consistency, reducing error accumulation.
  • Empirical results show that LSRS significantly lowers FID scores and enhances image fidelity in both VAR and GAN models with only a marginal increase in computational cost.

Latent Scale Rejection Sampling (LSRS) is a family of test-time refinement methods designed to improve the quality and distributional alignment of samples from modern deep generative models. Two independent lines of LSRS have been developed: one targets Visual Autoregressive (VAR) models for hierarchical image generation (Zheng et al., 3 Dec 2025), while another addresses deficiencies in GAN sampling by leveraging importance-weight-based latent rejection (Issenhuth et al., 2021). Both instantiations apply rigorous statistical selection in latent or hierarchical spaces at generation time, yielding samples with superior structure or higher fidelity at minimal additional computational cost.

1. Hierarchical Visual Autoregressive Generation and LSRS for VAR

Visual Autoregressive (VAR) models decompose images into a sequence of $K$ latent “scales” $(r_1,\dots,r_K)$, where each scale is a 2D token map at increasing resolution. The likelihood factorizes hierarchically as

$$p(r_1,\dots,r_K) = \prod_{k=1}^{K} p(r_k \mid r_1,\dots,r_{k-1}),$$

and at inference, tokens in each $r_k$ are sampled independently and in parallel from $p(r_k \mid r_{<k})$. This factorization neglects intra-scale spatial dependencies, which is especially problematic on early (low-resolution) scales, where errors in global structure propagate and compound through subsequent refinement. Empirically, randomizing $r_k$ in early stages of VAR generation destroys object and scene coherence (Zheng et al., 3 Dec 2025).
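To make the independence assumption concrete, here is a toy sketch of parallel intra-scale sampling. The interface is hypothetical: a real VAR model predicts the per-position distributions from the prefix $r_{<k}$, whereas here they are supplied directly.

```python
import random

def sample_scale(prob_maps):
    """Sample every token of one scale r_k independently.

    prob_maps: an H x W grid of per-position probability vectors over the
    codebook, i.e. p(r_k | r_{<k}) factored per token. Because each token
    is drawn separately, intra-scale spatial dependencies are ignored --
    the failure mode that LSRS targets at early scales.
    """
    return [
        [random.choices(range(len(p)), weights=p)[0] for p in row]
        for row in prob_maps
    ]

# Toy 2x2 scale over a 3-entry codebook.
random.seed(0)
tokens = sample_scale([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
                       [[0.3, 0.3, 0.4], [0.25, 0.25, 0.5]]])
```

In practice all positions are sampled in one batched call; the nested loop here only makes the per-token independence explicit.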

LSRS for VAR introduces a progressive, test-time rejection sampling mechanism in the latent scale domain. At every selected scale $s$, $m_s$ candidate token maps $r_s^{(i)}$ are sampled in parallel. Each candidate is fused deterministically with the prefix by a multiscale VQ-VAE upsampler $F$, producing feature maps $e_s^{(i)}$. A lightweight scoring network $S$ computes a scalar score for each $(c, e_s^{(i)}, s)$ triple, assessing both global structure and compatibility with the target class label (if applicable). The candidate with the highest score is selected to advance the generative chain. This best-of-$m_s$ selection at each scale, particularly at the earliest informative levels, has been shown to drastically reduce autoregressive error accumulation and produce sharper, more coherent images at minimal computational cost (Zheng et al., 3 Dec 2025).

2. Algorithmic Structure of LSRS for VAR

For VAR-based image generation, the formal LSRS procedure is as follows:

  • For each scale $s=1,\dots,K$, compute $p(r_s \mid r_{<s})$.
  • Sample $m_s$ latent candidate maps $r_s^{(i)} \sim p(r_s \mid r_{<s})$.
  • For each candidate, construct a feature map $e_s^{(i)} = F(r_1,\dots,r_{s-1}, r_s^{(i)})$.
  • Score each candidate as $S(c, e_s^{(i)}, s)$; select $\arg\max_i S$.
  • Build the prefix iteratively by appending the chosen $r_s$ to $r_{<s}$.

Pseudocode:

Input: class c, pretrained VAR model, scoring net S, scale count K,
       sample counts {m_1,…,m_K}.
Initialize: empty prefix r_<1>=∅.
For s = 1 to K do
  1. Compute p(r_s | r_<s) via VAR.
  2. Draw candidates {r_s^(i)}_{i=1}^{m_s} ~ p(r_s | r_<s).
  3. For each i compute e_s^(i)=F(r_<s, r_s^(i)).
  4. Score_i ← S(c, e_s^(i), s).
  5. Select r_s ← argmax_i Score_i.
  6. Append r_s to prefix: r_<s+1>←r_<s>∪{r_s}.
End for
Return complete token maps {r_1,…,r_K}.
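The pseudocode above can be instantiated as a minimal Python sketch. The `propose`, `fuse`, and `score` callables are hypothetical stand-ins for the pretrained VAR model, the VQ-VAE upsampler $F$, and the scoring network $S$; the toy instantiation below uses integers as “token maps” so the loop is runnable end to end.

```python
import random

def lsrs_generate(K, m, propose, fuse, score, cls):
    """Greedy best-of-m latent scale rejection sampling (sketch).

    propose(prefix, s, m) -> m candidate token maps for scale s
    fuse(prefix, cand)    -> fused feature e_s (stand-in for upsampler F)
    score(cls, e, s)      -> scalar score (stand-in for scoring net S)
    """
    prefix = []
    for s in range(1, K + 1):
        candidates = propose(prefix, s, m)
        # Best-of-m selection: keep the highest-scoring candidate.
        best = max(candidates, key=lambda c: score(cls, fuse(prefix, c), s))
        prefix.append(best)
    return prefix

# Toy instantiation: candidates are integers and the score is the value
# itself, so best-of-m simply keeps the largest draw at each scale.
random.seed(0)
out = lsrs_generate(
    K=3, m=8,
    propose=lambda prefix, s, m: [random.randrange(100) for _ in range(m)],
    fuse=lambda prefix, cand: cand,
    score=lambda cls, e, s: e,
    cls=0,
)
```

Swapping the lambdas for real model calls recovers the procedure in the pseudocode, including restricting selection to scales $s \geq ST$ by proposing a single candidate on earlier scales.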

An acceptance probability-based variant can be defined by

$$A(r_s^{(i)}) = \frac{\exp(\alpha S(c, e_s^{(i)}, s))}{\max_j \exp(\alpha S(c, e_s^{(j)}, s))},$$

but greedy top-1 is generally sufficient (Zheng et al., 3 Dec 2025).
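As a sketch, the acceptance probabilities of this variant reduce to a max-normalized exponential of the scores; the max-subtraction trick below is an implementation detail for numerical stability, not part of the formula.

```python
import math

def acceptance_probs(scores, alpha=1.0):
    """A(r_s^(i)) = exp(alpha * S_i) / max_j exp(alpha * S_j).

    Subtracting the max score before exponentiating is mathematically
    equivalent and avoids overflow. The top-scoring candidate is always
    accepted (probability 1); lower-scoring candidates are accepted with
    exponentially discounted probability, softening greedy top-1 selection.
    """
    top = max(scores)
    return [math.exp(alpha * (s - top)) for s in scores]

probs = acceptance_probs([1.0, 2.0, 3.0])
```

Raising $\alpha$ sharpens the distribution toward greedy top-1; $\alpha \to 0$ accepts all candidates with probability approaching 1.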

3. Scoring Model: Architecture and Optimization

The scoring model $S$ is a compact convolutional neural network, ingesting the fused feature $e_s$, a class embedding $h_c \in \mathbb{R}^{128}$, and a scale embedding $h_s \in \mathbb{R}^{128}$. The backbone consists of several (4–6) residual blocks (each with two 3×3 conv–LeakyReLU–LayerNorm layers and one 1×1 conv skip). The visual feature is pooled to $2 \times 2 \times 256$ and flattened. The concatenated $[\text{visual}, h_c, h_s]$ vector passes through a two-layer MLP to output $S(c, e_s, s)$.

The model is supervised on tuples $(c, e_s, y)$, where $y=1$ for “real” (VQ-VAE codebook) maps and $y=0$ for “generated” VAR samples. Binary cross-entropy and pairwise ranking losses are both supported; the pairwise approach slightly improves FID on held-out scales. Optimization uses Adam (initial LR $3\times10^{-4}$), batch size 128, and a cosine decay schedule (Zheng et al., 3 Dec 2025).
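As an illustration, a logistic pairwise ranking objective is one plausible form of the ranking loss; the paper states only that BCE and pairwise ranking are both supported, so the margin and one-to-one pairing below are assumptions.

```python
import math

def pairwise_ranking_loss(scores_real, scores_fake, margin=0.0):
    """Logistic pairwise ranking loss over (real, generated) score pairs.

    Each term is -log sigmoid(s_real - s_fake - margin), computed stably
    as log1p(exp(-(s_real - s_fake - margin))); minimizing it pushes the
    scorer S to rank real codebook maps above generated ones. The margin
    and the one-to-one pairing scheme are illustrative assumptions.
    """
    losses = [
        math.log1p(math.exp(-(r - f - margin)))
        for r, f in zip(scores_real, scores_fake)
    ]
    return sum(losses) / len(losses)
```

When real and fake scores are equal the per-pair loss is $\log 2$; it decays toward zero as real scores dominate and grows roughly linearly when the ranking is inverted.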

4. Computational Trade-Offs and Efficiency

Let $T_\text{VAR}$ denote vanilla VAR generation time. The additional cost of LSRS when applied on scales $s=ST,\dots,K$ with $m_s = M$ candidates is

$$T_\text{LSRS} = T_\text{VAR} + \sum_{s=ST}^{K} \left[ M \cdot T_{\text{gen},s} + M \cdot T_{\text{score},s} \right],$$

where $T_{\text{score},s}$ is negligible compared to $T_{\text{gen},s}$, and empirical choices such as $M=4$–$128$ keep the overall overhead minor. On ImageNet 256×256 with VAR-d30, vanilla FID is 1.95 at 1.0× time; LSRS with $M=4$, $ST=2$ achieves FID 1.78 at 1.01× time, while $M=128$ yields FID 1.66 at 1.15× time. Similar trade-offs hold for other VAR and FlexVAR backbones (Zheng et al., 3 Dec 2025).
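The timing decomposition above is straightforward to compute directly; the sketch below uses made-up per-scale costs for illustration, not figures from the paper.

```python
def lsrs_time(t_var, gen_costs, score_costs, M, ST):
    """T_LSRS = T_VAR + sum_{s=ST}^{K} [M*T_gen,s + M*T_score,s].

    gen_costs and score_costs list the per-scale costs T_gen,s and
    T_score,s for s = 1..K (1-indexed); only scales s >= ST incur the
    extra M generation and scoring passes.
    """
    extra = sum(
        M * (g + c)
        for s, (g, c) in enumerate(zip(gen_costs, score_costs), start=1)
        if s >= ST
    )
    return t_var + extra

# Hypothetical costs: cheap early scales, pricier late ones, tiny scorer.
total = lsrs_time(t_var=1.0,
                  gen_costs=[0.01, 0.02, 0.05, 0.12],
                  score_costs=[0.001] * 4,
                  M=4, ST=2)
```

Because early scales are small, applying LSRS from $ST=2$ with modest $M$ adds only a few percent of wall time, consistent with the reported 1.01× figure.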

5. Empirical Results and Ablation Studies

Key quantitative findings for class-conditional ImageNet 256×256 generation:

| Model | FID ↓ | Time × |
|---|---|---|
| VAR-d30 | 1.95 | 1.00 |
| + LSRS ($M=4$) | 1.78 | 1.01 |
| + LSRS ($M=128$) | 1.66 | 1.15 |

Gains plateau beyond $M=32$–$128$; too large an $M$ (≥256) may reduce diversity. LSRS is most effective when applied from scale $ST=2$ onward; using it only at $s=1$ reduces diversity, while deferring it further deteriorates FID due to uncorrected structural errors (Zheng et al., 3 Dec 2025). Qualitatively, LSRS corrects structural failures (e.g., malformed objects) that manifest at early scales and enhances local texture sharpness even where baseline VAR outputs are reasonable.

6. LSRS in GANs: Latent Importance Reweighting

A distinct LSRS instance is described as “latent rejection sampling” in GANs (Issenhuth et al., 2021). For a pre-trained generator $G_\theta: \mathbb{R}^d \to \mathbb{R}^D$, an MLP $w^\phi: \mathbb{R}^d \to [0, m]$ learns to reweight the prior $\gamma$ for importance sampling. After adversarial training of $w^\phi$ to match the pushforward $G_\theta \sharp \gamma^\phi$ to the empirical data distribution in Wasserstein-1 distance, rejection sampling draws $z \sim \gamma$ and accepts it with probability $w^\phi(z)/m$. Outputs $x = G_\theta(z)$ are then more likely to match the true data distribution, and the method shrinks both sample FID and earth mover's distance on synthetic and high-dimensional tasks. This approach operates entirely in the latent space and is computationally cheaper than post-generator reranking or score-based sampling methods (Issenhuth et al., 2021).
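The accept/reject loop itself is simple; here is a runnable 1-D sketch with toy stand-ins for $G_\theta$, $w^\phi$, and $\gamma$ (identity generator, a linear reweighting function, and a uniform prior, all assumptions for illustration).

```python
import random

def latent_rejection_sample(generator, w, m, prior, rng=random):
    """Draw z ~ prior, accept with probability w(z)/m, return G(z).

    w must be bounded in [0, m] so w(z)/m is a valid probability;
    generator, w, and prior stand in for the pretrained GAN generator
    G_theta, the learned reweighting MLP w^phi, and the prior gamma.
    """
    while True:
        z = prior()
        if rng.random() < w(z) / m:
            return generator(z)

# Toy demo: prior is Uniform(0,1); w(z) = 2z (bounded by m = 2) biases
# accepted latents toward larger z; the "generator" is the identity map,
# so accepted samples follow the density proportional to z (mean 2/3).
random.seed(0)
samples = [
    latent_rejection_sample(lambda z: z, lambda z: 2.0 * z, 2.0,
                            lambda: random.random())
    for _ in range(2000)
]
```

The loop's expected number of prior draws per accepted sample is $m / \mathbb{E}_\gamma[w(z)]$, which is why the bound $m$ is kept small via soft-clipping during training.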

7. Limitations and Future Directions

LSRS for VAR models is fundamentally limited by the discriminative accuracy of the scoring network, especially for large backbone models where the real-vs-generated gap narrows. The method’s aggressiveness (greedy top-1 selection) introduces a risk of diminished sample diversity, suggesting future work on temperature-based or stochastic rank-based selection. Adaptive allocation of sample count MM to scales or classes with higher generation difficulty is plausible. Universal, unconditional, or text-to-image scoring may extend the method’s generality. In the GAN setting, the reweighting network’s expressiveness is bounded by soft-clipping to prevent degenerate mode collapse, and in both lines, stochastic selection and diversity-aware modifications remain open research problems (Zheng et al., 3 Dec 2025, Issenhuth et al., 2021).
