Latent Scale Rejection Sampling (LSRS)
- Latent Scale Rejection Sampling (LSRS) is a method that uses test-time rejection sampling in latent or hierarchical spaces to improve the quality and alignment of generative model outputs.
- It employs a lightweight scoring network to rank and select candidate latent tokens based on global structure and class consistency, reducing error accumulation.
- Empirical results show that LSRS significantly lowers FID scores and enhances image fidelity in both VAR and GAN models with only a marginal increase in computational cost.
Latent Scale Rejection Sampling (LSRS) is a family of test-time refinement methods designed to improve the quality and distributional alignment of samples from modern deep generative models. Two independent lines of LSRS have been developed: one targets Visual Autoregressive (VAR) models for hierarchical image generation (Zheng et al., 3 Dec 2025), while another addresses deficiencies in GAN sampling by leveraging importance-weight-based latent rejection (Issenhuth et al., 2021). Both instantiations of LSRS apply rigorous statistical selection in latent or hierarchical spaces at generation time, yielding samples with superior structure or higher fidelity while requiring minimal additional computation.
1. Hierarchical Visual Autoregressive Generation and LSRS for VAR
Visual Autoregressive (VAR) models decompose images into a sequence of latent “scales” $r_1, \dots, r_K$, where each scale $r_s$ is a 2D token map at increasing resolution. The likelihood factorizes hierarchically as
$$p(r_1, \dots, r_K) = \prod_{s=1}^{K} p(r_s \mid r_{<s}),$$
and at inference, tokens in each $r_s$ are sampled independently and in parallel from $p(r_s \mid r_{<s})$. This factorization neglects intra-scale spatial dependencies, which is especially problematic on early (low-resolution) scales, where errors in global structure are propagated and compounded through subsequent refinement. Empirically, randomizing the token maps in early stages of VAR generation destroys object and scene coherence (Zheng et al., 3 Dec 2025).
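The parallel per-scale sampling step can be sketched as follows; the toy logits, scale resolutions, and vocabulary size are illustrative stand-ins for a real VAR transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scale(logits):
    """Sample every token of one scale independently and in parallel.

    logits: (H, W, V) array of per-token categorical logits over a
    vocabulary of V codebook entries, conditioned on the prefix r_<s.
    Returns an (H, W) integer token map r_s.
    """
    # Gumbel-max trick: adding Gumbel noise and taking argmax over the
    # vocabulary axis is an exact categorical draw per spatial position.
    g = rng.gumbel(size=logits.shape)
    return (logits + g).argmax(axis=-1)

# Toy hierarchy: token maps of increasing resolution, sampled scale by
# scale. `fake_logits` stands in for the VAR transformer's output.
prefix = []
for side in (1, 2, 4):
    fake_logits = rng.normal(size=(side, side, 16))
    prefix.append(sample_scale(fake_logits))   # r_<s+1> = r_<s> ∪ {r_s}
```

Because each position is drawn independently given the prefix, nothing in this step enforces intra-scale consistency, which is the gap LSRS targets.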
LSRS for VAR introduces a progressive, test-time rejection sampling mechanism in the latent scale domain. At every selected scale $s$, $m_s$ candidate token maps $\{r_s^{(i)}\}_{i=1}^{m_s}$ are sampled in parallel. Each candidate is fused deterministically with the prefix $r_{<s}$ by a multiscale VQ-VAE upsampler $F$, producing feature maps $e_s^{(i)} = F(r_{<s}, r_s^{(i)})$. A lightweight scoring network $S$ computes a scalar score for each triple $(c, e_s^{(i)}, s)$, assessing both global structure and compatibility with the target class label $c$ (if applicable). The candidate with the highest score is selected to advance the generative chain. This best-of-$m_s$ selection at each scale, particularly at the earliest informative levels, has been shown to drastically reduce autoregressive error accumulation and produce sharper, more coherent images at minimal computational cost (Zheng et al., 3 Dec 2025).
2. Algorithmic Structure of LSRS for VAR
For VAR-based image generation, the formal LSRS procedure is as follows:
- For each scale $s = 1, \dots, K$, compute the conditional distribution $p(r_s \mid r_{<s})$ with the pretrained VAR model.
- Sample $m_s$ latent candidate maps $r_s^{(1)}, \dots, r_s^{(m_s)} \sim p(r_s \mid r_{<s})$.
- For each candidate, construct a feature map $e_s^{(i)} = F(r_{<s}, r_s^{(i)})$.
- Score each candidate as $\text{Score}_i = S(c, e_s^{(i)}, s)$; select $r_s = \arg\max_i \text{Score}_i$.
- Build the prefix iteratively by appending the chosen $r_s$ to $r_{<s}$, giving $r_{<s+1}$.
Pseudocode:
```
Input:  class c, pretrained VAR model, scoring net S, scale count K,
        sample counts {m_1, …, m_K}.
Initialize: empty prefix r_<1> = ∅.
For s = 1 to K do
  1. Compute p(r_s | r_<s) via VAR.
  2. Draw candidates {r_s^(i)}_{i=1}^{m_s} ~ p(r_s | r_<s).
  3. For each i, compute e_s^(i) = F(r_<s, r_s^(i)).
  4. Score_i ← S(c, e_s^(i), s).
  5. Select r_s ← argmax_i Score_i.
  6. Append r_s to prefix: r_<s+1> ← r_<s> ∪ {r_s}.
End for
Return complete token maps {r_1, …, r_K}.
```
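The same loop as executable Python, with hypothetical stand-ins (`sample_candidates`, `fuse`, `score`) in place of the pretrained VAR model, the VQ-VAE upsampler F, and the scoring net S:

```python
import numpy as np

rng = np.random.default_rng(7)
VOCAB = 16

def sample_candidates(prefix, side, m):
    """Stand-in for drawing m token maps from p(r_s | r_<s)."""
    return rng.integers(0, VOCAB, size=(m, side, side))

def fuse(prefix, candidate):
    """Stand-in for the VQ-VAE upsampler F(r_<s, r_s^(i))."""
    return candidate.astype(float)   # a real F would fuse prefix features

def score(class_id, feature, s):
    """Stand-in for the scoring net S(c, e_s^(i), s).
    Toy heuristic: prefer token maps whose values sit near the class id."""
    return -float(np.abs(feature - class_id).mean())

def lsrs_generate(class_id, sides, m_per_scale):
    prefix = []
    for s, (side, m) in enumerate(zip(sides, m_per_scale), start=1):
        cands = sample_candidates(prefix, side, m)
        feats = [fuse(prefix, c) for c in cands]
        scores = [score(class_id, e, s) for e in feats]
        best = int(np.argmax(scores))   # greedy best-of-m selection
        prefix.append(cands[best])      # r_<s+1> = r_<s> ∪ {r_s}
    return prefix

maps = lsrs_generate(class_id=3, sides=(1, 2, 4), m_per_scale=(8, 8, 1))
```

Setting `m_per_scale` to 1 at later scales mirrors the paper's emphasis on spending the candidate budget at the earliest informative levels.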
An acceptance probability-based variant can be defined by drawing the winning candidate from a softmax over scores,
$$P\big(\text{select } r_s^{(i)}\big) = \frac{\exp\!\big(S(c, e_s^{(i)}, s)/\tau\big)}{\sum_{j=1}^{m_s} \exp\!\big(S(c, e_s^{(j)}, s)/\tau\big)},$$
with temperature $\tau$, but greedy top-1 is generally sufficient (Zheng et al., 3 Dec 2025).
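A sketch of such stochastic, score-weighted selection; the temperature value is an assumed knob, not specified in the source:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_select(scores, tau=1.0):
    """Draw one candidate index with probability softmax(scores / tau).

    tau -> 0 recovers greedy top-1 selection; larger tau keeps more
    diversity among candidates at some cost in expected score.
    """
    s = np.asarray(scores, dtype=float) / tau
    p = np.exp(s - s.max())          # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

scores = [0.1, 2.5, 0.3, 2.4]
# With a tiny temperature the draw is effectively greedy (index of 2.5).
greedy_idx = stochastic_select(scores, tau=1e-6)
```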
3. Scoring Model: Architecture and Optimization
The scoring model $S$ is a compact convolutional neural network, ingesting the fused feature $e_s^{(i)}$, a class embedding of $c$, and a scale embedding of $s$. The backbone consists of several (4–6) residual blocks, each with two 3×3 conv–LeakyReLU–LayerNorm layers and a 1×1 conv skip connection. The resulting visual feature is pooled to a fixed spatial size and flattened. The concatenated vector of visual, class, and scale features passes through a two-layer MLP to output the scalar score.
The model is supervised on pairs of feature maps and binary labels $y$, where $y = 1$ for “real” (VQ-VAE codebook) maps and $y = 0$ for “generated” VAR samples. Binary cross-entropy and pairwise ranking losses are both supported; the pairwise approach slightly improves FID on held-out scales. Optimization uses Adam with a batch size of 128 and a cosine decay learning-rate schedule (Zheng et al., 3 Dec 2025).
4. Computational Trade-Offs and Efficiency
Let $T_{\text{VAR}}$ denote vanilla VAR generation time. The additional cost of LSRS when applied on a set of scales $\mathcal{S}$ with $m_s$ candidates per scale is
$$\Delta T = \sum_{s \in \mathcal{S}} m_s \big(t_F(s) + t_S(s)\big),$$
where the per-candidate fusion and scoring cost $t_F(s) + t_S(s)$ is negligible compared to $T_{\text{VAR}}$, and moderate empirical choices of $m_s$ on early scales keep the overall overhead minor. On ImageNet 256×256 with VAR-d30, vanilla FID is 1.95 at 1.0× time; a light LSRS configuration achieves FID 1.78 at 1.01× time, while a heavier candidate budget yields FID 1.66 at 1.15× time. Similar trade-offs hold for other VAR and FlexVAR backbones (Zheng et al., 3 Dec 2025).
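A toy instance of this accounting; all candidate counts and per-candidate timings below are hypothetical illustration values, not measurements from the paper:

```python
# Toy overhead accounting for LSRS on top of vanilla VAR generation.
T_VAR = 1.0                        # vanilla generation time (normalized)
m = {2: 64, 3: 64}                 # hypothetical candidates at two early scales
t_fuse_score = {2: 1e-4, 3: 2e-4}  # hypothetical per-candidate fuse+score cost

delta_T = sum(m[s] * t_fuse_score[s] for s in m)
relative = (T_VAR + delta_T) / T_VAR
print(f"overhead: {delta_T:.4f} -> {relative:.4f}x total time")
```

Because early scales have tiny token maps, even large $m_s$ there adds little wall-clock time, which is why the reported 1.01× configurations are possible.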
5. Empirical Results and Ablation Studies
Key quantitative findings for class-conditional ImageNet 256×256 generation:
| Model | FID ↓ | Time × |
|---|---|---|
| VAR-d30 | 1.95 | 1.00 |
| +LSRS (small candidate budget) | 1.78 | 1.01 |
| +LSRS (large candidate budget) | 1.66 | 1.15 |
Gains plateau once the candidate count reaches roughly $128$; very large budgets ($m_s \ge 256$) may decrease diversity. LSRS is most effective when applied from the earliest informative scales onward; restricting it to only the first scale reduces diversity, while deferring it to later scales deteriorates FID due to uncorrected structural errors (Zheng et al., 3 Dec 2025). Qualitatively, LSRS corrects structural failures (e.g., malformed objects) that manifest in early scales and sharpens local texture even where baseline VAR outputs are reasonable.
6. LSRS in GANs: Latent Importance Reweighting
A distinct LSRS instance is described as “latent rejection sampling” in GANs (Issenhuth et al., 2021). For a pre-trained generator $G$, an MLP $w$ learns to reweight the prior $p(z)$ by importance. After adversarial training of $w$ to match the pushforward of the reweighted prior under $G$ to the empirical data distribution in Wasserstein-1 distance, rejection sampling draws $z \sim p(z)$ and accepts it with probability proportional to $w(z)$. Accepted outputs $G(z)$ are then more likely to match true data, and the method shrinks both FID and earth mover's distance on synthetic and high-dimensional tasks. This approach operates entirely in latent space and is computationally cheaper than post-generator reranking or score-based sampling methods (Issenhuth et al., 2021).
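A minimal sketch of this accept/reject loop in latent space; the Gaussian prior, the toy weight function standing in for the trained MLP, and the bound `w_max` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def w(z):
    """Toy learned importance weight: upweights latents near the origin.
    Stands in for the trained MLP reweighter; illustrative only."""
    return np.exp(-0.5 * np.sum(z**2))

def latent_rejection_sample(dim=2, w_max=1.0, max_tries=10_000):
    """Draw z ~ N(0, I) and accept with probability w(z) / w_max,
    where w_max bounds w so the acceptance probability is valid."""
    for _ in range(max_tries):
        z = rng.normal(size=dim)
        if rng.random() < w(z) / w_max:
            return z          # accepted latent; would be fed to G(z)
    raise RuntimeError("no sample accepted")

zs = np.stack([latent_rejection_sample() for _ in range(500)])
# Accepted latents concentrate in high-weight regions (the origin here):
# their mean squared norm drops below the prior's value of 2.
print("mean squared norm:", np.mean(np.sum(zs**2, axis=1)))
```

No generator forward passes are wasted on rejected draws, which is the source of the method's cheapness relative to post-generator reranking.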
7. Limitations and Future Directions
LSRS for VAR models is fundamentally limited by the discriminative accuracy of the scoring network, especially for large backbone models where the real-vs-generated gap narrows. The method's aggressiveness (greedy top-1 selection) risks diminished sample diversity, motivating future work on temperature-based or stochastic rank-based selection. Adaptive allocation of the candidate budget to scales or classes with higher generation difficulty is another plausible extension, and universal, unconditional, or text-to-image scoring may broaden the method's generality. In the GAN setting, the reweighting network's expressiveness is bounded by soft-clipping to prevent degenerate mode collapse; in both lines of work, stochastic selection and diversity-aware modifications remain open research problems (Zheng et al., 3 Dec 2025, Issenhuth et al., 2021).