Best-of-Many Objective in Generative Modeling

Updated 9 April 2026
  • The paper introduces the Best-of-Many (BoM) objective, which selects the best sample from multiple attempts to improve quality and mitigate mode collapse.
  • BoM is a training and inference strategy that employs a winner-takes-all approach by maximizing over candidate outputs to boost diversity and match latent priors.
  • Empirical results show that integrating BoM into VAE-GAN frameworks and LLM evaluations leads to significant improvements in FID scores and Pass@$k$ performance.

The Best-of-Many (BoM) objective is a class of training and inference strategies in generative modeling, sequence modeling, and LLM evaluation designed to favor accuracy, diversity, and robustness. The BoM principle departs from traditional expectation- or average-based training losses and single-shot inference by explicitly maximizing (or minimizing) over a set of candidate samples or outputs. Central to BoM is the allocation of multiple “attempts” to either match a target (in generative modeling) or surpass a threshold of quality/performance (in evaluation), combined through a suitably chosen aggregation: typically a maximum for likelihood-based training, or a composite of frequency and reward for inference. BoM objectives have been applied to address mode collapse in generative modeling, over-penalization of latent variances in variational frameworks, and suboptimal scaling in LLM Pass@$k$ inference settings (Bhattacharyya et al., 2019, Di et al., 3 Oct 2025, Bhattacharyya et al., 2018).

1. Mathematical Definition and Modeling Formulations

Let $q_\phi(z\mid x)$ denote an encoder producing a latent $z$ given data $x$, and $p_\theta(x\mid z)$ the generative decoder or likelihood. In training generative models, the BoM objective computes $k$ independent reconstructions per data point,

$$z_1, \dots, z_k \sim q_\phi(z\mid x),\quad \ell_i = \log p_\theta(x\mid z_i),$$

and sets the central training signal as

$$\text{BoM}_\text{Rec}(x) := \max_{1 \leq i \leq k} \ell_i$$

or, when $p_\theta$ is a parameterized Gaussian or Laplace,

$$\text{BoM}_\text{Rec}(x) = \max_{1 \leq i \leq k} [ -\lambda \|x - G_\theta(z_i)\|_n ]$$
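The reconstruction term above can be illustrated with a minimal sketch in plain Python. The decoder `toy_decode`, the helper names, and the standard-normal draws standing in for $q_\phi(z\mid x)$ are all hypothetical and chosen only for illustration; they are not taken from the cited papers.

```python
import random


def lp_distance(x, x_hat, n=2):
    """L_n distance between two equal-length vectors."""
    return sum(abs(a - b) ** n for a, b in zip(x, x_hat)) ** (1.0 / n)


def bom_rec(x, latents, decode, lam=1.0, n=2):
    """Return (score, index) of the best of the k candidate reconstructions.

    score = max_i [ -lam * ||x - decode(z_i)||_n ]
    """
    scores = [-lam * lp_distance(x, decode(z), n) for z in latents]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def toy_decode(z):
    """Hypothetical decoder G_theta: maps a scalar latent to a 2-D point."""
    return [z, 2.0 * z]


rng = random.Random(0)
x = [1.0, 2.0]
# k = 8 draws; a standard normal stands in for the encoder q_phi(z | x).
latents = [rng.gauss(0.0, 1.0) for _ in range(8)]
score, winner = bom_rec(x, latents, toy_decode)
```

Only the winning index contributes to the score, so the remaining draws are free to land far from $x$ without being penalized.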

This “winner-takes-all” formulation is distinct from expectation-based objectives such as $\mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)]$. BoM thus allows the posterior $q_\phi(z\mid x)$ to maintain significant entropy, since only the best sample is used for reconstruction, encouraging diversity and mode coverage (Bhattacharyya et al., 2019, Bhattacharyya et al., 2018).

In evaluation and inference settings such as LLM Pass@$k$, BoM variants operate by sampling a budget of candidate responses from the model. Given a reward model and a user-specified $k$, the BoM strategy first filters candidates by an empirical frequency threshold, keeping only responses that occur often enough among the samples, and then selects the top-$k$ candidates in the filtered set according to the reward model. This balances majority robustness and reward-model leverage (Di et al., 3 Oct 2025).

2. Theoretical Motivation and Implications

Conventional VAE or CVAE frameworks maximize a reconstruction expectation, heavily penalizing any posterior that fails to concentrate $q_\phi(z\mid x)$ near a single mode. This drives $q_\phi(z\mid x)$ towards being deterministic (a Dirac delta), creating difficulties matching the latent prior $p(z)$. GAN-based decoders may produce sharp samples but are vulnerable to mode collapse.

By contrast, the BoM objective provides $q_\phi(z\mid x)$ with $k$ “attempts” to produce a high-likelihood latent, relaxing the requirement for every sample to explain $x$ well. The posterior can spread mass over multiple modes while ensuring at least one sample per training instance is a close match, thus achieving both sharp reconstructions and better latent-prior matching (Bhattacharyya et al., 2019, Bhattacharyya et al., 2018). This approach also underpins the theoretical optimality of BoM in inference scaling: it is minimax-optimal for Pass@$k$ regret, providing the best-known asymptotic scaling with $k$ and the sampling budget under model error (Di et al., 3 Oct 2025).

3. Integration in Hybrid Objectives and Algorithmic Details

The BoM principle can be directly embedded in hybrid VAE–GAN frameworks. In these approaches, the overall objective combines:

  • a reconstruction term [BoM via $\max_{1 \leq i \leq k} \ell_i$],
  • a synthetic likelihood adversarial term (using an image or sample critic),
  • and a KL divergence term for latent-prior matching.

The joint BoM-VAE-GAN loss is a weighted sum of these three terms, with scalar scaling coefficients on the terms. In mini-batch SGD, the log-sum-exp over the $k$ samples is approximated by its extremum (plus a constant shift that depends only on $k$), simplifying the computation (Bhattacharyya et al., 2019).
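The extremum approximation of the log-sum-exp can be motivated by a standard two-sided bound, written here for the log-likelihoods $\ell_1, \dots, \ell_k$ of Section 1. Whether the constant shift used in the cited work is exactly the $\log k$ term below is an assumption; the inequalities themselves hold for any real $\ell_i$.

```latex
\max_{1 \le i \le k} \ell_i
\;\le\; \log \sum_{i=1}^{k} e^{\ell_i}
\;\le\; \max_{1 \le i \le k} \ell_i + \log k,
\qquad
\max_{1 \le i \le k} \ell_i - \log k
\;\le\; \log \frac{1}{k} \sum_{i=1}^{k} e^{\ell_i}
\;\le\; \max_{1 \le i \le k} \ell_i .
```

The lower bound on the sum holds because it contains its largest term, and the upper bound because each of the $k$ terms is at most the largest. Since the gap is bounded by $\log k$ and does not depend on the model parameters, optimizing the maximum optimizes a bound on the log-sum-exp.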

A typical computational pipeline (training loop) involves:

  • Drawing $k$ latents per input;
  • Computing the batch of reconstructions and log-likelihoods;
  • Identifying the max-likelihood sample per input;
  • Backpropagating only through the winning reconstruction (plus the KL term);
  • Adversarial updates via a spectral-normalized critic (for synthetic likelihood), and a discriminator in latent space.
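The winner-only backpropagation step in the pipeline above can be sketched on a toy problem. This is a deliberately reduced illustration, not the cited training procedure: it assumes a scalar linear decoder $G(z) = wz + b$ with a hand-written gradient, uses squared error as a stand-in for a Gaussian log-likelihood, draws latents from a fixed standard normal instead of a learned encoder, and omits the KL and adversarial terms entirely. The name `bom_sgd_step` and all constants are hypothetical.

```python
import random


def bom_sgd_step(x, w, b, latents, lr=0.05):
    """One winner-takes-all SGD update of a scalar linear decoder G(z) = w * z + b.

    Returns the updated (w, b) and the winner's squared error before the update.
    """
    residuals = [x - (w * z + b) for z in latents]
    # The winner is the candidate with the smallest squared error,
    # i.e. the highest Gaussian log-likelihood up to constants.
    win = min(range(len(latents)), key=lambda i: residuals[i] ** 2)
    r, z = residuals[win], latents[win]
    # Gradient of r**2 with respect to (w, b), taken through the winner only;
    # the other k - 1 candidates contribute no gradient.
    grad_w, grad_b = -2.0 * r * z, -2.0 * r
    return w - lr * grad_w, b - lr * grad_b, r ** 2


rng = random.Random(0)
w, b = 0.5, 0.0
for _ in range(200):
    latents = [rng.gauss(0.0, 1.0) for _ in range(4)]  # k = 4 draws per input
    w, b, best_loss = bom_sgd_step(3.0, w, b, latents)
```

In an autodiff framework the same effect is usually obtained by selecting the winning loss with an index or a min/max reduction, so that gradients flow only through the selected element.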

In Pass@$k$ inference, BoM follows a two-step procedure:

  1. Filter outputs by empirical frequency, using a threshold set from the reference policy's coverage coefficient;
  2. Among the candidates that meet the threshold, select the best $k$ by reward.
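The two steps can be sketched as follows. The function name `bom_select`, the plain numeric `threshold` (supplied directly here rather than derived from a coverage coefficient), the reward table, and the sample answers are all hypothetical and serve only to make the filter-then-rank logic concrete.

```python
from collections import Counter


def bom_select(candidates, reward, k, threshold):
    """Keep answers whose empirical frequency is >= threshold, then take the top-k by reward.

    Returns at most k distinct answers; the list is empty if nothing passes the filter.
    """
    counts = Counter(candidates)
    total = len(candidates)
    frequent = [answer for answer, c in counts.items() if c / total >= threshold]
    # Rank the surviving distinct answers by reward, highest first.
    return sorted(frequent, key=reward, reverse=True)[:k]


# Eight sampled answers to one question (toy data).
samples = ["42", "42", "41", "42", "7", "41", "13", "41"]
# A toy reward model that badly overrates the rare answer "7".
toy_reward = {"42": 0.2, "41": 0.9, "7": 0.99, "13": 0.5}.get

top1 = bom_select(samples, toy_reward, k=1, threshold=0.25)
top2 = bom_select(samples, toy_reward, k=2, threshold=0.25)
```

In this toy run the answer "7" has the highest reward but appears only once in eight samples, so the frequency filter removes it before the reward model is consulted. That is the sense in which the filter guards against reward-model error.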

4. Empirical Results and Algorithmic Variants

Empirical studies have validated BoM across both generative and inference tasks.

For generative modeling:

  • Synthetic multimodal data: BoM-VAE-GAN achieves nearly complete mode coverage and high sample quality, well above the coverage of the baselines (Bhattacharyya et al., 2019).
  • CIFAR-10: FID improves from the DCGAN and GAN-variant baselines to BoM-VAE-GAN, with a separate result reported for a strong CNN+spectral-norm architecture.
  • CelebA 64×64: FID scores decrease across the SN-GAN baseline and two further GAN baselines (the second combined with spectral normalization), reaching their lowest value with BoM.

In sequence prediction, BoM lowers negative log-likelihood (measured in nats) relative to CVAE and achieves visible gains in sample diversity and sharpness (Bhattacharyya et al., 2018).

In Pass@$k$ inference for LLMs:

  • On math benchmarks (GSM8K, MATH-500, AIME24), BoM outperforms majority voting and Best-of-$N$.
  • BoM performance does not degrade as the sampling budget grows, a key property (scaling-monotonicity) not achieved by baselines.
  • Absolute Pass@$k$ increases over the best baseline in hard settings (Di et al., 3 Oct 2025).
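For reference, the Pass@$k$ metric used in these comparisons counts a problem as solved when at least one of the $k$ submitted answers is correct. A minimal sketch of that computation follows; the dictionary layout, the `is_correct` callback, and the toy data are assumptions made for illustration.

```python
def pass_at_k(submissions, is_correct, k):
    """Fraction of problems for which at least one of the first k submitted answers is correct.

    submissions: dict mapping a problem id to an ordered list of submitted answers.
    is_correct:  callable (problem_id, answer) -> bool.
    """
    solved = sum(
        any(is_correct(problem, answer) for answer in answers[:k])
        for problem, answers in submissions.items()
    )
    return solved / len(submissions)


# Toy data: two problems, two submitted answers each.
gold = {"q1": "4", "q2": "9"}
submitted = {"q1": ["3", "4"], "q2": ["5", "6"]}


def check(problem, answer):
    return gold[problem] == answer


score_at_1 = pass_at_k(submitted, check, k=1)
score_at_2 = pass_at_k(submitted, check, k=2)
```

Because only one of the $k$ answers needs to be right, a selection strategy is rewarded for submitting a set that covers distinct plausible answers rather than $k$ near-duplicates.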

5. Hyperparameter Tuning and Practical Considerations

Practical deployment of BoM objectives involves specifying:

  • Sample size $k$ (for generative BoM, chosen separately for image models and for sequence prediction; larger $k$ trades compute for diversity, with diminishing gains beyond a certain point);
  • Learning rates and optimizer (Adam, with separate learning rates for generators and discriminators and its two momentum coefficients);
  • Architectural details (ResNet/CNN with spectral norm for the image critic; latent dimensionality for CIFAR-10/CelebA);
  • Hinge-loss thresholds and spectral norm scaling (which sets the critic's Lipschitz constant).

In the Pass@$k$ BoM scheme, the frequency threshold is chosen to balance reward-model reliability against coverage, and the sample size is set large enough to ensure sufficient candidate diversity and robust regret bounds.

A key limitation of BoM in generative models is computational cost: $k$ forward passes per training instance. The non-differentiable max is implemented by backpropagating only through the winner, potentially increasing gradient variance early in training (Bhattacharyya et al., 2018).

BoM objectives are closely related to but distinct from:

  • Minimum-over-samples (MoS) or multiple-choice learning, which lack latent-prior KL regularization and can overfit or scatter mass arbitrarily.
  • Importance-weighted autoencoder (IWAE), which weights all samples softly and can still suffer from under-diversification.
  • Standard MLE, which treats only point estimates and fails in multimodal settings.

BoM uniquely combines hard sample selection (maximizing the best attempt) with posterior regularization, ensuring its suitability for both capturing distributional diversity and maintaining proper marginalization in variational frameworks (Bhattacharyya et al., 2019, Bhattacharyya et al., 2018).
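The contrast between averaging, soft weighting, and hard selection can be made concrete by applying three aggregations to the same per-sample log-likelihoods. This sketch simplifies on purpose: the log-mean-exp is applied to raw log-likelihoods rather than to full importance weights as in IWAE, and the numbers are invented for illustration.

```python
import math


def mean_agg(lls):
    """Expectation-style aggregation: plain average of the log-likelihoods."""
    return sum(lls) / len(lls)


def log_mean_exp(lls):
    """Soft aggregation: log of the average likelihood, computed stably."""
    m = max(lls)
    return m + math.log(sum(math.exp(l - m) for l in lls) / len(lls))


def best_of_many(lls):
    """Hard aggregation: only the best sample counts."""
    return max(lls)


lls = [-9.0, -2.5, -0.7, -4.1]  # toy per-sample log-likelihoods
values = (mean_agg(lls), log_mean_exp(lls), best_of_many(lls))
```

For any input the three values are ordered as mean ≤ log-mean-exp ≤ max (the first inequality is Jensen's). The average is dragged down by every poor sample, the soft version by less, and the maximum not at all, which is why hard selection tolerates a broad posterior.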

In inference, theoretical guarantees establish that BoM achieves minimax-optimal regret (matching lower and upper bounds), whereas majority voting and Best-of-$N$ strategies fail to scale with $k$ and the sampling budget. BoM is thus reported as the only scaling-monotonic algorithm among these under realistic reward-model error, which makes it well suited to high-reliability Pass@$k$ evaluation in large-scale LLM deployment (Di et al., 3 Oct 2025).

