Best-of-Many Objective in Generative Modeling

Updated 9 April 2026

The paper introduces the Best-of-Many (BoM) objective, which selects the best sample from multiple attempts to improve quality and mitigate mode collapse.
BoM is a training and inference strategy that employs a winner-takes-all approach by maximizing over candidate outputs to boost diversity and match latent priors.
Empirical results show that integrating BoM into VAE-GAN frameworks and LLM evaluations leads to significant improvements in FID scores and Pass@$k$ performance.

The Best-of-Many (BoM) objective is a class of training and inference strategies in generative modeling, sequence modeling, and LLM evaluation designed to favor accuracy, diversity, and robustness. The BoM principle departs from traditional expectation- or average-based training losses and single-shot inference by explicitly maximizing (or minimizing) over a set of candidate samples or outputs. Central to BoM is the allocation of multiple “attempts” to either match a target (in generative modeling) or surpass a threshold of quality/performance (in evaluation), combined with a suitably chosen aggregation: typically a maximum for likelihood-based training, or a composite of frequency and reward for inference. BoM objectives have been applied to address mode collapse in generative modeling, over-penalization of latent variances in variational frameworks, and suboptimal scaling in LLM Pass@$k$ inference settings (Bhattacharyya et al., 2019, Di et al., 3 Oct 2025, Bhattacharyya et al., 2018).

1. Mathematical Definition and Modeling Formulations

Let $q_\phi(z\mid x)$ denote an encoder producing a latent $z$ given data $x$, and $p_\theta(x\mid z)$ the generative decoder or likelihood. In training generative models, the BoM objective draws $k$ independent latent samples per data point and scores the reconstruction obtained from each: $z_1, \dots, z_k \sim q_\phi(z\mid x),\quad \ell_i = \log p_\theta(x\mid z_i)$. It then sets the central training signal as

$\text{BoM}_\text{Rec}(x) := \max_{1 \leq i \leq k} \ell_i$

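or, when $p_\theta$ is a parameterized Gaussian or Laplace (with $G_\theta$ the decoder network, $\lambda$ a scale factor, and $\|\cdot\|_n$ the corresponding norm),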

$\text{BoM}_\text{Rec}(x) = \max_{1 \leq i \leq k} [ -\lambda \|x - G_\theta(z_i)\|_n ]$

This “winner-takes-all” formulation is distinct from expectation-based objectives such as $\mathbb{E}_{z \sim q_\phi(z\mid x)}[\log p_\theta(x\mid z)]$. BoM thus allows the posterior $q_\phi(z\mid x)$ to maintain significant entropy, since only the best sample is used for reconstruction, encouraging diversity and mode coverage (Bhattacharyya et al., 2019, Bhattacharyya et al., 2018).
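The norm-based form of $\text{BoM}_\text{Rec}$ can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`lp_distance`, `bom_reconstruction`), the toy decoder, and the standard-normal sampler are all assumptions made for the example.

```python
import random

def lp_distance(x, y, n=2):
    """L_n distance between two equal-length vectors."""
    return sum(abs(a - b) ** n for a, b in zip(x, y)) ** (1.0 / n)

def bom_reconstruction(x, sample_latent, decode, k=10, lam=1.0, n=2):
    """Return (best score, winning latent) for max_i [-lam * ||x - G(z_i)||_n]."""
    latents = [sample_latent() for _ in range(k)]
    scores = [-lam * lp_distance(x, decode(z), n) for z in latents]
    best = max(range(k), key=scores.__getitem__)
    return scores[best], latents[best]

# Toy usage (hypothetical): 1-D latent, decoder maps z to the point (z, 2z).
rng = random.Random(0)
x = [1.0, 2.0]
score, z_star = bom_reconstruction(
    x,
    sample_latent=lambda: rng.gauss(0.0, 1.0),  # stand-in for q_phi(z | x)
    decode=lambda z: [z, 2.0 * z],              # stand-in for G_theta
    k=10,
)
```

Only the winning latent `z_star` contributes to the returned score; the other nine draws are discarded, which is what lets the sampler stay broad.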

In evaluation and inference settings such as LLM Pass@$k$, BoM variants operate by sampling $N$ candidate responses from the model. Given a reward model and a user-specified $k$, the BoM strategy first filters candidates via an empirical frequency threshold, retaining only responses whose frequency among the $N$ samples meets the threshold, and then selects the top-$k$ retained candidates according to the reward model. This balances majority robustness and reward-model leverage (Di et al., 3 Oct 2025).

2. Theoretical Motivation and Implications

Conventional VAE or CVAE frameworks maximize a reconstruction expectation, heavily penalizing any posterior that fails to concentrate $q_\phi(z\mid x)$ near a single mode. This drives $q_\phi(z\mid x)$ towards being deterministic (a Dirac delta), which makes it difficult to match the latent prior $p(z)$. GAN-based decoders may produce sharp samples but are vulnerable to mode collapse.

By contrast, the BoM objective provides $q_\phi(z\mid x)$ with $k$ “attempts” to produce a high-likelihood latent, relaxing the requirement for every sample to explain $x$ well. The posterior can spread mass over multiple modes while ensuring at least one sample per training instance is a close match, thus achieving both sharp reconstructions and better latent-prior matching (Bhattacharyya et al., 2019, Bhattacharyya et al., 2018). The same multiple-attempts principle carries over to inference scaling, where BoM is reported to be minimax-optimal for Pass@$k$ regret, providing the best-known asymptotic scaling with $k$ and the sampling budget $N$ under model error (Di et al., 3 Oct 2025).
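A small simulation can make the "attempts" argument concrete. The sketch below assumes a toy setting that is not from the cited papers: the decoder is the identity, the posterior is a Gaussian centred on the target with spread `sigma`, and the error is the absolute difference. It compares the average per-sample error (what an expectation-based loss sees) with the best-of-$k$ error (what BoM sees).

```python
import random
import statistics

def errors_for_spread(x, sigma, k, trials, rng):
    """Average single-sample error vs. average best-of-k error for z ~ N(x, sigma)."""
    mean_err, best_err = [], []
    for _ in range(trials):
        errs = [abs(x - rng.gauss(x, sigma)) for _ in range(k)]
        mean_err.append(statistics.fmean(errs))  # expectation-style signal
        best_err.append(min(errs))               # winner-takes-all signal
    return statistics.fmean(mean_err), statistics.fmean(best_err)

rng = random.Random(0)
results = {
    s: errors_for_spread(x=0.0, sigma=s, k=10, trials=2000, rng=rng)
    for s in (0.1, 1.0, 3.0)
}
for sigma, (avg, best) in results.items():
    print(f"sigma={sigma}: mean error {avg:.3f}, best-of-10 error {best:.3f}")
```

In this toy setting the best-of-$k$ error is never larger than the average error, so a broad posterior is penalized less under the BoM signal than under the expectation.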

3. Integration in Hybrid Objectives and Algorithmic Details

The BoM principle can be directly embedded in hybrid VAE–GAN frameworks. In these approaches, the overall objective combines:

a reconstruction term (BoM via $\max_{1 \leq i \leq k} \ell_i$),
a synthetic likelihood adversarial term (using an image or sample critic),
and a KL divergence term for latent-prior matching.

The joint BoM-VAE-GAN loss is a weighted sum of these three terms, with scalar scaling coefficients on the individual terms. In mini-batch SGD, the log-sum-exp over the $k$ samples is approximated by its extremum (plus a constant shift that depends only on $k$), simplifying the computation (Bhattacharyya et al., 2019).
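The size of that constant shift follows from standard bounds relating a log-average to a maximum. The display below uses the $\ell_i$ defined in Section 1 and assumes the soft objective is the log of an average over the $k$ samples; that reading is an assumption, not a quotation from the paper.

```latex
\max_{1 \le i \le k} \ell_i \;-\; \log k
\;\le\;
\log\!\Big( \frac{1}{k} \sum_{i=1}^{k} e^{\ell_i} \Big)
\;\le\;
\max_{1 \le i \le k} \ell_i
```

The lower bound holds because the sum contains the largest term, and the upper bound because every term is at most the largest. Since $\log k$ does not depend on $\theta$ or $\phi$, the shifted bound $\max_i \ell_i - \log k$ has the same gradient as $\max_i \ell_i$.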

A typical computational pipeline (training loop) involves:

Drawing $k$ latents per input;
Computing the batch of reconstructions and log-likelihoods;
Identifying the max-likelihood sample per input;
Backpropagating only through the winning reconstruction (plus the KL term);
Adversarial updates via a spectral-normalized critic (for synthetic likelihood), and a discriminator in latent space.
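The first four steps of this pipeline can be sketched on a toy problem. The example below is an illustration under strong simplifying assumptions: a scalar linear decoder, a fixed Gaussian stand-in for the posterior, squared error in place of a log-likelihood, and hand-written gradients. The KL term and both adversarial updates are omitted, and all names are hypothetical.

```python
import random

def train_bom_toy(x=2.0, k=5, steps=200, lr=0.05, seed=0):
    """Fit a scalar decoder G(z) = w*z + b using winner-only gradients."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    history = []
    for _ in range(steps):
        # 1) Draw k latents per input (fixed toy posterior N(1, 0.5)).
        zs = [rng.gauss(1.0, 0.5) for _ in range(k)]
        # 2) Reconstruct and score every sample (squared error).
        losses = [(x - (w * z + b)) ** 2 for z in zs]
        # 3) Identify the best sample for this input.
        i_star = min(range(k), key=losses.__getitem__)
        history.append(losses[i_star])
        # 4) Gradient step through the winning reconstruction only.
        z = zs[i_star]
        residual = x - (w * z + b)
        w += lr * 2.0 * residual * z
        b += lr * 2.0 * residual
    return w, b, history

w, b, history = train_bom_toy()
```

The losing samples never touch `w` or `b`, which is the source of the extra compute per step and of the gradient variance noted for early training.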

In Pass@$k$ inference, BoM follows a two-step procedure:

Filter outputs by empirical frequency, using a threshold that depends on the reference policy's coverage coefficient;
Among candidates meeting the threshold, select the best $k$ by reward.
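The two steps can be sketched as follows. The function name `bom_select`, the parameter `alpha` for the frequency threshold, the `>=` comparison, and the toy answers and rewards are all assumptions for illustration; the source does not give the exact threshold used by Di et al.

```python
from collections import Counter

def bom_select(responses, reward, k, alpha):
    """Filter by empirical frequency >= alpha, then keep the top-k by reward."""
    n = len(responses)
    freq = {y: c / n for y, c in Counter(responses).items()}
    # Step 1: keep only answers that appear often enough among the samples.
    survivors = [y for y, f in freq.items() if f >= alpha]
    # Step 2: rank the survivors by reward and keep at most k (may be empty).
    survivors.sort(key=reward, reverse=True)
    return survivors[:k]

# Hypothetical sampled answers and a hypothetical, imperfect reward model.
samples = ["42"] * 5 + ["41"] * 3 + ["7"] * 1 + ["13"] * 1
toy_reward = {"42": 0.8, "41": 0.6, "7": 0.95, "13": 0.1}
picked = bom_select(samples, toy_reward.get, k=2, alpha=0.2)
```

Here the rare answer "7" has the highest reward but is removed by the frequency filter, so a reward-model error on an infrequent response cannot dominate the selection.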

4. Empirical Results and Algorithmic Variants

Empirical studies have validated BoM across both generative and inference tasks.

For generative modeling:

Synthetic multimodal data: BoM-VAE-GAN achieves higher mode coverage and sample quality than the baselines (Bhattacharyya et al., 2019).
CIFAR-10: FID improves relative to DCGAN and a further GAN baseline when using BoM-VAE-GAN, and improves again with a strong CNN and spectral-norm architecture.
CelebA 64x64: FID scores decrease from SN-GAN and two further GAN-variant baselines (one of them combined with spectral normalization) to the lowest value for BoM.

In sequence prediction, BoM lowers negative log-likelihood relative to CVAE and achieves visible gains in sample diversity and sharpness (Bhattacharyya et al., 2018).

In Pass@$k$ inference for LLMs:

On math benchmarks (GSM8K, MATH-500, AIME24), BoM outperforms majority voting and Best-of-$N$, especially at small $k$.
BoM performance does not degrade as the sampling budget $N$ grows, a key property (scaling-monotonicity) not achieved by the baselines.
Absolute Pass@$k$ gains over the best baseline are reported in hard settings (Di et al., 3 Oct 2025).

5. Hyperparameter Tuning and Practical Considerations

Practical deployment of BoM objectives involves specifying:

Sample size $k$ (chosen separately for image models and for sequence prediction; larger $k$ trades compute for diversity, with diminishing gains beyond a certain point);
Learning rates and optimizer (Adam, with separate learning rates for generators and discriminators, plus the remaining optimizer hyperparameters);
Architectural details (ResNet/CNN with spectral norm for the image critic; dataset-specific latent dimensions for CIFAR-10/CelebA);
Hinge-loss thresholds and spectral norm scaling (Lipschitz constant).

In the Pass@$k$ BoM scheme, the frequency threshold is chosen to balance reward-model reliability against coverage, and the sample size is set large enough to ensure sufficient candidate diversity and robust regret bounds.

A key limitation of BoM in generative models is computational cost: $k$ forward passes per training instance. The non-differentiable max is implemented by backpropagating only through the winner, potentially increasing gradient variance early in training (Bhattacharyya et al., 2018).

BoM objectives are closely related to but distinct from:

Minimum-over-samples (MoS) or multiple-choice learning, which lack latent-prior KL regularization and can overfit or scatter mass arbitrarily.
Importance-weighted autoencoder (IWAE), which weights all samples softly and can still suffer from under-diversification.
Standard MLE, which treats only point estimates and fails in multimodal settings.

BoM uniquely combines hard sample selection (maximizing the best attempt) with posterior regularization, ensuring its suitability for both capturing distributional diversity and maintaining proper marginalization in variational frameworks (Bhattacharyya et al., 2019, Bhattacharyya et al., 2018).
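The difference between soft and hard sample weighting can be illustrated with per-sample weights. In the sketch below, `soft_weights` computes normalized importance weights (a softmax over log-weights), which is how IWAE-style gradients weight samples, while `hard_weights` puts all weight on the best sample as in the BoM reconstruction term. The score values are hypothetical, and the same numbers are fed to both functions only for side-by-side comparison.

```python
import math

def soft_weights(log_w):
    """Normalized importance weights: softmax of the per-sample log-weights."""
    m = max(log_w)  # subtract the max for numerical stability
    exp = [math.exp(v - m) for v in log_w]
    total = sum(exp)
    return [e / total for e in exp]

def hard_weights(log_lik):
    """One-hot weight on the best sample (winner-takes-all)."""
    best = max(range(len(log_lik)), key=log_lik.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(log_lik))]

log_scores = [-3.2, -0.5, -0.9, -4.1]  # hypothetical per-sample values
soft = soft_weights(log_scores)
hard = hard_weights(log_scores)
```

With soft weighting every sample receives some gradient signal; with hard weighting only the winner does, leaving the remaining samples free to cover other modes.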

In inference, theoretical guarantees establish that BoM achieves minimax-optimal regret (matching lower and upper bounds), whereas majority voting and Best-of-$N$ strategies fail to scale with $k$ and $N$. Of these strategies, BoM is thus the only one that is scaling-monotonic under realistic reward model error, which makes it well suited to high-reliability Pass@$k$ evaluation in large-scale LLM deployment (Di et al., 3 Oct 2025).

References:

"Best-of-Many-Samples" Distribution Matching (Bhattacharyya et al., 2019)
Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling (Di et al., 3 Oct 2025)
Accurate and Diverse Sampling of Sequences based on a "Best of Many" Sample Objective (Bhattacharyya et al., 2018)