Optimizing Probability Simplex: From Top-K & Top-P (Nucleus) to Best-of-K Samplers
This presentation shows how language model decoding can be understood as a unified convex optimization problem on the probability simplex. Rather than treating greedy, softmax, top-k, and nucleus sampling as disparate heuristics, the work shows they are all special cases of a master optimization framework defined by regularizers and constraints. This perspective enables principled design of new decoders, exemplified by the Best-of-K sampler, which explicitly optimizes for multi-sample coverage and delivers substantial accuracy improvements in high-temperature sampling regimes.

Script
Every time a language model generates a token, it is solving an optimization problem. Greedy decoding, softmax sampling, top-k, nucleus—they all look different on the surface, but this paper proves they are the same problem with different regularizers.
The authors formulate decoding as choosing a distribution that maximizes a score weighted by model preferences, minus a regularization penalty, subject to constraints. This single equation captures greedy, softmax, top-k, and nucleus sampling as special cases defined by which regularizer and constraint set you pick.
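In symbols, the master problem described above can be sketched as follows (the notation here is illustrative, not taken verbatim from the paper):

```latex
p^\star \;=\; \operatorname*{arg\,max}_{p \,\in\, \Delta^{V-1} \,\cap\, \mathcal{C}} \;\; \langle s, p \rangle \;-\; \lambda\, \Omega(p)
```

Here $s$ is the vector of model scores (logits), $\Delta^{V-1}$ is the probability simplex over the $V$-token vocabulary, $\mathcal{C}$ is an optional constraint set (e.g., a support restriction), $\Omega$ is the regularizer, and $\lambda$ its strength. Each classical decoder corresponds to one choice of $(\Omega, \lambda, \mathcal{C})$.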
Let's see how the classics emerge from this framework.
Set the regularization strength to zero and you recover greedy decoding—a spike on the argmax. Choose negative entropy and you get softmax sampling, where temperature is just the regularization coefficient. Top-k and nucleus arise when you add support constraints that restrict which tokens can receive probability mass.
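The four classical decoders can be sketched in a few lines of NumPy. This is an illustrative implementation of the standard definitions, not code from the paper; the function names and signatures are our own.

```python
import numpy as np

def greedy(logits):
    # lambda -> 0: all probability mass collapses onto the argmax token
    p = np.zeros_like(logits, dtype=float)
    p[np.argmax(logits)] = 1.0
    return p

def softmax(logits, temperature=1.0):
    # negative-entropy regularizer: the closed-form solution is tempered softmax,
    # with temperature playing the role of the regularization coefficient
    z = (logits - logits.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def top_k(logits, k, temperature=1.0):
    # support constraint: only the k highest-scoring tokens may receive mass
    p = softmax(logits, temperature)
    keep = np.argsort(logits)[-k:]
    masked = np.zeros_like(p)
    masked[keep] = p[keep]
    return masked / masked.sum()

def top_p(logits, p_threshold, temperature=1.0):
    # nucleus: smallest set of tokens whose softmax mass reaches p_threshold
    p = softmax(logits, temperature)
    order = np.argsort(-p)
    csum = np.cumsum(p[order])
    cutoff = np.searchsorted(csum, p_threshold) + 1
    masked = np.zeros_like(p)
    masked[order[:cutoff]] = p[order[:cutoff]]
    return masked / masked.sum()
```

Note that all four return a renormalized distribution on the simplex; they differ only in which regularizer or support constraint shaped it.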
Solving these optimization problems on the simplex requires care. Standard gradient methods impose Euclidean geometry, which fights the probability constraint. Mirror ascent uses KL divergence and yields multiplicative, natural-gradient-like updates that respect simplex structure and remain stable even when regularizers lack closed forms.
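A minimal sketch of mirror ascent on the simplex: with KL divergence as the mirror map, each step is a multiplicative update followed by renormalization, so iterates never leave the simplex. As a sanity check, running it on the entropy-regularized objective recovers the known closed form, softmax(s / lambda). The step size and iteration count are illustrative choices, not values from the paper.

```python
import numpy as np

def mirror_ascent(grad_fn, n, steps=200, eta=0.1):
    # KL-geometry mirror ascent on the simplex:
    # p <- p * exp(eta * grad), then renormalize (exponentiated gradient)
    p = np.full(n, 1.0 / n)
    for _ in range(steps):
        p = p * np.exp(eta * grad_fn(p))
        p /= p.sum()
    return p

# Sanity check: F(p) = <s, p> - lam * sum_i p_i log p_i
# has the closed-form maximizer softmax(s / lam).
s = np.array([2.0, 1.0, 0.0])
lam = 1.0
grad = lambda p: s - lam * (np.log(p) + 1.0)

p_star = mirror_ascent(grad, n=3)
target = np.exp(s / lam) / np.exp(s / lam).sum()
```

Because the update is multiplicative, probabilities stay strictly positive and no projection step is needed, which is what makes the method usable even when the regularizer has no closed-form solution.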
This framework is not just retrospective—it enables systematic design of new decoders.
The authors introduce Best-of-K, or BoK, which targets a different objective: when you sample K times and rerank, you want good alternatives to appear at least once. The regularizer explicitly models coverage probability—one minus the chance a token is missed in all K draws—and balances it against staying close to the model's softmax distribution.
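To make the coverage idea concrete, here is a Best-of-K-style sketch built from the pieces already described: an assumed objective (our own formulation, not necessarily the paper's exact one) that maximizes model-weighted coverage, 1 - (1 - p_i)^K, minus a KL penalty anchoring p to the base softmax q, solved with the same multiplicative mirror-ascent updates.

```python
import numpy as np

def best_of_k_sampler(logits, K=4, lam=0.5, eta=0.1, steps=300):
    """Illustrative Best-of-K-style sampling distribution (assumed objective):
    maximize sum_i q_i * (1 - (1 - p_i)^K) - lam * KL(p || q),
    i.e., model-weighted K-draw coverage minus a KL anchor to softmax q."""
    q = np.exp(logits - logits.max())
    q /= q.sum()
    p = q.copy()  # warm-start at the model's softmax distribution
    for _ in range(steps):
        # gradient of the objective above w.r.t. p
        grad = q * K * (1.0 - p) ** (K - 1) - lam * (np.log(p / q) + 1.0)
        p = p * np.exp(eta * grad)  # multiplicative mirror-ascent step
        p /= p.sum()
    return p
```

Because the coverage term has diminishing returns in each p_i, the optimizer spreads mass away from the top token, which is exactly the behavior you want when sampling K times and reranking: good alternatives should appear at least once.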
On Qwen models, BoK delivers up to 18.6 percent absolute accuracy improvement on MATH500 at high temperature, with gains most pronounced exactly where vanilla sampling would collapse. It converges in just a few mirror ascent iterations per token, adding negligible runtime, and the improvements hold across mathematical reasoning, science QA, and code generation.
This framework transforms decoding from a collection of heuristics into a principled design process. You select a regularizer that encodes the structure you want—sparsity, entropy, coverage—and let optimization do the rest. It bridges theory and practice, and extends naturally to sequence-level constraints, external verifiers, and retrieval augmentation.
Decoding has always been optimization—we just didn't write it down. This work gives us the notation, the tools, and a concrete example in Best-of-K that outperforms the defaults. To explore the full technical details and empirical results, visit EmergentMind.com.