
Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models (2210.15458v2)

Published 18 Oct 2022 in cs.CL, cs.LG, and stat.ML

Abstract: Decoding methods for LLMs often trade off between diversity of outputs and parallelism of computation. Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. Alternatively, methods such as temperature sampling and its modifications (top-k sampling, nucleus sampling, typical decoding, and others) are embarrassingly parallel, but have no guarantees about duplicate samples. We present a framework for sampling according to an arithmetic code book implicitly defined by an LLM, compatible with common sampling variations, with provable beam diversity under certain conditions, as well as being embarrassingly parallel and providing unbiased and consistent expectations from the original model. We demonstrate the effectiveness of our approach on WMT machine translation, more than halving the standard deviation when estimating expected BLEU score reward, and closing the BLEU score gap between independent sampling and beam search by up to 63%.
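
As a rough illustration of the idea (a minimal sketch, not the authors' exact algorithm), the snippet below decodes each sequence from a single code point in [0, 1): at every step the unit interval is partitioned according to the model's next-token distribution, the token whose sub-interval contains the code is emitted, and the code is rescaled into that sub-interval. Drawing the codes on a randomly shifted lattice gives an evenly spread, embarrassingly parallel "beam." The `next_token_probs(prefix)` callable is a hypothetical stand-in for the language model.

```python
import numpy as np

def arithmetic_sample(next_token_probs, eos_id, code, max_len=50):
    """Decode one sequence from a single code point u in [0, 1).

    Each step partitions [0, 1) according to the model's next-token
    distribution, emits the token whose sub-interval contains the
    current code, and rescales the code into that sub-interval.
    """
    tokens = []
    u = code
    for _ in range(max_len):
        probs = next_token_probs(tokens)          # hypothetical model call
        cdf = np.cumsum(probs)
        tok = int(np.searchsorted(cdf, u, side="right"))
        tok = min(tok, len(probs) - 1)            # guard against float round-off
        tokens.append(tok)
        if tok == eos_id:
            break
        lo = cdf[tok - 1] if tok > 0 else 0.0
        u = (u - lo) / max(probs[tok], 1e-12)     # rescale code into chosen interval
    return tokens

def arithmetic_beam(next_token_probs, eos_id, beam_size, rng):
    """Draw beam_size codes on a randomly shifted lattice in [0, 1),
    so samples cover the code book evenly and each sequence can be
    decoded independently (embarrassingly parallel)."""
    offset = rng.random()
    codes = (offset + np.arange(beam_size)) / beam_size
    return [arithmetic_sample(next_token_probs, eos_id, c) for c in codes]
```

In this toy setup, two codes that fall into different sequence intervals necessarily decode to different outputs, which is the intuition behind the paper's diversity guarantee; the random lattice offset keeps each individual sample distributed according to the model, so averages over the beam remain unbiased.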

Authors (5)
  1. Luke Vilnis (20 papers)
  2. Yury Zemlyanskiy (12 papers)
  3. Patrick Murray (3 papers)
  4. Alexandre Passos (12 papers)
  5. Sumit Sanghai (15 papers)
Citations (5)