A Call for Clarity in Beam Search: How It Works and When It Stops (2204.05424v3)

Published 11 Apr 2022 in cs.CL

Abstract: Text generation with beam search has proven successful in a wide range of applications. We point out that, though largely overlooked in the literature, the commonly-used implementation of beam decoding (e.g., Hugging Face Transformers and fairseq) uses a first-come, first-served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. Based on this finding, we introduce a patience factor, a simple modification to this beam decoding implementation that generalizes the stopping criterion and provides flexibility in the depth of search. Empirical results demonstrate that adjusting this patience factor improves the decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach only modifies one line of code and can thus be readily incorporated into any implementation. Further, we find that different versions of beam decoding result in large performance differences in summarization, demonstrating the need for clarity in specifying the beam search implementation in research work. Our code will be available upon publication.
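
To make the stopping rule concrete, below is a minimal, self-contained Python sketch of beam decoding with the first-come, first-served criterion described in the abstract. It is an illustration under simplifying assumptions, not the authors' or Hugging Face's actual code: `toy_log_probs`, the candidate handling, and the fallback at `max_len` are all hypothetical stand-ins. The patience factor enters as the single changed condition, relaxing "stop once `beam_size` hypotheses have finished" to "stop once `patience * beam_size` hypotheses have finished."

```python
import math
import heapq

EOS = 0  # assumed end-of-sequence token id for this toy example


def toy_log_probs(prefix, vocab_size=5):
    """Deterministic stand-in for a model's next-token log-probabilities."""
    # Favor EOS more as sequences grow, so hypotheses eventually finish.
    scores = [1.0 / (1 + abs(tok - len(prefix))) for tok in range(vocab_size)]
    scores[EOS] += 0.2 * len(prefix)
    total = sum(math.exp(s) for s in scores)
    return [s - math.log(total) for s in scores]


def beam_search_fcfs(beam_size=4, patience=1.0, max_len=20, vocab_size=5):
    """Beam decoding with a first-come, first-served stopping heuristic.

    The stock criterion stops once `beam_size` finished hypotheses have been
    collected; the patience factor generalizes this to
    `patience * beam_size` (the "one line" change, as a sketch).
    """
    beams = [(0.0, [])]   # (cumulative log-prob, token sequence)
    finished = []         # completed hypotheses, kept over time steps

    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            log_probs = toy_log_probs(seq, vocab_size)
            for tok, lp in enumerate(log_probs):
                candidates.append((score + lp, seq + [tok]))
        # Keep the top-scoring expansions.
        candidates = heapq.nlargest(beam_size * 2, candidates, key=lambda c: c[0])

        beams = []
        for score, seq in candidates:
            if seq[-1] == EOS:
                finished.append((score, seq))
                # Generalized first-come, first-served stopping criterion:
                if len(finished) >= patience * beam_size:
                    return max(finished, key=lambda c: c[0])
            else:
                beams.append((score, seq))
            if len(beams) == beam_size:
                break

    # Fall back to the best hypothesis seen if decoding hits max_len.
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[0])


if __name__ == "__main__":
    for p in (1.0, 2.0):
        score, seq = beam_search_fcfs(beam_size=4, patience=p)
        print(f"patience={p}: score={score:.3f}, tokens={seq}")
```

With `patience = 1.0` this reduces to the stock first-come, first-served behavior; other values change how many finished hypotheses are collected before decoding stops, which is the flexibility in search depth that the abstract refers to.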

Authors (6)
  1. Jungo Kasai (38 papers)
  2. Keisuke Sakaguchi (44 papers)
  3. Ronan Le Bras (56 papers)
  4. Dragomir Radev (98 papers)
  5. Yejin Choi (287 papers)
  6. Noah A. Smith (224 papers)