Towards Probabilistically-Sound Beam Search with Masked Language Models
Abstract: Beam search with masked language models (MLMs) is challenging in part because joint probability distributions over sequences are not readily available, unlike for autoregressive models. However, estimating such distributions has important domain-specific applications such as ancient text restoration and protein engineering. Here we present probabilistically-sound methods for beam search with MLMs. First, we clarify the conditions under which it is theoretically sound to perform text infilling with MLMs using standard beam search. When these conditions fail, we provide a probabilistically-sound inference-time modification with no additional computational complexity and demonstrate that it outperforms standard beam search under the expected conditions. We then present empirical results comparing several infilling approaches with MLMs across several domains. Notably, our method probes the inductive biases of MLMs and explores the surprising contextual sensitivity of mask tokens for text infilling.
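To make the setting concrete, the following is a minimal sketch of standard left-to-right beam search for infilling, the baseline the abstract refers to. It is not the paper's modified method. The names `beam_search_infill`, `cond_fn`, and `toy_cond` are hypothetical: `cond_fn` stands in for an MLM queried at one masked position, and `toy_cond` is a hand-written table rather than a real model. The sketch multiplies per-slot conditionals as if they were chain-rule factors, which is exactly the step whose soundness depends on the conditions the paper analyzes.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

MASK = "[MASK]"

# Hypothetical interface: given the current sequence (later slots may still be
# masked) and the index of one masked slot, return {token: probability}.
CondFn = Callable[[Sequence[str], int], Dict[str, float]]


def beam_search_infill(
    tokens: Sequence[str], cond_fn: CondFn, beam_size: int = 2
) -> List[Tuple[float, List[str]]]:
    """Fill masked slots left to right, keeping the top `beam_size` hypotheses.

    Each hypothesis is scored by the sum of log conditionals, i.e. the
    conditionals are treated as factors of a joint distribution (an assumption).
    """
    slots = [i for i, tok in enumerate(tokens) if tok == MASK]
    beams: List[Tuple[float, List[str]]] = [(0.0, list(tokens))]
    for pos in slots:
        candidates: List[Tuple[float, List[str]]] = []
        for score, seq in beams:
            for tok, prob in cond_fn(seq, pos).items():
                if prob <= 0.0:
                    continue  # skip zero-probability tokens to avoid log(0)
                filled = list(seq)
                filled[pos] = tok
                candidates.append((score + math.log(prob), filled))
        # Keep only the highest-scoring partial fills.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams


def toy_cond(seq: Sequence[str], pos: int) -> Dict[str, float]:
    """Toy stand-in for an MLM: depends only on the token to the left."""
    left = seq[pos - 1] if pos > 0 else None
    table = {
        "the": {"black": 0.6, "old": 0.4},
        "black": {"cat": 0.9, "dog": 0.1},
        "old": {"dog": 0.7, "cat": 0.3},
    }
    return table.get(left, {"cat": 0.5, "dog": 0.5})


beams = beam_search_infill(["the", MASK, MASK, "sat"], toy_cond, beam_size=2)
for log_score, seq in beams:
    print(" ".join(seq), round(math.exp(log_score), 4))
```

In this toy setup the conditional ignores the remaining mask tokens, so the product of conditionals is a valid joint by construction. A real MLM conditions on the masks still present to the right, which is why the scores need not correspond to a single joint distribution.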