Amortizing intractable inference in large language models (2310.04363v2)
Abstract: Autoregressive LLMs compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.
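To make the abstract's point about intractable posteriors concrete, here is a minimal sketch of infilling on a toy model. It is an illustration, not the paper's method: the three-token vocabulary `VOCAB`, the bigram table `NEXT`, and the helper names are all hypothetical. By Bayes' rule, the posterior over a middle span Z given a prefix X and suffix Y is proportional to the model's probability of the full sequence XZY, and normalizing it requires summing over every candidate Z.

```python
from itertools import product

# Hypothetical toy bigram "language model" over a 3-token vocabulary.
# NEXT[prev][tok] = p(tok | prev); each row sums to 1. Values are made up.
VOCAB = ["a", "b", "c"]
NEXT = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.2, "c": 0.3},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}


def seq_prob(prev, tokens):
    """Autoregressive probability of `tokens` following the token `prev`."""
    p = 1.0
    for tok in tokens:
        p *= NEXT[prev][tok]
        prev = tok
    return p


def infill_posterior(x, y, length):
    """Exact p(Z | X=x, Y=y) for a middle span Z of `length` tokens.

    Enumerates all len(VOCAB) ** length candidates, which is feasible
    only for toy sizes; this exponential sum is the intractable part.
    """
    joint = {
        z: seq_prob(x, list(z) + [y])  # unnormalized: p(Z | x) * p(y | Z)
        for z in product(VOCAB, repeat=length)
    }
    total = sum(joint.values())
    return {z: p / total for z, p in joint.items()}


posterior = infill_posterior("a", "a", length=2)
best = max(posterior, key=posterior.get)
```

With a realistic vocabulary and span length the enumeration is infeasible, which is the motivation the abstract gives for training a sampler that amortizes this computation.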
Authors:
- Edward J. Hu
- Moksh Jain
- Eric Elmoznino
- Younesse Kaddar
- Guillaume Lajoie
- Yoshua Bengio
- Nikolay Malkin