Energy-Based Diffusion Language Models for Text Generation (2410.21357v4)
Abstract: Despite remarkable progress in autoregressive language models, alternative generative paradigms beyond left-to-right generation are still being actively explored. Discrete diffusion models, with their capacity for parallel generation, have recently emerged as a promising alternative. Unfortunately, these models still underperform their autoregressive counterparts, and the performance gap widens as the number of sampling steps is reduced. Our analysis reveals that this degradation is a consequence of an imperfect approximation used by diffusion models. In this work, we propose the Energy-based Diffusion Language Model (EDLM), an energy-based model operating at the full-sequence level for each diffusion step, introduced to improve the underlying approximation used by diffusion models. More specifically, we introduce the EBM in a residual form and show that its parameters can be obtained either by leveraging a pretrained autoregressive model or by finetuning a bidirectional transformer via noise contrastive estimation. We also propose an efficient generation algorithm via parallel importance sampling. Comprehensive experiments on language modeling benchmarks show that our model consistently outperforms state-of-the-art diffusion models by a significant margin and approaches the perplexity of autoregressive models. We further show that, without any drop in generation quality, our framework offers a 1.3$\times$ sampling speedup over existing diffusion models. Code is available at https://github.com/MinkaiXu/Energy-Diffusion-LLM.
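To make the generation procedure concrete, below is a minimal sketch of one denoising step with a residual-EBM correction applied through parallel importance sampling. The names (`edlm_denoising_step`, `proposal_sample_fn`, `residual_energy_fn`) and signatures are illustrative assumptions, not the interface of the released code.

```python
import torch

def edlm_denoising_step(proposal_sample_fn, residual_energy_fn, num_candidates=8):
    """One denoising step with a residual-EBM correction via self-normalized
    importance sampling (illustrative sketch, not the authors' released API).

    proposal_sample_fn(k) -> LongTensor of shape (k, seq_len): k candidate
        sequences drawn in parallel from the diffusion denoiser q.
    residual_energy_fn(tokens) -> Tensor of shape (k,): residual energy E(x),
        e.g. derived from a pretrained autoregressive model's log-likelihood.
    """
    candidates = proposal_sample_fn(num_candidates)   # (K, L) candidate sequences
    energy = residual_energy_fn(candidates)           # (K,) residual energies
    # Target p(x) is proportional to q(x) * exp(-E(x)); since candidates are
    # drawn from q itself, the importance weight reduces to exp(-E(x_k)).
    weights = torch.softmax(-energy, dim=0)           # self-normalized weights
    idx = torch.multinomial(weights, num_samples=1).item()
    return candidates[idx]
```

Because the candidates come from the diffusion proposal itself, the correction costs only one extra energy evaluation per candidate, and all candidates can be scored in a single batched forward pass.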