Principled Gradient-based Markov Chain Monte Carlo for Text Generation (2312.17710v1)
Abstract: Recent papers have demonstrated the possibility of energy-based text generation by adapting gradient-based sampling algorithms, a paradigm of MCMC algorithms that promises fast convergence. However, as we show in this paper, previous attempts at this approach to text generation all fail to sample correctly from the target language model distributions. To address this limitation, we consider the problem of designing text samplers that are faithful, meaning that they have the target text distribution as their limiting distribution. We propose several faithful gradient-based sampling algorithms that sample correctly from the target energy-based text distribution, and study their theoretical properties. Through experiments on various forms of text generation, we demonstrate that faithful samplers generate more fluent text while better adhering to the control objectives.
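To make the notion of faithfulness concrete: a gradient-based proposal such as an unadjusted Langevin step has a discretization bias, so its limiting distribution is not the target; adding a Metropolis accept/reject correction restores the target as the exact stationary distribution. Below is a minimal sketch of one Metropolis-adjusted Langevin (MALA) step on a toy continuous density. This is an illustration of the general faithful-sampler idea, not the paper's algorithm; the `energy` function and step size `eps` are placeholder assumptions.

```python
import numpy as np

def energy(x):
    # Toy energy U(x) = ||x||^2 / 2, so the target density is
    # p(x) ∝ exp(-U(x)), a standard Gaussian (placeholder choice).
    return 0.5 * np.dot(x, x)

def grad_energy(x):
    return x

def mala_step(x, eps, rng):
    """One Metropolis-adjusted Langevin (MALA) step.

    The Langevin proposal alone is *not* faithful: discretizing the
    Langevin diffusion biases its stationary distribution away from
    the target. The Metropolis-Hastings correction below makes the
    target p(x) ∝ exp(-U(x)) the exact limiting distribution.
    """
    # Langevin proposal: gradient descent step plus Gaussian noise.
    noise = rng.standard_normal(x.shape)
    y = x - eps * grad_energy(x) + np.sqrt(2.0 * eps) * noise

    # log q(b | a) for the asymmetric Gaussian proposal,
    # up to a constant that cancels in the acceptance ratio.
    def log_q(b, a):
        diff = b - (a - eps * grad_energy(a))
        return -np.dot(diff, diff) / (4.0 * eps)

    # Metropolis-Hastings acceptance in log space.
    log_alpha = (energy(x) - energy(y)) + log_q(x, y) - log_q(y, x)
    if np.log(rng.uniform()) < log_alpha:
        return y  # accept the proposal
    return x      # reject: the chain stays at the current state

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
for _ in range(1000):
    x = mala_step(x, eps=0.1, rng=rng)
```

Discrete text distributions require further machinery (the paper studies samplers over token sequences), but the same accept/reject principle is what distinguishes a faithful sampler from a biased one.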