Deep de Finetti: Recovering Topic Distributions from Large Language Models (2312.14226v1)

Published 21 Dec 2023 in cs.CL, cs.AI, cs.LG, and stat.ML

Abstract: Large language models (LLMs) can produce long, coherent passages of text, suggesting that LLMs, although trained on next-word prediction, must represent the latent structure that characterizes a document. Prior work has found that internal representations of LLMs encode one aspect of latent structure, namely syntax; here we investigate a complementary aspect, namely the document's topic structure. We motivate the hypothesis that LLMs capture topic structure by connecting LLM optimization to implicit Bayesian inference. De Finetti's theorem shows that exchangeable probability distributions can be represented as a mixture with respect to a latent generating distribution. Although text is not exchangeable at the level of syntax, exchangeability is a reasonable starting assumption for topic structure. We thus hypothesize that predicting the next token in text will lead LLMs to recover latent topic distributions. We examine this hypothesis using Latent Dirichlet Allocation (LDA), an exchangeable probabilistic topic model, as a target, and we show that the representations formed by LLMs encode both the topics used to generate synthetic data and those used to explain natural corpus data.
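The appeal to de Finetti's theorem can be stated concretely: for an (infinitely) exchangeable sequence of tokens, the joint distribution factors as a mixture of conditionally i.i.d. draws given a latent variable, which in the topic-modeling setting plays the role of the document's topic mixture θ:

p(w_1, \dots, w_n) = \int \Big[ \prod_{i=1}^{n} p(w_i \mid \theta) \Big] \, dP(\theta)

As an illustration of the kind of probing experiment the abstract describes, below is a minimal sketch, not the authors' code: it samples synthetic documents from an LDA generative model, embeds each document with GPT-2 hidden states, and fits a linear probe to recover the document-topic distribution θ. The specific choices here (K, V, mean-pooled last-layer states, scikit-learn's LinearRegression, the toy pseudo-word vocabulary) are assumptions made for the example, not the paper's setup.

```python
# Hypothetical probing sketch: can a linear map from LLM hidden states recover
# the LDA document-topic mixture theta used to generate each synthetic document?
import numpy as np
import torch
from sklearn.linear_model import LinearRegression
from transformers import GPT2Model, GPT2Tokenizer

rng = np.random.default_rng(0)
K, V, N_DOCS, DOC_LEN = 5, 50, 200, 64            # topics, vocab size, corpus size, words per doc
alpha, beta = np.full(K, 0.5), np.full(V, 0.1)     # Dirichlet hyperparameters (illustrative)
phi = rng.dirichlet(beta, size=K)                  # K topic-word distributions, shape (K, V)
vocab = [f"w{i}" for i in range(V)]                # toy pseudo-word vocabulary

docs, thetas = [], []
for _ in range(N_DOCS):
    theta = rng.dirichlet(alpha)                   # document-topic distribution
    z = rng.choice(K, size=DOC_LEN, p=theta)       # per-token topic assignments
    words = [vocab[rng.choice(V, p=phi[k])] for k in z]
    docs.append(" ".join(words))
    thetas.append(theta)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

feats = []
with torch.no_grad():
    for doc in docs:
        ids = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
        hidden = model(**ids).last_hidden_state    # shape (1, seq_len, 768)
        feats.append(hidden.mean(dim=1).squeeze(0).numpy())  # mean-pool over tokens

X, Y = np.stack(feats), np.stack(thetas)
probe = LinearRegression().fit(X[:150], Y[:150])   # linear probe on 150 training docs
print(f"held-out R^2 of linear probe: {probe.score(X[150:], Y[150:]):.3f}")
```

If the probe's held-out R^2 clearly beats a shuffled-label baseline, the pooled representation carries information about θ, which is the qualitative pattern the abstract reports for both synthetic and natural corpus data.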

Authors (5)
  1. Liyi Zhang (10 papers)
  2. R. Thomas McCoy (33 papers)
  3. Theodore R. Sumers (16 papers)
  4. Jian-Qiao Zhu (12 papers)
  5. Thomas L. Griffiths (150 papers)
Citations (5)

