Probabilistic Topic Modelling with Transformer Representations (2403.03737v1)
Abstract: Topic modelling was dominated by Bayesian graphical models for most of the last decade. With the rise of transformers in Natural Language Processing, however, several successful models that rely on straightforward clustering approaches in transformer-based embedding spaces have emerged and consolidated the notion of topics as clusters of embedding vectors. We propose the Transformer-Representation Neural Topic Model (TNTM), which combines the benefits of topic representations in transformer-based embedding spaces with probabilistic modelling. This approach thus unifies the powerful and versatile notion of topics based on transformer embeddings with fully probabilistic modelling, as in models such as Latent Dirichlet Allocation (LDA). We utilize the variational autoencoder (VAE) framework for improved inference speed and modelling flexibility. Experimental results show that our proposed model achieves results on par with various state-of-the-art approaches in terms of embedding coherence while maintaining almost perfect topic diversity. The corresponding source code is available at https://github.com/ArikReuter/TNTM.
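To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation; see the linked repository for that) of a VAE-style topic model in which each topic is a multivariate Gaussian over pre-computed transformer word embeddings and documents are encoded into logistic-normal topic proportions. All names (`EmbeddingTopicVAE`, `n_topics`, `hidden`, the standard-normal prior) are illustrative assumptions and simplifications.

```python
# Minimal sketch (assumptions, not the authors' code) of a VAE topic model whose
# topics are diagonal Gaussians in a transformer embedding space.
# bow: (batch, vocab) word-count matrix; word_emb: (vocab, emb_dim) matrix of
# pre-computed transformer embeddings, e.g. from a sentence-transformer model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingTopicVAE(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int, n_topics: int, hidden: int = 256):
        super().__init__()
        # Encoder: bag-of-words -> parameters of a logistic-normal over topic proportions
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        self.mu = nn.Linear(hidden, n_topics)
        self.logvar = nn.Linear(hidden, n_topics)
        # Topic k is a Gaussian N(topic_mu[k], diag(exp(topic_logvar[k]))) in embedding space
        self.topic_mu = nn.Parameter(torch.randn(n_topics, emb_dim) * 0.1)
        self.topic_logvar = nn.Parameter(torch.zeros(n_topics, emb_dim))

    def topic_word_logits(self, word_emb: torch.Tensor) -> torch.Tensor:
        # Unnormalised log-density of every word embedding under every topic Gaussian -> (K, V)
        var = self.topic_logvar.exp()                                # (K, D)
        diff = word_emb.unsqueeze(0) - self.topic_mu.unsqueeze(1)    # (K, V, D)
        return -0.5 * ((diff ** 2) / var.unsqueeze(1)
                       + self.topic_logvar.unsqueeze(1)).sum(-1)

    def forward(self, bow: torch.Tensor, word_emb: torch.Tensor) -> torch.Tensor:
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()         # reparameterisation trick
        theta = F.softmax(z, dim=-1)                                 # document-topic proportions
        beta = F.softmax(self.topic_word_logits(word_emb), dim=-1)   # topic-word distributions
        recon = torch.log(theta @ beta + 1e-10)                      # LDA-style mixture per word
        nll = -(bow * recon).sum(-1).mean()
        # Simplified KL to a standard-normal prior (the paper's prior/posterior choices differ)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return nll + kl                                              # negative ELBO
```

The decoder mirrors LDA's mixture-of-topics likelihood, but the topic parameters live in the same space as the transformer embeddings, which is the unification the abstract describes; the actual parameterisation, priors, and training procedure of TNTM are given in the paper and repository.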
Authors: Arik Reuter, Anton Thielmann, Christoph Weisser, Benjamin Säfken, Thomas Kneib