
Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian (2402.03067v1)

Published 5 Feb 2024 in cs.CL and cs.AI

Abstract: This paper presents the results of the first application of BERTopic, a state-of-the-art topic modeling technique, to short text written in a morphologically rich language. We applied BERTopic with three multilingual embedding models on two levels of text preprocessing (partial and full) to evaluate its performance on partially preprocessed short text in Serbian. We also compared it to LDA and NMF on fully preprocessed text. The experiments were conducted on a dataset of tweets expressing hesitancy toward COVID-19 vaccination. Our results show that with adequate parameter setting, BERTopic can yield informative topics even when applied to partially preprocessed short text. When the same parameters are applied in both preprocessing scenarios, the performance drop on partially preprocessed text is minimal. Compared to LDA and NMF, judging by the keywords, BERTopic offers more informative topics and gives novel insights when the number of topics is not limited. The findings of this paper can be significant for researchers working with other morphologically rich low-resource languages and short text.

Citations (2)

Summary

  • The paper demonstrates that BERTopic significantly improves topic coherence in Serbian short texts, outperforming traditional models like LDA and NMF.
  • The paper leverages multilingual embedding models to reduce extensive preprocessing while effectively capturing semantic nuances in a morphologically complex language.
  • The paper reveals that BERTopic, particularly with paraphrase-multilingual-mpnet-base-v2, attains a high Topic Diversity score of 0.896 even on partially preprocessed data.

An Analysis of BERTopic in Multilingual Topic Modeling for Serbian Short Texts

The paper investigates the efficacy of BERTopic, an advanced topic modeling technique, applied to short text in the morphologically rich Serbian language. Prior efforts in this domain largely relied on traditional methods such as Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), which require extensive preprocessing and careful parameter tuning. In contrast, BERTopic leverages pre-trained language models to generate document embeddings, enabling richer semantic capture without heavy preprocessing.

The authors conducted experiments on a dataset of Serbian tweets expressing hesitancy toward COVID-19 vaccination. The paper evaluates BERTopic's performance with three multilingual embedding models (distiluse-base-multilingual-cased-v2, paraphrase-multilingual-MiniLM-L12-v2, and paraphrase-multilingual-mpnet-base-v2) at two levels of preprocessing: partial and full. Notably, while LDA and NMF were applied only to fully preprocessed text, BERTopic was additionally evaluated on partially preprocessed data, where it still produced informative topics.
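
To make the setup concrete, the following is a minimal sketch of how BERTopic can be paired with one of the multilingual sentence-transformer models named above. The model name comes from the paper, but the parameter values and the `tweets` placeholder are illustrative assumptions, not the authors' exact configuration.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Partially preprocessed tweets (URLs, mentions, etc. removed; no lemmatization).
tweets = ["..."]  # placeholder: the Serbian COVID-19 vaccine-hesitancy dataset

# One of the three multilingual embedding models evaluated in the paper.
embedding_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

topic_model = BERTopic(
    embedding_model=embedding_model,
    nr_topics=None,                 # leave the number of topics unconstrained
    calculate_probabilities=False,
)

topics, _ = topic_model.fit_transform(tweets)
print(topic_model.get_topic_info().head())
```

Leaving nr_topics=None turns off topic reduction, which matches the unconstrained setting the paper reports as yielding the most informative and novel topics.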

Key Findings and Implications

  1. Topic Coherence and Diversity Outcomes: The research highlights that BERTopic, even with partial preprocessing, produces highly coherent and diverse topics. For instance, the paraphrase-multilingual-mpnet-base-v2 model achieved the highest Topic Diversity (TD) on partially preprocessed text, scoring 0.896 (a small sketch of the TD metric follows this list). Such results underline BERTopic's ability to handle data sparsity and produce semantically rich topics, surpassing both LDA and NMF in Topic Coherence (TC).
  2. Comparative Analysis with LDA and NMF: When applied to fully preprocessed data, BERTopic outperformed LDA in TC, a notable result given that LDA has traditionally been favored for short user-generated content. Although LDA achieved a slightly higher TD score (0.897), the greater coherence of BERTopic's topics makes it the stronger choice for extracting actionable insights.
  3. Preprocessing Levels: The paper shows that with multilingual sentence transformers, BERTopic's reliance on heavy preprocessing is significantly reduced, suggesting a shift in practice for low-resource, morphologically rich languages like Serbian. The reduced need for lemmatization is highlighted as a notable practical benefit.
  4. Parameter Sensitivity and Flexibility: The findings also indicate that BERTopic requires different parameter settings than LDA/NMF and is particularly flexible when the number of topics is left unconstrained, a feature advantageous for exploratory analysis in NLP.
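
The Topic Diversity scores cited in items 1 and 2 can be read as the fraction of unique words among the top keywords of all topics. This is the standard definition; whether the paper uses exactly this formulation, and which top-k it evaluates, are assumptions here. A minimal sketch:

```python
from typing import List

def topic_diversity(topics_keywords: List[List[str]], top_k: int = 10) -> float:
    """TD = unique keywords / (num_topics * top_k); 1.0 means no overlap between topics."""
    top_words = [w for topic in topics_keywords for w in topic[:top_k]]
    return len(set(top_words)) / len(top_words)

# Hypothetical example: top keywords per topic, as produced by a topic model.
example = [
    ["vakcina", "doza", "imunitet"],
    ["vakcina", "nuspojava", "strah"],
]
print(topic_diversity(example, top_k=3))  # 5 unique / 6 total ≈ 0.833
```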

Theoretical and Practical Perspectives

Theoretically, this research confirms BERTopic's adaptability to a non-English context, extending its application range to low-resource languages with complex morphology. Practically, it simplifies the preprocessing pipeline while still extracting informative topics, a notable advance for rapid NLP deployment in low-resource language settings.

Future Directions

The paper points to future research avenues, including applying BERTopic to other datasets to test how well the conclusions generalize. Other directions include exploring its ability to assign topics to new, unseen documents and evaluating it with embeddings from the Serbian-trained BERTić model, which could further improve contextual representation.
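
As a hedged sketch of that last direction, BERTić embeddings could be precomputed and handed to BERTopic at fit time. The model identifier classla/bcms-bertic and the mean-pooling step below are assumptions; the paper does not describe such a setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from bertopic import BERTopic

# Assumption: classla/bcms-bertic is used as the BERTić checkpoint.
tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = AutoModel.from_pretrained("classla/bcms-bertic")

def embed(texts):
    """Mean-pool token embeddings into one vector per document (a common heuristic)."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state      # (batch, seq_len, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)    # zero out padding positions
    return ((out * mask).sum(dim=1) / mask.sum(dim=1)).numpy()

tweets = ["..."]  # placeholder for the Serbian tweet dataset
embeddings = embed(tweets)

# BERTopic accepts precomputed document embeddings at fit time.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(tweets, embeddings=embeddings)
```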

In summary, this paper presents strong empirical evidence that BERTopic offers an efficient and flexible solution for topic modeling in morphologically rich, low-resource language contexts, outperforming traditional models while simplifying preprocessing needs. Its implications span both the theoretical understanding of topic representation and the practical methodologies for NLP in varied linguistic environments.