Evaluation of BERTopic for Belgian Dutch Daily Narratives
This paper by Kandala and Hoemann presents an evaluation of the BERTopic model applied to open-ended daily narratives in Belgian Dutch, contrasting its performance with Latent Dirichlet Allocation (LDA) and KMeans clustering. The paper explores BERTopic's capabilities for modeling informal textual data, addressing the unique challenges posed by the linguistic variability and contextual richness inherent in personal narratives, qualities that are often diluted in social media posts.
Topic Modeling Techniques
The paper begins with a comprehensive overview of BERTopic and compares it against traditional models like LDA and KMeans. BERTopic leverages pre-trained contextual embeddings and density-based clustering to yield semantically coherent topics. This approach contrasts with LDA's probabilistic model, which relies on explicit word co-occurrence, and with KMeans' geometric clustering, which is typically applied to frequency-based vector representations.
BERTopic's pipeline employs several sophisticated techniques: dimensionality reduction via UMAP and clustering with HDBSCAN, followed by topic extraction using class-based TF-IDF (c-TF-IDF). This design enhances BERTopic's ability to maintain semantic integrity, especially within morphologically rich languages like Belgian Dutch.
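The c-TF-IDF step can be illustrated with a minimal sketch. It follows BERTopic's class-based weighting, where documents in a cluster are concatenated into one "class document" and each term t is scored as tf(t, c) * log(1 + A / f(t)), with A the average number of words per class and f(t) the term's total frequency across classes. The cluster contents and tokens below are invented for illustration only:

```python
import math
from collections import Counter

def c_tf_idf(cluster_docs):
    """Class-based TF-IDF: score terms per cluster of concatenated documents.

    cluster_docs: dict mapping cluster id -> list of token lists.
    Returns dict mapping cluster id -> {term: weight}.
    """
    # Concatenate all documents of a cluster into one "class document".
    class_tf = {c: Counter(tok for doc in docs for tok in doc)
                for c, docs in cluster_docs.items()}
    # f(t): total frequency of each term across all classes.
    total_tf = Counter()
    for tf in class_tf.values():
        total_tf.update(tf)
    # A: average number of words per class.
    avg_words = sum(total_tf.values()) / len(class_tf)
    return {
        c: {t: tf_c * math.log(1 + avg_words / total_tf[t])
            for t, tf_c in tf.items()}
        for c, tf in class_tf.items()
    }

# Toy clusters (invented): cluster 0 about studying, cluster 1 about sport.
clusters = {
    0: [["exam", "stress", "study"], ["study", "exam"]],
    1: [["gym", "run"], ["run", "fitness"]],
}
weights = c_tf_idf(clusters)
top_terms = {c: max(w, key=w.get) for c, w in weights.items()}
```

Terms frequent within a cluster but rare elsewhere receive the highest weights, which is what makes the extracted topic words discriminative rather than merely frequent.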
Methodology and Data
The dataset comprises over 24,000 texts from Belgian Dutch daily narratives collected over a 70-day period from native speakers aged 18 to 65. Preprocessing involved removing dataset-specific tags, greetings, and author references, followed by lemmatization using Stanza. A customized stop word list was curated to suit the dataset's linguistic context.
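The cleaning stage described above can be sketched as follows. The tag pattern, greeting list, and stop word list here are invented placeholders rather than the authors' actual resources, and Stanza's Dutch lemmatizer is reduced to a toy lookup table so the sketch stays self-contained:

```python
import re

# Hypothetical dataset-specific tag format and word lists (placeholders,
# not the authors' actual resources).
TAG_PATTERN = re.compile(r"\[[A-Z_]+\]")       # e.g. "[AUTHOR_ID]"
GREETINGS = {"hallo", "dag", "goeiemorgen"}
STOP_WORDS = {"de", "het", "een", "en", "ik"}  # tiny custom stop list

# Toy stand-in for Stanza's Dutch lemmatizer.
LEMMA_TABLE = {"ging": "gaan", "liep": "lopen"}

def preprocess(text):
    # Strip dataset-specific tags, then lowercase and tokenize.
    text = TAG_PATTERN.sub(" ", text).lower()
    tokens = re.findall(r"[a-zà-ÿ]+", text)    # rough Dutch token pattern
    # Lemmatize, then drop greetings and stop words.
    tokens = [LEMMA_TABLE.get(t, t) for t in tokens]
    return [t for t in tokens
            if t not in GREETINGS and t not in STOP_WORDS]

preprocess("[AUTHOR_ID] Hallo, ik ging vandaag naar de les")
# Tags, the greeting, and stop words are removed; "ging" becomes "gaan".
```

In the actual study, the lemmatization step would be handled by Stanza's Dutch pipeline rather than a lookup table.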
The authors detail their choice of the jina-embeddings-v3 model for BERTopic, which outperformed other multilingual embeddings in capturing topic coherence on the Belgian Dutch data. This selection is attributed to Jina's Task LoRA adaptation, excelling in producing semantically consistent embeddings in low-resource languages.
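Concretely, such a setup might be wired together as follows. This is an untested configuration sketch: the parameter values, the custom_dutch_stop_words list, and the narratives input are placeholders, not the authors' reported settings.

```python
# Untested configuration sketch of a BERTopic pipeline as described above.
# All parameter values are illustrative assumptions.
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Multilingual embedding model chosen by the authors.
embedding_model = SentenceTransformer(
    "jinaai/jina-embeddings-v3", trust_remote_code=True
)
# Dimensionality reduction and density-based clustering stages.
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=30, metric="euclidean")
# Vectorizer for the c-TF-IDF topic-extraction stage;
# custom_dutch_stop_words stands in for the curated stop word list.
vectorizer = CountVectorizer(stop_words=custom_dutch_stop_words)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer,
)
# narratives: the list of preprocessed daily-narrative texts.
topics, probs = topic_model.fit_transform(narratives)
```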
Hyperparameter optimization was meticulously conducted across all models to maximize topic coherence and diversity, involving systematic tuning and evaluating coherence metrics such as c_v, c_npmi, u_mass, and c_uci.
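As a concrete example of one of these metrics, UMass coherence (u_mass) can be computed directly from document co-occurrence counts: it averages log((D(wi, wj) + 1) / D(wj)) over ordered pairs of top topic words, where D counts documents containing the given word(s). The tiny corpus below is invented for illustration:

```python
import math

def u_mass(topic_words, docs):
    """Average UMass coherence over ordered word pairs of one topic.

    docs: list of sets of tokens (one set per document).
    """
    def d(*words):
        # Number of documents containing all the given words.
        return sum(all(w in doc for w in words) for doc in docs)

    score, n_pairs = 0.0, 0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((d(wi, wj) + 1) / d(wj))
            n_pairs += 1
    return score / n_pairs

docs = [{"exam", "stress", "study"}, {"study", "exam"}, {"gym", "run"}]
coh_good = u_mass(["exam", "study", "stress"], docs)  # words that co-occur
coh_bad = u_mass(["exam", "gym", "run"], docs)        # mixed-theme words
```

Word sets that frequently co-occur in the same documents score higher than mixed-theme sets, which is the intuition all four coherence metrics share in different forms.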
Results and Analysis
The paper reports that while LDA achieved higher c_v scores compared to BERTopic, human evaluations indicated BERTopic generated more semantically coherent and culturally relevant topics. This disconnect likely stems from LDA's reliance on term co-occurrence, which may not capture nuanced semantic relationships in linguistically varied texts.
Despite LDA's superior c_v scores, BERTopic produced better results on the c_npmi, u_mass, and c_uci metrics, which also derive from word co-occurrence statistics but weight them differently, penalizing incoherent word pairings more directly. KMeans, while effective in certain Dutch textual contexts previously, fell short in coherence and interpretability in this study, likely due to its dependency on frequency-based term associations.
Discussion and Implications
The authors discuss the importance of hybrid evaluation frameworks that incorporate both quantitative metrics and qualitative human judgment, especially for morphologically complex languages. BERTopic's success in identifying coherent themes such as "academic stress," "fitness," and "family time" illustrates its ability to capture linguistically and culturally embedded nuances that may be overlooked by more rigid topic modeling techniques.
This paper underscores the need for adaptable NLP models that can generalize effectively across underrepresented languages and informal narrative forms. While LDA and KMeans provide foundational insights into thematic structures, BERTopic's embedding-driven approach reflects a more contextually aware strategy.
Conclusion
By evaluating topic modeling methodologies on Belgian Dutch daily narratives, this research highlights the advantages of using BERTopic for capturing the subtleties of personal stories. It calls for further development and testing of NLP models that cater to diverse languages and informal narrative styles, preserving cultural distinctiveness while enhancing analytical clarity.
The findings from this paper contribute to a broader understanding of efficient text mining strategies across linguistically diverse datasets, encouraging future exploration into embedding-based approaches and hybrid evaluation models in NLP research.