Evaluation of BERTopic for Belgian Dutch Daily Narratives
This paper by Kandala and Hoemann presents an evaluation of the BERTopic model applied to open-ended daily narratives in Belgian Dutch, contrasting its performance with Latent Dirichlet Allocation (LDA) and KMeans clustering. The paper explores BERTopic's capabilities for modeling informal textual data, addressing the unique challenges posed by the linguistic variability and contextual richness inherent in personal narratives, qualities that are often diluted in social media posts.
Topic Modeling Techniques
The paper begins with a comprehensive overview of BERTopic and compares it against traditional models like LDA and KMeans. BERTopic leverages pre-trained contextual embeddings and density-based clustering to yield semantically coherent topics. This approach contrasts with LDA's probabilistic model, which relies on explicit word co-occurrence, and with KMeans' geometric clustering, which is typically applied to frequency-based vector representations.
BERTopic's pipeline employs several sophisticated techniques: dimensionality reduction via UMAP and clustering with HDBSCAN, followed by topic extraction using class-based TF-IDF (c-TF-IDF). This design enhances BERTopic's ability to maintain semantic integrity, especially within morphologically rich languages like Belgian Dutch.
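The c-TF-IDF step can be illustrated with a minimal sketch. It follows BERTopic's class-based weighting, where documents in a cluster are concatenated into one "class document" and each term t is scored as tf(t, c) * log(1 + A / f(t)), with A the average number of words per class and f(t) the term's total frequency across classes. The cluster contents and tokens below are invented for illustration only:

```python
import math
from collections import Counter

def c_tf_idf(cluster_docs):
    """Class-based TF-IDF: score terms per cluster of concatenated documents.

    cluster_docs: dict mapping cluster id -> list of token lists.
    Returns dict mapping cluster id -> {term: weight}.
    """
    # Concatenate all documents of a cluster into one "class document".
    class_tf = {c: Counter(tok for doc in docs for tok in doc)
                for c, docs in cluster_docs.items()}
    # f(t): total frequency of each term across all classes.
    total_tf = Counter()
    for tf in class_tf.values():
        total_tf.update(tf)
    # A: average number of words per class.
    avg_words = sum(total_tf.values()) / len(class_tf)
    return {
        c: {t: tf_c * math.log(1 + avg_words / total_tf[t])
            for t, tf_c in tf.items()}
        for c, tf in class_tf.items()
    }

# Toy clusters (invented): cluster 0 about studying, cluster 1 about sport.
clusters = {
    0: [["exam", "stress", "study"], ["study", "exam"]],
    1: [["gym", "run"], ["run", "fitness"]],
}
weights = c_tf_idf(clusters)
top_terms = {c: max(w, key=w.get) for c, w in weights.items()}
```

Terms frequent within a cluster but rare elsewhere receive the highest weights, which is what makes the extracted topic words discriminative rather than merely frequent.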
Methodology and Data
The dataset comprises over 24,000 texts from Belgian Dutch daily narratives collected over a 70-day period from native speakers aged 18 to 65. Preprocessing involved removing dataset-specific tags, greetings, and author references, followed by lemmatization using Stanza. A customized stop word list was curated to suit the dataset's linguistic context.
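The cleaning stage described above can be sketched as follows. The tag pattern, greeting list, and stop word list here are invented placeholders rather than the authors' actual resources, and Stanza's Dutch lemmatizer is reduced to a toy lookup table so the sketch stays self-contained:

```python
import re

# Hypothetical dataset-specific tag format and word lists (placeholders,
# not the authors' actual resources).
TAG_PATTERN = re.compile(r"\[[A-Z_]+\]")       # e.g. "[AUTHOR_ID]"
GREETINGS = {"hallo", "dag", "goeiemorgen"}
STOP_WORDS = {"de", "het", "een", "en", "ik"}  # tiny custom stop list

# Toy stand-in for Stanza's Dutch lemmatizer.
LEMMA_TABLE = {"ging": "gaan", "liep": "lopen"}

def preprocess(text):
    # Strip dataset-specific tags, then lowercase and tokenize.
    text = TAG_PATTERN.sub(" ", text).lower()
    tokens = re.findall(r"[a-zà-ÿ]+", text)    # rough Dutch token pattern
    # Lemmatize, then drop greetings and stop words.
    tokens = [LEMMA_TABLE.get(t, t) for t in tokens]
    return [t for t in tokens
            if t not in GREETINGS and t not in STOP_WORDS]

preprocess("[AUTHOR_ID] Hallo, ik ging vandaag naar de les")
# Tags, the greeting, and stop words are removed; "ging" becomes "gaan".
```

In the actual study, the lemmatization step would be handled by Stanza's Dutch pipeline rather than a lookup table.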
The authors detail their choice of the jina-embeddings-v3 model for BERTopic, which outperformed other multilingual embeddings in capturing topic coherence on the Belgian Dutch data. This selection is attributed to Jina's Task LoRA adaptation, excelling in producing semantically consistent embeddings in low-resource languages.
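Concretely, such a setup might be wired together as follows. This is an untested configuration sketch: the parameter values, the custom_dutch_stop_words list, and the narratives input are placeholders, not the authors' reported settings.

```python
# Untested configuration sketch of a BERTopic pipeline as described above.
# All parameter values are illustrative assumptions.
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Multilingual embedding model chosen by the authors.
embedding_model = SentenceTransformer(
    "jinaai/jina-embeddings-v3", trust_remote_code=True
)
# Dimensionality reduction and density-based clustering stages.
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=30, metric="euclidean")
# Vectorizer for the c-TF-IDF topic-extraction stage;
# custom_dutch_stop_words stands in for the curated stop word list.
vectorizer = CountVectorizer(stop_words=custom_dutch_stop_words)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer,
)
# narratives: the list of preprocessed daily-narrative texts.
topics, probs = topic_model.fit_transform(narratives)
```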
Hyperparameter optimization was meticulously conducted across all models to maximize topic coherence and diversity, involving systematic tuning and evaluating coherence metrics such as c_v, c_npmi, u_mass, and c_uci.
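As a concrete example of one of these metrics, UMass coherence (u_mass) can be computed directly from document co-occurrence counts: it averages log((D(wi, wj) + 1) / D(wj)) over ordered pairs of top topic words, where D counts documents containing the given word(s). The tiny corpus below is invented for illustration:

```python
import math

def u_mass(topic_words, docs):
    """Average UMass coherence over ordered word pairs of one topic.

    docs: list of sets of tokens (one set per document).
    """
    def d(*words):
        # Number of documents containing all the given words.
        return sum(all(w in doc for w in words) for doc in docs)

    score, n_pairs = 0.0, 0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((d(wi, wj) + 1) / d(wj))
            n_pairs += 1
    return score / n_pairs

docs = [{"exam", "stress", "study"}, {"study", "exam"}, {"gym", "run"}]
coh_good = u_mass(["exam", "study", "stress"], docs)  # words that co-occur
coh_bad = u_mass(["exam", "gym", "run"], docs)        # mixed-theme words
```

Word sets that frequently co-occur in the same documents score higher than mixed-theme sets, which is the intuition all four coherence metrics share in different forms.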
Results and Analysis
The paper reports that while LDA achieved higher c_v scores compared to BERTopic, human evaluations indicated BERTopic generated more semantically coherent and culturally relevant topics. This disconnect likely stems from LDA's reliance on term co-occurrence, which may not capture nuanced semantic relationships in linguistically varied texts.
Despite LDA's superior c_v scores, BERTopic produced better results on the c_npmi, u_mass, and c_uci metrics, which also derive from word co-occurrence statistics but weight them differently, penalizing incoherent word pairings more directly. KMeans, while effective in certain Dutch textual contexts previously, fell short in coherence and interpretability in this study, likely due to its dependency on frequency-based term associations.
Discussion and Implications
The authors discuss the importance of hybrid evaluation frameworks that incorporate both quantitative metrics and qualitative human judgment, especially for morphologically complex languages. BERTopic's success in identifying coherent themes such as "academic stress," "fitness," and "family time" illustrates its ability to capture linguistically and culturally embedded nuances that may be overlooked by more rigid topic modeling techniques.
This paper underscores the need for adaptable NLP models that can generalize effectively across underrepresented languages and informal narrative forms. While LDA and KMeans provide foundational insights into thematic structures, BERTopic's embedding-driven approach reflects a more contextually aware strategy.
Conclusion
By evaluating topic modeling methodologies on Belgian Dutch daily narratives, this research highlights the advantages of using BERTopic for capturing the subtleties of personal stories. It calls for further development and testing of NLP models that cater to diverse languages and informal narrative styles, preserving cultural distinctiveness while enhancing analytical clarity.
The findings from this paper contribute to a broader understanding of efficient text mining strategies across linguistically diverse datasets, encouraging future exploration into embedding-based approaches and hybrid evaluation models in NLP research.