Papers
Topics
Authors
Recent
2000 character limit reached

Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

Published 28 Oct 2024 in cs.CL | (2410.21054v2)

Abstract: Topic modeling is a key method in text analysis, but existing approaches are limited by assuming one topic per document or fail to scale efficiently for large, noisy datasets of short texts. We introduce Semantic Component Analysis (SCA), a novel topic modeling technique that overcomes these limitations by discovering multiple, nuanced semantic components beyond a single topic in short texts which we accomplish by introducing a decomposition step to the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. It achieves competetive coherence and diversity compared to BERTopic, while uncovering at least double the semantic components and maintaining a noise rate close to zero. Furthermore, SCA is scalable and effective across languages, including an underrepresented one.

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.