Top2Vec: Distributed Representations of Topics
The paper introduces Top2Vec, an approach to topic modeling that uses distributed representations to discover latent semantic structure in document corpora. Traditional methods such as Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) have well-known limitations: they rely on bag-of-words representations that discard word order and word semantics, require the number of topics to be specified in advance, and depend on preprocessing steps such as stop-word removal, stemming, and lemmatization.
Top2Vec addresses these deficiencies by jointly embedding documents and words in a shared semantic space and locating topic vectors within it. Unlike LDA and PLSA, it requires neither a predefined number of topics nor extensive preprocessing, and it captures semantic similarity directly. Because topics are embedded in the same space as document and word vectors, the resulting topics are more semantically informative and representative.
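To make this concrete, the snippet below shows how these properties surface in the open-source top2vec package (a minimal sketch; the constructor arguments and method names are assumptions based on the package's public interface, not details from the paper):

```python
# Minimal usage sketch of the open-source top2vec package
# (pip install top2vec); API names follow its documented interface.
from top2vec import Top2Vec

# Raw documents: no stop-word removal, stemming, or lemmatization needed.
documents = [
    "the spacecraft entered orbit around mars",
    "the shuttle launch was delayed by weather",
    "the team won the championship game last night",
    # ... a real corpus would contain thousands of documents
]

# The number of topics is discovered automatically, not specified up front.
model = Top2Vec(documents, speed="learn", workers=4)

print(model.get_num_topics())
topic_words, word_scores, topic_nums = model.get_topics()
```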
Methodological Insights
Top2Vec hinges on creating a joint document and word semantic space using doc2vec, specifically the Distributed Bag of Words (DBOW) variant, which learns document and word embeddings simultaneously and captures the semantic associations between them. Dense clusters of document vectors are then identified by first reducing dimensionality with Uniform Manifold Approximation and Projection (UMAP) and then clustering with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). For each cluster, the centroid of its document vectors, computed back in the original embedding space, is taken as the topic vector; the word vectors nearest that centroid form the semantic core of the topic.
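The pipeline can be approximated with off-the-shelf libraries. The sketch below, assuming gensim, umap-learn, and hdbscan with their standard APIs, mirrors the steps described above; the parameter values are illustrative rather than guaranteed to match the paper's settings:

```python
# Sketch of the Top2Vec pipeline using gensim, umap-learn, and hdbscan.
# Parameter values are illustrative; the paper's settings may differ.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import umap
import hdbscan

def top2vec_pipeline(tokenized_docs):
    # 1. Joint document/word embeddings via doc2vec DBOW (dm=0),
    #    with dbow_words=1 so word vectors share the document space.
    tagged = [TaggedDocument(words, [i]) for i, words in enumerate(tokenized_docs)]
    d2v = Doc2Vec(tagged, dm=0, dbow_words=1, vector_size=300,
                  window=15, min_count=50, sample=1e-5, epochs=40)
    doc_vecs = np.array([d2v.dv[i] for i in range(len(tagged))])  # gensim 4.x API

    # 2. Reduce dimensionality with UMAP before clustering.
    reduced = umap.UMAP(n_neighbors=15, n_components=5,
                        metric="cosine").fit_transform(doc_vecs)

    # 3. Find dense document clusters with HDBSCAN (label -1 = noise).
    labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)

    # 4. Topic vector = centroid of each cluster's document vectors,
    #    computed in the original embedding space; the nearest word
    #    vectors give the topic's most representative words.
    topics = {}
    for label in set(labels) - {-1}:
        centroid = doc_vecs[labels == label].mean(axis=0)
        topics[label] = d2v.wv.similar_by_vector(centroid, topn=10)
    return topics
```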
Numerical Results and Analysis
The paper provides empirical evidence of Top2Vec's superior topic information gain relative to LDA and PLSA. Using a mutual-information-based metric, the authors show that Top2Vec's topics are significantly more informative about document content. Experiments on the 20 Newsgroups and Yahoo Answers datasets show that Top2Vec consistently achieves higher topic information gain scores, even without stop-word removal, because uninformative words are naturally excluded from topic cores.
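The intuition behind such a mutual-information-based score can be sketched as follows; note this is a simplified illustration, not the paper's exact formulation, and the inputs (a document-term count matrix plus a topic's document and word indices) are hypothetical:

```python
# Simplified sketch of a mutual-information-style topic score.
# The paper's exact "topic information gain" weighting may differ;
# the idea is that informative topic words co-occur with their
# topic's documents far more often than chance would predict.
import numpy as np

def topic_information_gain(doc_term_counts, topic_doc_ids, topic_word_ids):
    """doc_term_counts: (n_docs, n_words) raw count matrix.
    topic_doc_ids / topic_word_ids: indices of a topic's documents
    and its top words (hypothetical inputs for this sketch)."""
    total = doc_term_counts.sum()
    p_w = doc_term_counts.sum(axis=0) / total   # word marginals
    p_d = doc_term_counts.sum(axis=1) / total   # document marginals
    gain = 0.0
    for d in topic_doc_ids:
        for w in topic_word_ids:
            p_wd = doc_term_counts[d, w] / total   # joint probability
            if p_wd > 0:
                # pointwise mutual information, weighted by p(w, d)
                gain += p_wd * np.log2(p_wd / (p_w[w] * p_d[d]))
    return gain
```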
Notably, the results highlight that Top2Vec automatically discovers a larger number of semantically coherent topics, in contrast to the fixed, user-specified topic counts required by traditional models.
Implications and Future Directions
Top2Vec offers a robust framework that relaxes the constraints of classical topic modeling by embracing distributed representations and bypassing extensive preprocessing. The implications are significant for natural language processing tasks that demand semantically rich topic-based analysis, such as document organization, information retrieval, and summarization.
In terms of theoretical impact, the continuous topic representation aligns with modern advances in semantic embedding and offers a versatile foundation for future work. One natural extension is to replace doc2vec with pre-trained language models such as BERT or GPT to supply the semantic embeddings within Top2Vec's architecture, as sketched below.
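The following sketch illustrates that direction: a pre-trained transformer encoder replaces doc2vec while the UMAP and HDBSCAN stages are reused unchanged. The sentence-transformers API and the model name "all-MiniLM-L6-v2" are assumptions for illustration, not part of the original paper:

```python
# Sketch: swapping doc2vec for a pre-trained transformer encoder and
# reusing the same UMAP + HDBSCAN clustering stage. The model name
# "all-MiniLM-L6-v2" is an assumption, not from the paper.
import umap
import hdbscan
from sentence_transformers import SentenceTransformer

documents = ["..."]  # a real corpus with thousands of raw documents

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(documents)  # one embedding per document

reduced = umap.UMAP(n_neighbors=15, n_components=5,
                    metric="cosine").fit_transform(doc_vecs)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
# Topic vectors are again cluster centroids; the nearest-word lookup
# now requires embedding candidate words with the same encoder.
```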
Overall, Top2Vec represents a significant advance in topic modeling, using distributed semantic representations to produce topics that are more semantically coherent and whose number is determined automatically. This work paves the way for more adaptive, automatically tuned topic modeling techniques that take full advantage of modern NLP methods.