A network approach to topic models (1708.01677v2)

Published 4 Aug 2017 in stat.ML, cs.CL, physics.data-an, and physics.soc-ph

Abstract: One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a collection of documents. Despite their success, in particular that of the most widely used variant, Latent Dirichlet Allocation (LDA), and their numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods, using a stochastic block model (SBM) with non-parametric priors, we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.

A Network Approach to Topic Models

The authors propose an approach to topic modeling that draws on methods from community detection in complex networks. The central idea is to recast topic modeling as community detection in a bipartite network of documents and words built from a text corpus. The paper adapts the stochastic block model (SBM) to this setting and presents it as an alternative to the widely used Latent Dirichlet Allocation (LDA) framework that addresses several of LDA's known limitations.

The authors begin by outlining drawbacks of traditional topic models, LDA in particular. One is the reliance on Dirichlet priors, which lack a principled justification and are poorly suited to the statistical properties of real text, notably the word-frequency distributions characterized by Zipf's law. Another is the lack of a principled method for choosing the number of topics, which can lead to overfitting and misinterpretation.

In response, the authors formalize the relationship between topic modeling and network community detection. In particular, they draw a parallel between probabilistic latent semantic indexing (pLSI) and the mixed-membership SBM: both describe latent structure through probabilistic assignments to groups. This perspective makes network-analysis techniques available for inferring the topical structure of text, and it suggests that methodological advances in community detection can carry over to topic models.
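
The parallel can be sketched in a few lines of math. The notation is assumed here for illustration and may differ from the paper's: r indexes topics (groups), \theta_{dr} is the weight of topic r in document d, \phi_{rw} is the weight of word w in topic r, \eta_d is the expected degree (length) of document d, and \omega_{dw} is the expected number of edges between document d and word w.

```latex
% pLSI: each word token in document d first picks a topic r, then a word w
P(w \mid d) = \sum_{r} P(r \mid d)\, P(w \mid r) = \sum_{r} \theta_{dr}\, \phi_{rw},
\qquad \sum_{r} \theta_{dr} = 1, \quad \sum_{w} \phi_{rw} = 1

% Mixed-membership SBM on the document-word network:
% expected number of edges between document node d and word node w
\omega_{dw} = \eta_d \sum_{r} \theta_{dr}\, \phi_{rw} = \eta_d \, P(w \mid d)
```

Under these assumptions the two descriptions share the same mixture structure; the network version only adds the document degree \eta_d as a scale factor.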

The approach represents a corpus as a bipartite multigraph in which document nodes are linked to word nodes, with one edge for each occurrence of a word in a document. On this network the authors fit a non-parametric SBM with hierarchical priors, which avoids the rigid assumptions of Dirichlet distributions. The resulting framework automatically infers the number of topics and the hierarchical relationships among them, clustering documents and words simultaneously at multiple levels of resolution.
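
The network representation can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the function names `build_bipartite_multigraph` and `node_degrees` and the toy corpus are hypothetical, and the multigraph is stored simply as a counter of (document, word) pairs whose value is the edge multiplicity.

```python
from collections import Counter


def build_bipartite_multigraph(docs):
    """Return edge multiplicities for a document-word multigraph.

    docs maps a document id to its list of word tokens (assumed to be
    tokenized already). Each token contributes one edge between the
    document node and the word node.
    """
    edges = Counter()
    for doc_id, tokens in docs.items():
        for word in tokens:
            edges[(doc_id, word)] += 1
    return edges


def node_degrees(edges):
    """Sum edge multiplicities per document node and per word node."""
    doc_degree = Counter()
    word_degree = Counter()
    for (doc_id, word), multiplicity in edges.items():
        doc_degree[doc_id] += multiplicity
        word_degree[word] += multiplicity
    return doc_degree, word_degree


# Hypothetical toy corpus, for illustration only.
toy_corpus = {
    "d1": "network community network model".split(),
    "d2": "topic model topic word".split(),
}

edges = build_bipartite_multigraph(toy_corpus)
doc_degree, word_degree = node_degrees(edges)
```

In this representation the degree of a document node is the document's length in tokens and the degree of a word node is the word's total frequency in the corpus, so the total number of edges equals the number of tokens.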

The results favor the SBM-based topic model over LDA on several artificial and real corpora. Notably, even on synthetic data generated under assumptions favoring LDA, the SBM approach often does better under model-selection criteria such as minimum description length. This suggests that the hierarchical SBM captures the structure of natural-language corpora more effectively and adapts to varying corpus sizes and structures without model parameters, such as the number of topics, having to be set externally.
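
As a brief illustration of the criterion, the description length of a fitted model is commonly written as the negative log of the joint probability of data and parameters. The symbols are assumed for this sketch: A is the observed document-word network, \theta stands for the inferred partition and parameters, and the two candidate model classes are taken to be equally likely a priori.

```latex
% Description length: cost of encoding the data given the model,
% plus the cost of encoding the model itself
\Sigma = -\ln P(A \mid \theta) - \ln P(\theta)

% Comparing two fitted models: the one with the smaller \Sigma is preferred
\Lambda = \frac{P(A \mid \theta_1)\, P(\theta_1)}{P(A \mid \theta_2)\, P(\theta_2)}
        = e^{-(\Sigma_1 - \Sigma_2)}
```

Under this reading, \Lambda > 1 whenever model 1 compresses the data better than model 2, which is the sense in which a shorter description length indicates the better model.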

The work has both theoretical and practical implications. Theoretically, it opens lines of inquiry into the relationship between topic models and network analysis methods, and it suggests cross-fertilization in which advances in one field inform methods in the other. Practically, the unified approach is promising for automated content analysis of large text corpora, offering a more flexible and accurate tool for extracting latent topics.

Finally, the paper anticipates further developments in which methods from network science improve the flexibility and applicability of topic models. Given the rapid growth of digital text data across domains, such improvements are increasingly valuable. The paper's examination and demonstration of network-based methods offers a concrete starting point for this interdisciplinary work.

Authors (3)

Martin Gerlach (24 papers)
Tiago P. Peixoto (45 papers)
Eduardo G. Altmann (52 papers)

Citations (199)
