- The paper introduces ToPMine, a framework that mines and segments phrases to build a bag-of-phrases for more coherent topics.
- It mines phrases with frequent pattern mining, a statistical significance measure, and agglomerative merging, then uses those phrases to constrain topic assignments in the PhraseLDA model.
- User studies show that ToPMine improves topic interpretability while achieving perplexity levels comparable to traditional LDA.
Scalable Topical Phrase Mining from Text Corpora
The paper "Scalable Topical Phrase Mining from Text Corpora" presents the ToPMine framework, a method designed to efficiently and effectively discover topical phrases of mixed lengths from text corpora. This approach seeks to address the limitations of existing topic modeling strategies, which typically rely on unigrams, by extracting and utilizing phrases, thereby enhancing human interpretability of discovered topics without incurring significant computational costs.
Methodology
ToPMine is structured around two primary components: phrase mining with text segmentation, and phrase-constrained topic modeling. The authors first introduce a novel phrase-mining algorithm that segments documents into single- and multi-word phrases based on frequent pattern mining, using a statistical significance measure to filter out improbable candidates. This phrase mining step is the core innovation: it converts each document into a 'bag-of-phrases' rather than the conventional 'bag-of-words'.
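To make the frequent-phrase-mining step concrete, here is a minimal Python sketch of Apriori-style counting of contiguous phrases. The function name `mine_frequent_phrases`, the `min_support` threshold, and the token-list input format are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter

def mine_frequent_phrases(docs, min_support=3, max_len=6):
    """Apriori-style mining of frequent contiguous phrases (sketch).

    docs: list of token lists. Returns {phrase_tuple: count} for all
    contiguous word sequences appearing at least `min_support` times.
    Downward closure: a length-k phrase is counted only if both of its
    length-(k-1) sub-phrases were themselves frequent.
    """
    # Length-1 candidates: plain unigram counts.
    counts = Counter(tok for doc in docs for tok in doc)
    frequent = {(w,): c for w, c in counts.items() if c >= min_support}
    prev = set(frequent)

    for k in range(2, max_len + 1):
        cand = Counter()
        for doc in docs:
            for i in range(len(doc) - k + 1):
                gram = tuple(doc[i:i + k])
                # Prune via downward closure before counting.
                if gram[:-1] in prev and gram[1:] in prev:
                    cand[gram] += 1
        level = {g: c for g, c in cand.items() if c >= min_support}
        if not level:
            break
        frequent.update(level)
        prev = set(level)
    return frequent
```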
The process leverages the downward closure lemma and data antimonotonicity to prune candidates efficiently; because phrases must be contiguous token sequences, the candidate space is far smaller than in general frequent pattern mining. Each document is then segmented with an agglomerative approach that merges adjacent tokens into phrases, guided by a context-specific statistical significance score.
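The agglomerative merging can be sketched as follows. The score here is a t-statistic-style comparison of a merged phrase's observed frequency against the frequency expected under independence, in the spirit of the paper's merging criterion; the O(n²)-per-document greedy loop, the threshold value, and the helper names (`significance`, `segment`) are illustrative simplifications rather than the paper's optimized implementation:

```python
import math

def significance(f_joint, f_left, f_right, total_tokens):
    """How much more often the merged phrase occurs than independence
    would predict, normalized by a Poisson-style variance proxy."""
    mu = f_left * f_right / total_tokens
    return (f_joint - mu) / math.sqrt(f_joint)

def segment(doc, phrase_counts, total_tokens, threshold=2.0):
    """Greedy agglomerative segmentation of one document (sketch).

    Repeatedly merges the adjacent pair of segments with the highest
    significance score until no pair exceeds `threshold`. Assumes
    phrase_counts maps token tuples to corpus frequencies (e.g., the
    output of mine_frequent_phrases above).
    """
    segs = [(w,) for w in doc]
    while len(segs) > 1:
        best_i, best_score = None, threshold
        for i in range(len(segs) - 1):
            merged = segs[i] + segs[i + 1]
            # Downward closure guarantees both halves are frequent
            # whenever the merged phrase is, so the lookups are safe.
            if merged not in phrase_counts:
                continue
            score = significance(phrase_counts[merged],
                                 phrase_counts[segs[i]],
                                 phrase_counts[segs[i + 1]],
                                 total_tokens)
            if score > best_score:
                best_i, best_score = i, score
        if best_i is None:
            break
        segs[best_i:best_i + 2] = [segs[best_i] + segs[best_i + 1]]
    return segs
```

Given `phrase_counts = mine_frequent_phrases(docs)` and `total_tokens = sum(len(d) for d in docs)`, calling `segment(doc, phrase_counts, total_tokens)` returns the document as a list of phrase tuples, ready to be treated as a bag of phrases.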
Following the segmentation of texts into phrases, the paper introduces PhraseLDA, a topic modeling algorithm that incorporates the mined phrases as constraints. PhraseLDA extends the Latent Dirichlet Allocation (LDA) framework by treating each phrase as a coherent unit, ensuring that all words within a phrase share the same topic assignment. Through a chain graph model, PhraseLDA enforces this constraint during inference while keeping computational cost comparable to LDA's, with topic quality validated by perplexity measurements.
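The phrase constraint itself is easy to illustrate: instead of sampling one topic per word, the sampler draws one topic per phrase and updates the counts for all of its words jointly. The sketch below shows collapsed Gibbs sampling with this constraint; it is a minimal illustration of the shared-topic idea, not the paper's exact chain-graph inference, and the function name `phrase_lda_gibbs` and the hyperparameter defaults are assumptions:

```python
import random
from collections import defaultdict

def phrase_lda_gibbs(docs, K, iters=200, alpha=0.5, beta=0.01, seed=0):
    """Collapsed Gibbs sampling with phrase constraints (illustrative).

    docs: list of documents, each a list of phrases, each a tuple of
    words. One topic is sampled per *phrase*, so every word in a phrase
    shares the same topic assignment -- the constraint PhraseLDA adds.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for p in d for w in p})
    n_dk = defaultdict(int)   # document-topic word counts
    n_kw = defaultdict(int)   # topic-word counts
    n_k = defaultdict(int)    # topic totals
    z = []                    # one topic assignment per phrase

    # Random initialization, counting every word of each phrase.
    for d, doc in enumerate(docs):
        z.append([])
        for p in doc:
            k = rng.randrange(K)
            z[d].append(k)
            for w in p:
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, p in enumerate(doc):
                k_old = z[d][j]
                for w in p:   # remove the whole phrase's counts at once
                    n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
                weights = []
                for k in range(K):
                    # Joint conditional over the phrase's words; for
                    # simplicity, repeated words within a phrase are
                    # not special-cased in the numerator.
                    w_k = n_dk[d, k] + alpha
                    for i, w in enumerate(p):
                        w_k *= (n_kw[k, w] + beta) / (n_k[k] + V * beta + i)
                    weights.append(w_k)
                k_new = rng.choices(range(K), weights=weights)[0]
                z[d][j] = k_new
                for w in p:
                    n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
    return z
```

The design point to notice is that the sampler's inner loop still touches each word once per iteration, which is why the constraint adds essentially no asymptotic cost over standard LDA.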
Results and Evaluation
ToPMine was evaluated on varied datasets, including computer science paper titles and abstracts, news articles, and Yelp consumer reviews, demonstrating significant improvements in interpretability without sacrificing performance. Two user studies validated topic coherence and phrase quality, with results indicating that ToPMine offers better topical separation and interpretability than competing methods. PhraseLDA also achieves perplexity comparable to LDA's, supporting the assumption that the words in a mined phrase tend to belong to the same topic.
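For context, the perplexity reported in such evaluations is the standard held-out measure (lower is better), computed from the model's likelihood of unseen documents:

$$\text{perplexity}(D_{\text{test}}) = \exp\!\left(-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right),$$

where $M$ is the number of test documents, $\mathbf{w}_d$ the words of document $d$, and $N_d$ its length.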
Implications and Future Directions
The work's implications are twofold: practically, it offers a scalable method for text analysis that could benefit fields requiring human interpretability of large textual datasets, such as the digital humanities and social sciences; theoretically, it opens avenues for exploring nonparametric model variants and more advanced phrase-merging techniques to further refine phrase quality and model scalability.
Furthermore, the framework's flexibility in handling phrases of varying lengths and its reduced complexity make ToPMine applicable to diverse domains and data types without extensive parameter tuning, a significant advantage over prior state-of-the-art methods. As text collections grow increasingly large and complex, innovations like ToPMine that address both efficiency and interpretability represent valuable developments in computational linguistics and artificial intelligence research.