An Overview of The Author-Topic Model for Authors and Documents
The paper "The Author-Topic Model for Authors and Documents" by Rosen-Zvi et al. proposes a generative model that integrates authorship information into topic modeling. The work extends the well-known Latent Dirichlet Allocation (LDA) model to capture the interests of individual authors, which is particularly useful when organizing and searching large document collections.
Model Description
The main contribution of the paper is the Author-Topic (AT) model, a generative model that assigns each author a multinomial distribution over topics and each topic a multinomial distribution over words. For documents with multiple authors, the document's content is modeled as a mixture of the topic distributions of its authors.
In technical terms, generating a document under this model involves the following steps (a minimal simulation sketch follows the list):
- Drawing a distribution over topics for each author, and a distribution over words for each topic, from Dirichlet priors.
- For each word in the document, selecting an author uniformly at random from the document's authors.
- Drawing a topic from the selected author's topic distribution.
- Finally, drawing a word from the selected topic's word distribution.
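The sketch below simulates this generative story with NumPy. The corpus sizes and hyperparameter values (`alpha`, `beta`) are illustrative placeholders, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
n_topics, vocab_size, n_authors = 4, 1000, 10
alpha, beta = 0.5, 0.01

# theta[a]: author a's distribution over topics, drawn from Dirichlet(alpha)
theta = rng.dirichlet(alpha * np.ones(n_topics), size=n_authors)
# phi[t]: topic t's distribution over words, drawn from Dirichlet(beta)
phi = rng.dirichlet(beta * np.ones(vocab_size), size=n_topics)

def generate_document(author_ids, n_words):
    """Generate one document following the AT model's generative story."""
    words = []
    for _ in range(n_words):
        a = rng.choice(author_ids)            # pick an author uniformly
        t = rng.choice(n_topics, p=theta[a])  # topic from that author's mix
        w = rng.choice(vocab_size, p=phi[t])  # word from that topic
        words.append(w)
    return words

doc = generate_document(author_ids=[0, 3], n_words=50)
```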
For inference, exact methods are computationally intractable for corpora of realistic size. The authors therefore use collapsed Gibbs sampling: the author-topic and topic-word multinomials are integrated out analytically, and the sampler repeatedly draws an (author, topic) pair for each word conditioned on the assignments of all other words.
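Below is a minimal sketch of one sweep of such a collapsed sampler, assuming count matrices `n_at` (author-topic) and `n_tw` (topic-word) track the current assignments; the variable names and data layout are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def gibbs_sweep(docs, doc_authors, z, x, n_at, n_tw, n_t, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling for the AT model.

    docs[d][i]      : i-th word (vocabulary index) of document d
    doc_authors[d]  : author indices for document d
    z[d][i], x[d][i]: current topic / author assignment of that word
    n_at[a, t]      : words assigned to author a and topic t
    n_tw[t, w]      : occurrences of word w assigned to topic t
    n_t[t]          : total words assigned to topic t
    """
    n_topics, vocab_size = n_tw.shape
    for d, words in enumerate(docs):
        authors = np.asarray(doc_authors[d])
        for i, w in enumerate(words):
            a, t = x[d][i], z[d][i]
            # Remove the current assignment from the counts.
            n_at[a, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
            # Joint conditional over (author, topic), up to normalization:
            # word-given-topic factor times topic-given-author factor.
            p_w = (n_tw[:, w] + beta) / (n_t + vocab_size * beta)        # (T,)
            p_at = (n_at[authors] + alpha) / (
                n_at[authors].sum(axis=1, keepdims=True)
                + n_topics * alpha)                                      # (A_d, T)
            probs = (p_at * p_w).ravel()
            probs /= probs.sum()
            # Sample a new (author, topic) pair and restore the counts.
            idx = rng.choice(probs.size, p=probs)
            a, t = authors[idx // n_topics], idx % n_topics
            x[d][i], z[d][i] = a, t
            n_at[a, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
```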
Experimental Evaluation
The model is validated using two datasets: 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. The authors compare the proposed AT model with standard LDA and an author model (which directly relates authors to words without topics). Here are the key findings:
- In terms of perplexity, which measures the model's ability to predict words in unseen documents, the AT model outperforms the simpler author model and performs comparably to LDA when only a few words of a test document have been observed. As more of a document is observed, LDA adapts better, because its per-document topic distribution is not tied to the authors' prior output. (The standard perplexity computation is sketched after this list.)
- Furthermore, the AT model provides functionality beyond perplexity: it can address tasks such as finding similar authors and computing the entropy of each author's topic distribution, which could be highly useful in practical settings such as reviewer recommendation.
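As a reference point, perplexity is the exponentiated negative per-word log-likelihood of held-out text, so lower is better. A generic helper (the log-likelihood values shown are made up for illustration):

```python
import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    """Perplexity over held-out documents: exp of the negative
    average per-word log-likelihood. Lower is better."""
    return np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_lengths))

# Two held-out documents with hypothetical log-likelihoods.
print(perplexity(np.array([-350.0, -410.0]), np.array([120, 150])))
```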
For instance, the empirical results show intuitive groupings of authors and topics: classic machine learning topics such as reinforcement learning, mixture models, and Bayesian learning are clearly recovered, and the entropy scores single out authors whose work spans many topics. A sketch of both author-level measures follows.
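Both measures are simple functions of the learned author-topic distributions. A minimal sketch, assuming each author's topic distribution is available as a probability vector; the symmetric KL divergence used here is one standard choice of distance between authors:

```python
import numpy as np

def author_entropy(theta_a):
    """Entropy (in bits) of an author's topic distribution;
    higher values indicate output spread across many topics."""
    p = np.asarray(theta_a)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def symmetric_kl(theta_a, theta_b, eps=1e-12):
    """Symmetric KL divergence between two authors' topic
    distributions; smaller values mean more similar interests."""
    p = np.asarray(theta_a) + eps
    q = np.asarray(theta_b) + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))
```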
Implications and Future Directions
The AT model represents a significant step toward leveraging authorship information for better document modeling. By characterizing authors' interests more faithfully, it could improve a variety of applications, including:
- Expert recommendation systems where matching reviewers with papers is crucial.
- Author disambiguation in academic databases.
- Enhanced search and recommendation systems in digital libraries.
Future work could integrate citation and co-authorship network data for richer modeling of academic relationships, combine stylometric features for improved author identification, and develop faster sampling algorithms to scale to even larger corpora.
Overall, the AT model demonstrates substantial improvements in document modeling by harnessing structure inherent in document collections: the authors and the topics they write about.