A Polya Urn Document Language Model for Improved Information Retrieval

Published 3 Feb 2015 in cs.IR | (1502.00804v2)

Abstract: The multinomial language model has been one of the most effective models of retrieval for over a decade. However, the multinomial distribution does not model one important linguistic phenomenon relating to term dependency, namely the tendency of a term to repeat itself within a document (i.e., word burstiness). In this article, we model document generation as a random process with reinforcement (a multivariate Polya process) and develop a Dirichlet compound multinomial language model that captures word burstiness directly. We show that the new reinforced language model can be computed as efficiently as current retrieval models, and with experiments on an extensive set of TREC collections, we show that it significantly outperforms the state-of-the-art language model for a number of standard effectiveness metrics. Experiments also show that the tuning parameter in the proposed model is more robust than in the multinomial language model. Furthermore, we develop a constraint for the verbosity hypothesis and show that the proposed model adheres to the constraint. Finally, we show that the new language model essentially introduces a measure closely related to idf, which gives theoretical justification for combining the term and document event spaces in tf-idf type schemes.
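To make the idea of reinforcement concrete, the following is a minimal sketch of a Polya urn draw next to an ordinary multinomial draw. It is an illustration only, not the paper's retrieval model: the function names, the toy vocabulary, the uniform starting counts, and the "largest term share" statistic are all assumptions introduced here. In the urn version, each drawn term is returned to the urn with one extra copy, so a term that has already appeared becomes more likely to appear again, which is the burstiness effect the abstract describes.

```python
import random
from collections import Counter


def sample_polya_document(initial_counts, length, rng):
    """Draw `length` terms from a Polya urn (hypothetical helper).

    After each draw the chosen term gains one extra copy in the urn,
    so earlier occurrences reinforce later ones.
    """
    urn = dict(initial_counts)
    document = []
    for _ in range(length):
        terms = list(urn)
        weights = [urn[t] for t in terms]
        term = rng.choices(terms, weights=weights, k=1)[0]
        document.append(term)
        urn[term] += 1  # reinforcement step
    return document


def sample_multinomial_document(initial_counts, length, rng):
    """Draw `length` terms independently with fixed probabilities."""
    terms = list(initial_counts)
    weights = [initial_counts[t] for t in terms]
    return rng.choices(terms, weights=weights, k=length)


def mean_top_term_share(sampler, initial_counts, length, n_docs, rng):
    """Average fraction of a document taken up by its most frequent term."""
    total = 0.0
    for _ in range(n_docs):
        doc = sampler(initial_counts, length, rng)
        total += Counter(doc).most_common(1)[0][1] / length
    return total / n_docs


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = {"urn": 1, "model": 1, "term": 1, "query": 1, "index": 1}
    polya = mean_top_term_share(sample_polya_document, vocab, 50, 200, rng)
    multi = mean_top_term_share(sample_multinomial_document, vocab, 50, 200, rng)
    print(f"mean top-term share, Polya urn:   {polya:.3f}")
    print(f"mean top-term share, multinomial: {multi:.3f}")
```

With uniform starting counts, documents drawn from the urn tend to be dominated by a few repeated terms, while the multinomial documents spread their occurrences more evenly across the vocabulary.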
