
A high-reproducibility and high-accuracy method for automated topic classification (1402.0422v1)

Published 3 Feb 2014 in stat.ML, cs.IR, cs.LG, and physics.soc-ph

Abstract: Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic classification. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches for community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy, and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable.

Authors (6)
  1. Andrea Lancichinetti (11 papers)
  2. M. Irmak Sirer (1 paper)
  3. Jane X. Wang (21 papers)
  4. Daniel Acuna (6 papers)
  5. Konrad Körding (1 paper)
  6. Luís A. Nunes Amaral (5 papers)
Citations (97)
