Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence (2303.17324v1)

Published 30 Mar 2023 in cs.CL

Abstract: Extracting and identifying latent topics in large text corpora has gained increasing importance in NLP. Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach to topic extraction and interpretability. We propose a method that incorporates a deeper understanding of both sentence and document themes, and goes beyond simply analyzing word frequencies in the data. This allows our model to detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human-level results on the word-intrusion task. We demonstrate the competitive performance of our method in a large benchmark study, and achieve superior results compared to state-of-the-art topic modeling and document clustering models.
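The abstract names two techniques: extracting topics from semantic embeddings rather than word frequencies, and an automated word-intrusion evaluation. The sketch below illustrates one plausible reading of both, assuming the sentence-transformers and scikit-learn libraries; the model name, function names, and clustering choice are illustrative assumptions, not the authors' implementation. Because the candidate vocabulary is supplied independently of the corpus, a topic can surface words that never occur in the documents, which is the property the abstract highlights.

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer  # assumed encoder, not the authors' exact stack

def extract_topics(docs, candidate_vocab, n_topics=10, top_k=10, model=None):
    """Cluster document embeddings, then label each cluster with the candidate
    words whose embeddings lie closest to the cluster centroid (hypothetical
    sketch of embedding-based topic extraction)."""
    model = model or SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
    doc_emb = np.asarray(model.encode(docs))                  # (n_docs, dim)
    word_emb = np.asarray(model.encode(candidate_vocab))      # (n_words, dim)

    centroids = KMeans(n_clusters=n_topics, n_init=10).fit(doc_emb).cluster_centers_

    # Cosine similarity between each topic centroid and each candidate word.
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    sims = c @ w.T                                            # (n_topics, n_words)
    return [[candidate_vocab[i] for i in np.argsort(-row)[:top_k]] for row in sims]

def pick_intruder(words, model):
    """Embedding-space proxy for the human word-intrusion task: flag the word
    with the lowest mean cosine similarity to the rest of the topic."""
    emb = np.asarray(model.encode(words))
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    mean_sim = ((emb @ emb.T).sum(axis=1) - 1.0) / (len(words) - 1)  # drop self-similarity
    return words[int(np.argmin(mean_sim))]
```

Running pick_intruder on a topic's top words plus one word planted from another topic, and scoring how often its choice matches human annotators, gives a correlation-style evaluation of the kind the abstract reports.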

Authors (5)
  1. Anton Thielmann (9 papers)
  2. Quentin Seifert (1 paper)
  3. Arik Reuter (9 papers)
  4. Elisabeth Bergherr (7 papers)
  5. Benjamin Säfken (12 papers)
Citations (2)
