
Topic Modelling: Going Beyond Token Outputs (2401.12990v1)

Published 16 Jan 2024 in cs.CL and cs.LG

Abstract: Topic modelling is a text mining technique for identifying salient themes across a collection of documents. The output is commonly a set of topics consisting of isolated tokens that often co-occur in such documents, and interpreting a topic's description from these tokens typically requires manual effort. However, from a human's perspective, such outputs may not provide enough information to infer the meaning of the topics; thus, their interpretability is often assessed inaccurately. Although several studies have attempted to automatically extend topic descriptions as a means of enhancing the interpretation of topic models, they rely on external language sources that may become unavailable, must be kept up-to-date to generate relevant results, and present privacy issues when training on or processing data. This paper presents a novel approach to extending the output of traditional topic modelling methods beyond a list of isolated tokens. The approach removes the dependence on external sources by using the textual data itself: high-scoring keywords are extracted from the documents and mapped to the topic model's token outputs. To measure the interpretability of the proposed outputs against those of the traditional topic modelling approach, independent annotators manually scored each output for quality and usefulness, and the efficiency of the annotation task was recorded. Compared to the outputs of a traditional topic modelling method, the proposed approach achieved higher quality and usefulness scores as well as higher annotation efficiency, indicating an increase in interpretability.
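As a concrete illustration of the idea described in the abstract, the sketch below fits a standard LDA topic model and then extends each topic's token list with high-scoring keyword phrases drawn from the corpus itself, with no external knowledge source. This is a minimal sketch under stated assumptions, not the paper's exact method: the 20 Newsgroups sample, the TF-IDF n-gram keyword scorer, the token-overlap mapping, and the helper `extend_topic` are all illustrative choices made here.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# Small sample of a public corpus (assumption: stands in for the paper's own data).
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]

# 1. Traditional topic model: LDA over unigram counts, keeping the top tokens per topic.
cv = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
X = cv.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X)
vocab = cv.get_feature_names_out()
topic_tokens = [[vocab[i] for i in comp.argsort()[-10:][::-1]] for comp in lda.components_]

# 2. High-scoring keyword phrases extracted from the corpus itself
#    (TF-IDF over 2-3 grams stands in for the paper's keyword-extraction step).
tfidf = TfidfVectorizer(ngram_range=(2, 3), stop_words="english", max_df=0.8, min_df=5)
K = tfidf.fit_transform(docs)
phrases = tfidf.get_feature_names_out()
phrase_scores = np.asarray(K.max(axis=0).todense()).ravel()  # best score each phrase reaches

# 3. Map keywords to topics: keep the highest-scoring phrases that share a token
#    with the topic, yielding a richer description than isolated tokens.
def extend_topic(tokens, top_n=5):
    hits = [(phrases[j], phrase_scores[j])
            for j in range(len(phrases))
            if any(tok in phrases[j].split() for tok in tokens)]
    return [p for p, _ in sorted(hits, key=lambda h: -h[1])[:top_n]]

for t, tokens in enumerate(topic_tokens):
    print(f"Topic {t:2d} tokens:   {', '.join(tokens)}")
    print(f"          extended: {', '.join(extend_topic(tokens))}")
```

The overlap-based mapping above is deliberately simple; the paper evaluates its outputs with independent human annotators scoring quality, usefulness, and annotation efficiency, which a sketch like this cannot reproduce.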
