Addressing Topic Granularity and Hallucination in Large Language Models for Topic Modelling (2405.00611v1)
Abstract: Large language models (LLMs), with their strong zero-shot topic extraction capabilities, offer an alternative to probabilistic topic modelling and closed-set topic classification approaches. As zero-shot topic extractors, LLMs are expected to follow human instructions and generate relevant, non-hallucinated topics for the given documents. However, LLM-based topic modelling approaches often struggle to produce topics at the granularity specified in human instructions, frequently yielding many near-duplicate topics. Furthermore, methods for addressing the hallucinated topics that LLMs generate have not yet been investigated. In this paper, we focus on addressing the issues of topic granularity and hallucination for better LLM-based topic modelling. To this end, we introduce a novel approach that leverages Direct Preference Optimisation (DPO) to fine-tune open-source LLMs such as Mistral-7B. Rather than relying on traditional human annotation to rank preferred answers, our approach employs a reconstruction pipeline that modifies the raw topics generated by the LLM, enabling a fast and efficient training and inference framework. Comparative experiments show that our fine-tuning approach not only significantly improves the LLM's capability to produce more coherent, relevant, and precise topics, but also reduces the number of hallucinated topics.
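For concreteness, the sketch below shows how such a preference-based fine-tune might look using Hugging Face's TRL library. This is a minimal sketch, assuming TRL as the tooling, a particular Mistral-7B checkpoint, and toy preference pairs; none of these specifics (checkpoint name, prompt format, beta, batch size) are taken from the paper itself.

```python
# Minimal sketch of the DPO fine-tuning step described in the abstract,
# written against Hugging Face TRL's DPOTrainer. The model name, dataset
# fields, and hyperparameters below are illustrative assumptions, not the
# authors' exact configuration; TRL argument names vary across versions.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed Mistral-7B checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference pairs produced automatically, with no human ranking:
# "chosen" is the reconstruction-pipeline output at the requested
# granularity; "rejected" is the raw LLM output containing
# near-duplicate and hallucinated topics. The texts are toy examples.
pairs = Dataset.from_list([{
    "prompt": "Extract one general topic from the document:\n<document text>",
    "chosen": "Topic: healthcare policy",
    "rejected": "Topics: hospital waiting lists; hospital funding; moon bases",
}])

args = DPOConfig(output_dir="mistral7b-topic-dpo", beta=0.1,
                 per_device_train_batch_size=1, num_train_epochs=1)
trainer = DPOTrainer(model=model, args=args, train_dataset=pairs,
                     processing_class=tokenizer)  # `tokenizer=` in older TRL
trainer.train()
```

Because DPO optimises the policy directly on (chosen, rejected) pairs, no separate reward model is needed; the paper's key move is generating those pairs automatically via topic reconstruction instead of human preference annotation.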
- E. Scott Adler and John Wilkerson. 2018. Congressional bills project: 1995–2018.
- Nikolaos Aletras and Mark Stevenson. 2014. Labelling topics using unsupervised graph-based methods. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 631–636.
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
- Jonathan Chang, Sean Gerrish, Chong Wang, Jordan Boyd-Graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems, 22.
- Enhanced short text modeling: Leveraging large language models for topic refinement. arXiv preprint arXiv:2403.17706.
- Rob Churchill and Lisa Singh. 2022. The evolution of topic modeling. ACM Computing Surveys, 54(10s):1–35.
- Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
- Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.
- Automatic labeling of topics. In 2009 Ninth International Conference on Intelligent Systems Design and Applications, pages 1227–1232. IEEE.
- Automatic labeling hierarchical topics. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 2383–2386.
- Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. 2007. Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 490–499.
- Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations.
- Yida Mu, Chun Dong, Kalina Bontcheva, and Xingyi Song. 2024. Large language models offer an alternative to the traditional approach of topic modelling. arXiv preprint arXiv:2403.16248.
- David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108.
- OpenAI. 2024. ChatGPT [large language model]. https://chat.openai.com.
- Long Ouyang, Jeff Wu, Xu Jiang, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Chau Minh Pham, Alexander Hoyle, Simeng Sun, et al. 2023. TopicGPT: A prompt-based topic modeling framework. arXiv preprint arXiv:2311.01449.
- Rafael Rafailov, Archit Sharma, Eric Mitchell, et al. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
- GPTopic: Dynamic and interactive topic representations. arXiv preprint arXiv:2403.03628.
- Towards interpreting topic models with ChatGPT. In The 20th World Congress of the International Fuzzy Systems Association.
- Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Ike Vayansky and Sathish AP Kumar. 2020. A review of topic modeling methods. Information Systems, 94:101582.
- Hanna M. Wallach. 2006. Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984.
- Xiaojun Wan and Tianming Wang. 2016. Automatic labeling of topic models using text summaries. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2297–2305.
- Lean Wang, Lei Li, Damai Dai, et al. 2023. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855.
Authors: Yida Mu, Peizhen Bai, Kalina Bontcheva, Xingyi Song