Large Language Models Enable Few-Shot Clustering (2307.00524v1)
Abstract: Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether an LLM can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs for post-correction). We find that incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use.
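The "during clustering" stage can be illustrated with a minimal sketch: spend a small budget of pairwise queries (in the paper's setting, a prompted LLM judging whether two texts belong together; stubbed out below with a numeric heuristic), merge the answers into must-link groups, and use those groups to seed ordinary k-means. All names and the query-budget scheme here are illustrative assumptions, not the paper's actual implementation.

```python
import itertools
import random

def llm_must_link(a, b):
    # Hypothetical stand-in for an LLM pairwise query
    # ("do these two items express the same intent?").
    # Replaced here by a simple numeric heuristic so the sketch runs offline.
    return abs(a - b) < 1.0

def seeded_kmeans_1d(points, k, oracle, n_queries=30, iters=25, seed=0):
    """Seed k-means centroids from a limited budget of pairwise oracle queries."""
    rng = random.Random(seed)
    # 1. Ask the oracle about a random sample of pairs, within the query budget.
    pairs = list(itertools.combinations(range(len(points)), 2))
    rng.shuffle(pairs)
    parent = list(range(len(points)))  # union-find over must-link answers

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in pairs[:n_queries]:
        if oracle(points[i], points[j]):
            parent[find(i)] = find(j)

    # 2. Use the k largest oracle-linked groups to seed the centroids.
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    seeds = sorted(groups.values(), key=len, reverse=True)[:k]
    centroids = [sum(g) / len(g) for g in seeds]

    # 3. Standard Lloyd iterations starting from the seeded centroids.
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            buckets[nearest].append(p)
        centroids = [sum(b) / len(b) if b else centroids[ci]
                     for ci, b in enumerate(buckets)]
    return centroids
```

The same skeleton extends to text by swapping the numeric points for sentence embeddings and the heuristic oracle for an actual LLM prompt; the key trade-off the abstract describes is the size of `n_queries` (cost) versus how well the seeds reflect the user's intended grouping (accuracy).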