Language Model as an Annotator: Unsupervised Context-aware Quality Phrase Generation (2312.17349v1)
Abstract: Phrase mining is a fundamental text mining task that aims to identify quality phrases from context. Nevertheless, the scarcity of extensive gold-label datasets, which demand substantial annotation effort from experts, makes this task exceptionally challenging. Furthermore, the emerging, infrequent, and domain-specific nature of quality phrases presents additional difficulties. In this paper, we propose LMPhrase, a novel unsupervised context-aware quality phrase mining framework built upon large pre-trained language models (LMs). Specifically, we first mine quality phrases as silver labels by applying a parameter-free probing technique called Perturbed Masking to the pre-trained LM BERT (coined the Annotator). In contrast to typical statistics-based or distantly supervised methods, these silver labels, derived from a large pre-trained LM, take into account the rich contextual information captured by the model. As a result, they offer distinct advantages in preserving the informativeness, concordance, and completeness of quality phrases. Second, because training a discriminative span prediction model relies heavily on massive annotated data and risks overfitting to the silver labels, we instead formalize phrase tagging as a sequence generation problem by directly fine-tuning the sequence-to-sequence pre-trained LM BART on the silver labels (coined the Generator). Finally, we merge the quality phrases from both the Annotator and the Generator as the final predictions, given their complementary nature and distinct characteristics. Extensive experiments show that LMPhrase consistently outperforms all existing competitors across two phrase mining tasks of different granularity, where each task is evaluated on two datasets from different domains.
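To make the Annotator step concrete, below is a minimal sketch (not the authors' released code) of the Perturbed Masking probe (Wu et al., 2020) applied to BERT through the Hugging Face `transformers` library: the impact of token j on token i is the distance between BERT's representation of token i when only i is masked and when both i and j are masked. The function name `impact_matrix` and the `bert-base-uncased` checkpoint are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def impact_matrix(sentence: str) -> torch.Tensor:
    """Impact of token j on token i via two-stage masking (Perturbed Masking)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    n = ids.size(0)
    mask_id = tokenizer.mask_token_id
    impact = torch.zeros(n, n)
    with torch.no_grad():
        for i in range(1, n - 1):                  # skip [CLS] and [SEP]
            masked_i = ids.clone()
            masked_i[i] = mask_id                  # mask token i
            h_i = model(masked_i.unsqueeze(0)).last_hidden_state[0, i]
            for j in range(1, n - 1):
                if j == i:
                    continue
                masked_ij = masked_i.clone()
                masked_ij[j] = mask_id             # additionally mask token j
                h_ij = model(masked_ij.unsqueeze(0)).last_hidden_state[0, i]
                impact[i, j] = torch.dist(h_i, h_ij)  # Euclidean distance as impact
    return impact
```

Contiguous spans whose tokens strongly affect one another in this matrix are candidates for silver-label phrases; the exact span-scoring heuristic used by LMPhrase is not reproduced here. Likewise, a hedged sketch of the Generator idea: phrase tagging is cast as sequence generation by fine-tuning BART so that, given a sentence, it emits its quality phrases as a delimited string. The target format (phrases joined by " ; "), the example pair, and the `facebook/bart-base` checkpoint are assumptions for illustration, not details from the paper.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

sentence = "We fine-tune a sequence-to-sequence model for quality phrase mining."
silver = "sequence-to-sequence model ; quality phrase mining"  # hypothetical silver labels

# One gradient step on a single (sentence, silver-label) pair; a real run would
# loop over the silver-labeled corpus with an optimizer such as AdamW.
inputs = tokenizer(sentence, return_tensors="pt")
labels = tokenizer(silver, return_tensors="pt")["input_ids"]
loss = model(**inputs, labels=labels).loss
loss.backward()

# Inference: decode the generated phrase sequence for a sentence.
generated = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```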
- Phrasescope: An effective and unsupervised framework for mining high quality phrases, in: Demeniconi, C., Davidson, I. (Eds.), Proceedings of the 2021 SIAM International Conference on Data Mining, SDM 2021, Virtual Event, April 29 - May 1, 2021, SIAM. pp. 639--647. URL: https://doi.org/10.1137/1.9781611976700.72, doi:10.1137/1.9781611976700.72.
- Applying a generic sequence-to-sequence model for simple and effective keyphrase generation. CoRR abs/2201.05302. URL: https://arxiv.org/abs/2201.05302, arXiv:2201.05302.
- What does BERT look at? An analysis of BERT's attention, in: Linzen, T., Chrupala, G., Belinkov, Y., Hupkes, D. (Eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@ACL 2019, Florence, Italy, August 1, 2019, Association for Computational Linguistics. pp. 276--286. URL: https://doi.org/10.18653/v1/W19-4828, doi:10.18653/v1/W19-4828.
- A nonparametric method for extraction of candidate phrasal terms, in: Knight, K., Ng, H.T., Oflazer, K. (Eds.), ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25-30 June 2005, University of Michigan, USA, The Association for Computer Linguistics. pp. 605--613. URL: https://aclanthology.org/P05-1075/, doi:10.3115/1219840.1219915.
- BERT: pre-training of deep bidirectional transformers for language understanding, in: Burstein, J., Doran, C., Solorio, T. (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics. pp. 4171--4186. URL: https://doi.org/10.18653/v1/n19-1423, doi:10.18653/v1/n19-1423.
- Scalable topical phrase mining from text corpora. Proc. VLDB Endow. 8, 305--316. URL: http://www.vldb.org/pvldb/vol8/p305-ElKishky.pdf, doi:10.14778/2735508.2735519.
- Linguistic terms and concepts. Macmillan International Higher Education.
- KPTimes: A large-scale dataset for keyphrase generation on news documents, in: van Deemter, K., Lin, C., Takamura, H. (Eds.), Proceedings of the 12th International Conference on Natural Language Generation, INLG 2019, Tokyo, Japan, October 29 - November 1, 2019, Association for Computational Linguistics. pp. 130--135. URL: https://aclanthology.org/W19-8617/, doi:10.18653/v1/W19-8617.
- Large-scale evaluation of keyphrase extraction models, in: Huang, R., Wu, D., Marchionini, G., He, D., Cunningham, S.J., Hansen, P. (Eds.), JCDL ’20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event, China, August 1-5, 2020, ACM. pp. 271--278. URL: https://doi.org/10.1145/3383583.3398517, doi:10.1145/3383583.3398517.
- Making science simple: Corpora for the lay summarisation of scientific literature, in: Goldberg, Y., Kozareva, Z., Zhang, Y. (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 10589--10604. doi:10.18653/v1/2022.emnlp-main.724.
- UCPhrase: Unsupervised context-aware quality phrase tagging, in: Zhu, F., Ooi, B.C., Miao, C. (Eds.), KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, ACM. pp. 478--486. URL: https://doi.org/10.1145/3447548.3467397, doi:10.1145/3447548.3467397.
- Automatic keyphrase extraction: A survey of the state of the art, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, The Association for Computer Linguistics. pp. 1262--1273. URL: https://doi.org/10.3115/v1/p14-1119, doi:10.3115/v1/p14-1119.
- What does BERT learn about the structure of language?, in: Korhonen, A., Traum, D.R., Màrquez, L. (Eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics. pp. 3651--3657. URL: https://doi.org/10.18653/v1/p19-1356, doi:10.18653/v1/p19-1356.
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R. (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics. pp. 7871--7880. URL: https://doi.org/10.18653/v1/2020.acl-main.703, doi:10.18653/v1/2020.acl-main.703.
- CITPM: A cluster-based iterative topical phrase mining framework, in: Navathe, S.B., Wu, W., Shekhar, S., Du, X., Wang, X.S., Xiong, H. (Eds.), Database Systems for Advanced Applications - 21st International Conference, DASFAA 2016, Dallas, TX, USA, April 16-19, 2016, Proceedings, Part I, Springer. pp. 197--213. URL: https://doi.org/10.1007/978-3-319-32025-0_13, doi:10.1007/978-3-319-32025-0_13.
- Efficiently mining high quality phrases from texts, in: Singh, S., Markovitch, S. (Eds.), Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, AAAI Press. pp. 3474--3481. URL: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14823.
- DGST: a dual-generator network for text style transfer, in: Webber, B., Cohn, T., He, Y., Liu, Y. (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. pp. 7131--7136. URL: https://aclanthology.org/2020.emnlp-main.578, doi:10.18653/v1/2020.emnlp-main.578.
- Mining quality phrases from massive text corpora, in: Sellis, T.K., Davidson, S.B., Ives, Z.G. (Eds.), Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, ACM. pp. 1729--1744. URL: https://doi.org/10.1145/2723372.2751523, doi:10.1145/2723372.2751523.
- Effective approaches to attention-based neural machine translation, in: Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., Marton, Y. (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, The Association for Computational Linguistics. pp. 1412--1421. URL: https://doi.org/10.18653/v1/d15-1166, doi:10.18653/v1/d15-1166.
- Deep keyphrase generation, in: Barzilay, R., Kan, M. (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, Association for Computational Linguistics. pp. 582--592. URL: https://doi.org/10.18653/v1/P17-1054, doi:10.18653/v1/P17-1054.
- Towards the web of concepts: Extracting concepts from large datasets. Proc. VLDB Endow. 3, 566--577. URL: http://www.vldb.org/pvldb/vldb2010/pvldb_vol3/R50.pdf, doi:10.14778/1920841.1920914.
- Named entity aware transfer learning for biomedical factoid question answering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 19, 2365--2376.
- Deep contextualized word representations, in: Walker, M.A., Ji, H., Stent, A. (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), Association for Computational Linguistics. pp. 2227--2237. URL: https://doi.org/10.18653/v1/n18-1202, doi:10.18653/v1/n18-1202.
- KILT: a benchmark for knowledge intensive language tasks, in: Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tür, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y. (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics. pp. 2523--2544. URL: https://doi.org/10.18653/v1/2021.naacl-main.200, doi:10.18653/v1/2021.naacl-main.200.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1--140:67. URL: http://jmlr.org/papers/v21/20-074.html.
- Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, The Association for Computer Linguistics. URL: https://doi.org/10.18653/v1/p16-1162, doi:10.18653/v1/p16-1162.
- Automated phrase mining from massive text corpora. IEEE Trans. Knowl. Data Eng. 30, 1825--1837. URL: https://doi.org/10.1109/TKDE.2018.2812203, doi:10.1109/TKDE.2018.2812203.
- Learning named entity tagger using domain-specific dictionary, in: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Association for Computational Linguistics. pp. 2054--2064. URL: https://doi.org/10.18653/v1/d18-1230, doi:10.18653/v1/d18-1230.
- HiExpan: Task-guided taxonomy construction by hierarchical tree expansion, in: Guo, Y., Farooq, F. (Eds.), Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, ACM. pp. 2180--2189. URL: https://doi.org/10.1145/3219819.3220115, doi:10.1145/3219819.3220115.
- Sequence to sequence learning with neural networks, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3104--3112. URL: https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html.
- What do you learn from context? probing for sentence structure in contextualized word representations, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net. URL: https://openreview.net/forum?id=SJzSgnRcKX.
- PCAE: A framework of plug-in conditional auto-encoder for controllable text generation. Knowl. Based Syst., 109766. URL: https://www.sciencedirect.com/science/article/pii/S0950705122008942.
- Attention is all you need, in: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998--6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
- Mining infrequent high-quality phrases from domain-specific corpora, in: d’Aquin, M., Dietze, S., Hauff, C., Curry, E., Cudré-Mauroux, P. (Eds.), CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, ACM. pp. 1535--1544. URL: https://doi.org/10.1145/3340531.3412029, doi:10.1145/3340531.3412029.
- Perturbed masking: Parameter-free probing for analyzing and interpreting BERT, in: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R. (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics. pp. 4166--4176. URL: https://doi.org/10.18653/v1/2020.acl-main.383, doi:10.18653/v1/2020.acl-main.383.
- History-based attention in seq2seq model for multi-label text classification. Knowl. Based Syst. 224, 107094. URL: https://doi.org/10.1016/j.knosys.2021.107094, doi:10.1016/j.knosys.2021.107094.
- A unified generative framework for various NER subtasks, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, Association for Computational Linguistics. pp. 5808--5822. URL: https://doi.org/10.18653/v1/2021.acl-long.451, doi:10.18653/v1/2021.acl-long.451.
- GCN-based document representation for keyphrase generation enhanced by maximizing mutual information. Knowl. Based Syst. 243, 108488. URL: https://doi.org/10.1016/j.knosys.2022.108488, doi:10.1016/j.knosys.2022.108488.
- Dimsum @LaySumm 20: BART-based approach for scientific document summarization. CoRR abs/2010.09252. URL: https://arxiv.org/abs/2010.09252, arXiv:2010.09252.
- Aspect sentiment triplet extraction: A seq2seq approach with span copy enhanced dual decoder. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 2729--2742. URL: https://ieeexplore.ieee.org/abstract/document/9857593/.