On Leveraging Encoder-only Pre-trained Language Models for Effective Keyphrase Generation (2402.14052v1)
Abstract: This study addresses the application of encoder-only pre-trained language models (PLMs) to keyphrase generation (KPG), motivated by the fact that domain-tailored encoder-only models are more widely available than domain-tailored encoder-decoder models. We investigate three core questions: (1) the efficacy of encoder-only PLMs in KPG, (2) the optimal architectural decisions for employing encoder-only PLMs in KPG, and (3) how in-domain encoder-only PLMs compare with encoder-decoder PLMs across varied resource settings. Our findings, derived from extensive experimentation in two domains, reveal that with encoder-only PLMs, although keyphrase extraction (KPE) with Conditional Random Fields slightly excels in identifying present keyphrases, the KPG formulation renders a broader spectrum of keyphrase predictions. Additionally, prefix-LM fine-tuning of encoder-only PLMs emerges as a strong and data-efficient strategy for KPG, outperforming general-domain seq2seq PLMs. We also identify a favorable parameter allocation towards model depth rather than width when employing encoder-decoder architectures initialized with encoder-only PLMs. The study sheds light on the potential of utilizing encoder-only PLMs for advancing KPG systems and provides a groundwork for future KPG methods. Our code and pre-trained checkpoints are released at https://github.com/uclanlp/DeepKPG.
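To make the prefix-LM idea mentioned in the abstract concrete, the sketch below builds the attention mask commonly associated with prefix-LM (UniLM-style) sequence-to-sequence fine-tuning: source tokens attend to each other bidirectionally, while target tokens attend to the full source and only to earlier target positions. This is an illustrative assumption about the masking scheme, not the authors' implementation; the function name `prefix_lm_mask` and its parameters are hypothetical.

```python
def prefix_lm_mask(src_len, tgt_len):
    """Build a prefix-LM attention mask (hypothetical helper, not from the paper).

    mask[i][j] == 1 means position i may attend to position j.
    Positions 0..src_len-1 are the source document (the prefix);
    positions src_len.. are the target keyphrase tokens.
    """
    total = src_len + tgt_len
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < src_len:
                # Every position, source or target, sees the whole source.
                mask[i][j] = 1
            elif i >= src_len and j <= i:
                # Target positions see themselves and earlier target positions.
                mask[i][j] = 1
            # Otherwise the entry stays 0: source never sees target,
            # and a target token never sees future target tokens.
    return mask


if __name__ == "__main__":
    # Two source tokens followed by two target tokens.
    for row in prefix_lm_mask(2, 2):
        print(row)
```

With such a mask, a single encoder-only Transformer can in principle act as both the reader of the document and the left-to-right generator of keyphrases, which is what distinguishes this setup from a separate encoder-decoder pair.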
- Di Wu
- Wasi Uddin Ahmad
- Kai-Wei Chang