KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation (2303.15422v4)
Abstract: Despite the significant advancements in keyphrase extraction and keyphrase generation methods, the predominant approach for evaluation mainly relies on exact matching with human references. This scheme fails to recognize systems that generate keyphrases semantically equivalent to the references or diverse keyphrases that carry practical utility. To better assess the capability of keyphrase systems, we propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility. For each aspect, we design semantic-based metrics to reflect the evaluation objectives. Meta-evaluation studies demonstrate that our evaluation strategy correlates better with human preferences compared to a range of previously proposed metrics. Using KPEval, we re-evaluate 23 keyphrase systems and discover that (1) established model comparison results have blind-spots especially when considering reference-free evaluation; (2) LLMs are underestimated by prior evaluation works; and (3) there is no single best model that can excel in all the aspects.
- Select, extract and generate: Neural keyphrase generation with layer-wise coverage attention. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1389–1404, Online. Association for Computational Linguistics.
- Hareesh Bahuleyan and Layla El Asri. 2020. Diverse keyphrase generation with neural unlikelihood training. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5271–5287, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Ken Barker and Nadia Cornacchia. 2000. Using noun phrase heads to extract document keyphrases. In Advances in Artificial Intelligence: 13th Biennial Conference of the Canadian Society for Computational Studies of Intelligence, AI 2000 Montéal, Quebec, Canada, May 14–17, 2000 Proceedings 13, pages 40–52. Springer.
- SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
- Simple unsupervised keyphrase extraction using sentence embeddings. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 221–229, Brussels, Belgium. Association for Computational Linguistics.
- Florian Boudin. 2018. Unsupervised keyphrase extraction with multipartite graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 667–672, New Orleans, Louisiana. Association for Computational Linguistics.
- Florian Boudin and Ygor Gallina. 2021. Redefining absent keyphrases and their effect on retrieval effectiveness. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4185–4193, Online. Association for Computational Linguistics.
- Termith-eval: a french standard-based resource for keyphrase extraction evaluation. In LREC-Language Resources and Evaluation Conference.
- Multilingual single document keyword extraction for information retrieval. 2005 International Conference on Natural Language Processing and Knowledge Engineering, pages 517–522.
- Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7):107–117.
- Language models are few-shot learners.
- Neural keyphrase generation via reinforcement learning with adaptive rewards. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2163–2174, Florence, Italy. Association for Computational Linguistics.
- An integrated approach for keyphrase generation via exploring the power of retrieval and extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2846–2856, Minneapolis, Minnesota. Association for Computational Linguistics.
- Exclusive hierarchical decoding for deep keyphrase generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1095–1105, Online. Association for Computational Linguistics.
- A statistical analysis of summarization evaluation metrics using resampling methods. Transactions of the Association for Computational Linguistics, 9:1132–1146.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Unified language model pre-training for natural language understanding and generation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA. Curran Associates Inc.
- SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
- Corina Florescu and Cornelia Caragea. 2017. PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1115, Vancouver, Canada. Association for Computational Linguistics.
- KPTimes: A large-scale dataset for keyphrase generation on news documents. In Proceedings of the 12th International Conference on Natural Language Generation, pages 130–135, Tokyo, Japan. Association for Computational Linguistics.
- SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Anna Glazkova and Dmitry Morozov. 2022. Applying transformer-based text summarization for keyphrase generation. arXiv preprint arXiv:2209.03791.
- Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.
- Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
- Anette Hulth. 2003. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 216–223.
- Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.
- Steve Jones and Mark S. Staveley. 1999. Phrasier: A system for interactive document retrieval using keyphrases. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, page 160–167, New York, NY, USA. Association for Computing Machinery.
- SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.
- Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
- Structure-augmented keyphrase generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2657–2667, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Evaluating n-gram based evaluation metrics for automatic keyphrase extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 572–580, Beijing, China. Coling 2010 Organizing Committee.
- Applying graph-based keyword extraction to document retrieval. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 864–868, Nagoya, Japan. Asian Federation of Natural Language Processing.
- PromptRank: Unsupervised keyphrase extraction using prompt. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9788–9801, Toronto, Canada. Association for Computational Linguistics.
- LipKey: A large-scale news dataset for absent keyphrases generation and abstractive summarization. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3427–3437, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Learning rich representation of keyphrases from text. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 891–906, Seattle, United States. Association for Computational Linguistics.
- Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, page 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Marina Litvak and Mark Last. 2008. Graph-based keyword extraction for single-document summarization. In Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization, pages 17–24, Manchester, UK. Coling 2008 Organizing Committee.
- G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
- Keyphrase generation with fine-grained evaluation-guided reinforcement learning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 497–507, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Chatgpt vs state-of-the-art models: A benchmarking study in keyphrase generation task. arXiv preprint arXiv:2304.14177.
- Yutaka Matsuo and Mitsuru Ishizuka. 2004. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(01):157–169.
- Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707, Online. Association for Computational Linguistics.
- Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 582–592, Vancouver, Canada. Association for Computational Linguistics.
- Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.
- OpenAI. 2023. Gpt-4 technical report. ArXiv, abs/2303.08774.
- Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- Martin F. Porter. 1980. An algorithm for suffix stripping. Program, 40:211–218.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR ’94, pages 232–241, London. Springer London.
- Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
- Keyphrase extraction-based query expansion in digital libraries. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’06, page 202–209, New York, NY, USA. Association for Computing Machinery.
- Hyperbolic relevance matching for neural keyphrase extraction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5710–5720, Seattle, United States. Association for Computational Linguistics.
- Is chatgpt a good keyphrase generator? a preliminary study. arXiv preprint arXiv:2303.13001.
- Importance Estimation from Multiple Perspectives for Keyphrase Extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2726–2736, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Takashi Tomokiyo and Matthew Hurst. 2003. A language model approach to keyphrase extraction. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 33–40, Sapporo, Japan. Association for Computational Linguistics.
- Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Phrase-BERT: Improved phrase embeddings from BERT with an application to corpus exploration. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10837–10851, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisbon, Portugal. Association for Computational Linguistics.
- Kea: Practical automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries, pages 254–255.
- Pre-trained language models for keyphrase generation: A thorough empirical study.
- Rethinking masked language modeling for Chinese spelling correction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10743–10756, Toronto, Canada. Association for Computational Linguistics.
- Chatarena: Multi-agent language game environments for large language models. https://github.com/chatarena/chatarena.
- Open domain web keyphrase extraction beyond language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5175–5184, Hong Kong, China. Association for Computational Linguistics.
- One2Set: Generating diverse keyphrases as a set. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4598–4608, Online. Association for Computational Linguistics.
- Bartscore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems, volume 34, pages 27263–27277. Curran Associates, Inc.
- One size does not fit all: Generating and evaluating variable number of keyphrases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7961–7975, Online. Association for Computational Linguistics.
- Torsten Zesch and Iryna Gurevych. 2009. Approximate matching for evaluating keyphrase extraction. In Proceedings of the International Conference RANLP-2009, pages 484–489, Borovets, Bulgaria. Association for Computational Linguistics.
- Keyphrase extraction using deep recurrent neural networks on Twitter. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 836–845, Austin, Texas. Association for Computational Linguistics.
- Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
- Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.