
Deep Keyphrase Generation (1704.06879v3)

Published 23 Apr 2017 in cs.CL

Abstract: Keyphrases provide highly condensed information that can be effectively used for understanding, organizing, and retrieving text content. Though previous studies have provided many workable solutions for automated keyphrase extraction, they commonly divided the to-be-summarized content into multiple text chunks, then ranked and selected the most meaningful ones. These approaches could neither identify keyphrases that do not appear in the text, nor capture the real semantic meaning behind the text. We propose a generative model for keyphrase prediction with an encoder-decoder framework, which can effectively overcome the above drawbacks. We name it deep keyphrase generation since it attempts to capture the deep semantic meaning of the content with a deep learning method. Empirical analysis on six datasets demonstrates that our proposed model not only achieves a significant performance boost on extracting keyphrases that appear in the source text, but also can generate absent keyphrases based on the semantic meaning of the text. Code and dataset are available at https://github.com/memray/OpenNMT-kpg-release.

Authors (6)
  1. Rui Meng (55 papers)
  2. Sanqiang Zhao (9 papers)
  3. Shuguang Han (22 papers)
  4. Daqing He (19 papers)
  5. Peter Brusilovsky (15 papers)
  6. Yu Chi (3 papers)
Citations (327)

Summary

An Overview of "Deep Keyphrase Generation"

In the paper "Deep Keyphrase Generation," the authors propose an innovative method for keyphrase prediction, leveraging an encoder-decoder framework grounded in deep learning. The research tackles the limitations of existing keyphrase extraction techniques, which typically focus on extracting phrases that directly appear in the source text. The deep keyphrase generation model, however, is adept at both extracting present keyphrases and generating absent ones by comprehending the broader semantic meaning of the text.

The authors identify and address two primary deficiencies in traditional keyphrase extraction approaches. First, earlier methods are inherently restrictive as they can only extract keyphrases already present in the source text, thereby missing out on semantically relevant phrases that do not match any contiguous subsequence of source words. Second, these methods typically prioritize phrase candidates based on traditional metrics like TF-IDF and PageRank, which are insufficient for capturing the document's semantic content.

To overcome these limitations, the authors employ a generative model based on recurrent neural networks (RNNs) with an encoder-decoder structure that captures the semantic and syntactic features of the text. The encoder compresses the source text into a dense vector representation, while an attention mechanism lets the decoder dynamically focus on the most relevant parts of the input at each generation step. The inclusion of a copying mechanism is particularly notable: it enables the model to handle out-of-vocabulary words by copying tokens directly from the source text, so keyphrase components absent from the predetermined vocabulary can still be produced.
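The mixing of generation and copying can be illustrated with a small sketch. This is a simplified pointer-style mixture for illustration only; the paper's CopyNet-style scoring differs in detail (it scores copy candidates jointly with generation before normalizing), and the gating value `p_gen` here is a hypothetical placeholder for a learned gate.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def copy_augmented_distribution(gen_logits, attn_scores, src_ids, n_extended, p_gen):
    """Mix a generation distribution over the fixed vocabulary with a copy
    distribution over source positions. Source token ids may index beyond
    the fixed vocabulary (out-of-vocabulary words), which is how copying
    lets the model emit words it cannot generate."""
    p_vocab = softmax(gen_logits)   # distribution over the fixed vocabulary
    p_copy = softmax(attn_scores)   # distribution over source positions
    final = [0.0] * n_extended
    for w, p in enumerate(p_vocab):
        final[w] += p_gen * p
    for pos, src_w in enumerate(src_ids):
        final[src_w] += (1.0 - p_gen) * p_copy[pos]
    return final

# Fixed vocabulary of 5 words; the source contains one OOV word with id 5,
# so the extended vocabulary has 6 entries.
dist = copy_augmented_distribution(
    gen_logits=[0.1, 0.2, 0.3, 0.4, 0.0],
    attn_scores=[1.0, 2.0, 0.5],
    src_ids=[2, 5, 0],
    n_extended=6,
    p_gen=0.7,
)
```

Even though the OOV word (id 5) receives zero probability from the generator, the copy term gives it positive mass, and the result remains a valid probability distribution.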

The empirical results span six datasets, showcasing the proposed model's effectiveness over traditional extraction methodologies and, in particular, its improved recall on absent keyphrases. Crucially, the CopyRNN model, which combines attention and copying mechanisms, markedly outperformed unsupervised baselines such as TF-IDF, TextRank, and SingleRank, as well as supervised models like KEA and Maui.

The paper's implications extend both theoretically and practically. Theoretically, it affirms the utility of sequence-to-sequence frameworks in tasks beyond their usual applications such as machine translation, illustrating their potential for generating absent keyphrases that align with the semantic content. Practically, the ability to generate absent keyphrases enhances the quality of information retrieval, provides a more comprehensive basis for summarization, and could improve the indexing process in digital libraries and search engines.

Future research could consider extending the model's application to other domains and text types beyond scientific papers, such as books and online content. Additionally, exploring methods to interrelate and optimize dependencies among multiple target keyphrases could further enhance model output. This research contributes significantly to the domain of natural language processing by promising more accurate and semantically relevant keyphrase prediction.
