Simple Unsupervised Keyphrase Extraction using Sentence Embeddings (1801.04470v3)

Published 13 Jan 2018 in cs.CL

Abstract: Keyphrase extraction is the task of automatically selecting a small set of phrases that best describe a given free text document. Supervised keyphrase extraction requires large amounts of labeled training data and generalizes very poorly outside the domain of the training data. At the same time, unsupervised systems have poor accuracy, and often do not generalize well, as they require the input document to belong to a larger corpus also given as input. Addressing these drawbacks, in this paper, we tackle keyphrase extraction from single documents with EmbedRank: a novel unsupervised method, that leverages sentence embeddings. EmbedRank achieves higher F-scores than graph-based state of the art systems on standard datasets and is suitable for real-time processing of large amounts of Web data. With EmbedRank, we also explicitly increase coverage and diversity among the selected keyphrases by introducing an embedding-based maximal marginal relevance (MMR) for new phrases. A user study including over 200 votes showed that, although reducing the phrases' semantic overlap leads to no gains in F-score, our high diversity selection is preferred by humans.

Authors (5)
  1. Kamil Bennani-Smires (3 papers)
  2. Claudiu Musat (38 papers)
  3. Andreea Hossmann (10 papers)
  4. Michael Baeriswyl (13 papers)
  5. Martin Jaggi (155 papers)
Citations (193)

Summary

The paper presents EmbedRank, an unsupervised keyphrase extraction method built on sentence embeddings. It addresses limitations of both supervised systems and existing unsupervised systems by extracting keyphrases from a single document, with no larger corpus required as input. Unlike traditional graph-based algorithms, EmbedRank uses embeddings to measure how semantically related each candidate phrase is to the document, and therefore how informative it is.

Methodology

EmbedRank operates through several key steps:

  1. Candidate Extraction: The system identifies candidate phrases using part-of-speech patterns, specifically sequences of zero or more adjectives followed by one or more nouns.
  2. Sentence Embeddings: The entire document and each candidate phrase are embedded into the same high-dimensional vector space using Sent2Vec or Doc2Vec, so that semantic relatedness can be measured with a similarity function.
  3. Ranking: Candidates are ranked based on their cosine similarity to the document embedding, selecting those most pertinent to the document context.
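
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tokens arrive already tagged with hypothetical coarse tags "ADJ" and "NOUN", the candidate pattern is read as zero or more adjectives followed by one or more nouns, and the embeddings are toy two-dimensional vectors standing in for Sent2Vec or Doc2Vec output. The function names extract_candidates, cosine, and rank_candidates are made up for this example.

```python
import math


def extract_candidates(tagged_tokens):
    """Collect runs of adjectives followed by one or more nouns.

    tagged_tokens: list of (word, tag) pairs. The tags "ADJ" and "NOUN"
    are hypothetical coarse labels; any other tag ends a candidate.
    """
    candidates, adjectives, nouns = [], [], []

    def flush():
        # Emit the current phrase only if it contains at least one noun.
        if nouns:
            phrase = " ".join(adjectives + nouns)
            if phrase not in candidates:
                candidates.append(phrase)
        adjectives.clear()
        nouns.clear()

    for word, tag in tagged_tokens:
        if tag == "NOUN":
            nouns.append(word)
        elif tag == "ADJ":
            if nouns:  # an adjective after nouns starts a new phrase
                flush()
            adjectives.append(word)
        else:
            flush()
    flush()
    return candidates


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(doc_vec, cand_vecs, top_n=5):
    """Return the top_n (phrase, similarity) pairs, most similar first."""
    scored = [(phrase, cosine(vec, doc_vec)) for phrase, vec in cand_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


tagged = [("unsupervised", "ADJ"), ("keyphrase", "NOUN"), ("extraction", "NOUN"),
          ("uses", "VERB"), ("sentence", "NOUN"), ("embeddings", "NOUN")]
phrases = extract_candidates(tagged)

# Toy vectors; a real system would embed the phrases and the document.
toy_vecs = {phrases[0]: [1.0, 0.0], phrases[1]: [1.0, 1.0]}
ranking = rank_candidates([1.0, 0.0], toy_vecs)
```

In a real pipeline the tagging would come from a part-of-speech tagger and the vectors from a trained embedding model; only the ranking logic carries over unchanged.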

EmbedRank++ adds an embedding-based Maximal Marginal Relevance (MMR) step to increase diversity. Each new keyphrase is chosen by balancing its informativeness (similarity to the document) against its similarity to the keyphrases already selected, which reduces the redundancy found in previous methods.
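
The selection loop can be sketched as follows. This uses the generic greedy MMR formulation (lam times relevance, minus (1 - lam) times the maximum similarity to anything already chosen) on raw cosine similarities; any score normalization the paper applies is omitted, and the function name mmr_select, the parameter lam, and the toy vectors are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(doc_vec, cand_vecs, k=3, lam=0.5):
    """Greedily pick k phrases, trading relevance against redundancy.

    lam=1.0 reduces to plain similarity ranking; smaller values penalize
    candidates that resemble phrases that were already selected.
    """
    remaining = dict(cand_vecs)
    selected = []
    while remaining and len(selected) < k:
        def score(phrase):
            relevance = cosine(remaining[phrase], doc_vec)
            redundancy = max(
                (cosine(remaining[phrase], cand_vecs[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy vectors: two near-duplicate phrases and one distinct phrase.
doc = [1.0, 1.0]
cands = {
    "neural network": [1.0, 0.9],
    "neural networks": [1.0, 0.88],
    "training data": [0.2, 1.0],
}
plain = mmr_select(doc, cands, k=2, lam=1.0)
diverse = mmr_select(doc, cands, k=2, lam=0.5)
```

With lam=1.0 the two near-duplicates fill both slots; with lam=0.5 the second slot goes to the less similar but non-redundant phrase.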

Empirical Evaluation

In the reported experiments, EmbedRank achieves higher F-scores than graph-based approaches on datasets of varying document lengths, such as Inspec and DUC2001. Sent2Vec also proved superior to Doc2Vec in both speed and accuracy.
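
For reference, the F-score used in such comparisons combines precision and recall over the extracted keyphrases. The sketch below matches phrases exactly after lowercasing, which is a simplification chosen for this example (published evaluations often normalize phrases further before matching); the function name f_score and the sample phrases are hypothetical.

```python
def f_score(predicted, gold):
    """Return (precision, recall, F1) for predicted vs. gold keyphrases."""
    pred = {p.lower() for p in predicted}
    ref = {g.lower() for g in gold}
    hits = len(pred & ref)
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Two of the four predictions appear among the five gold keyphrases.
predicted = ["sentence embeddings", "keyphrase extraction", "web data", "user study"]
gold = ["Keyphrase Extraction", "sentence embeddings", "MMR", "diversity", "EmbedRank"]
precision, recall, f1 = f_score(predicted, gold)
```

Here precision is 2/4 and recall is 2/5, so F1 is their harmonic mean, 4/9.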

The paper also reports a user study with over 200 votes in which participants preferred the output of EmbedRank++ for its greater diversity, even though reducing semantic overlap brought no gain in F-score over EmbedRank. This suggests a gap between traditional evaluation metrics and user satisfaction, and highlights the importance of diversity in practical applications.

Implications and Future Work

The insights from this paper have notable implications:

  • Efficiency: EmbedRank’s ability to function independently of a larger corpus and its computational efficiency make it highly suitable for real-time applications such as social media analysis and news article summarization.
  • Usability: Selecting keyphrases by embedding similarity offers a way to reduce over-generation (several near-duplicate phrases for the same concept), which may improve the user experience in text-processing applications.

The research emphasizes the need for evaluation methodologies that align better with human judgment, and points toward evaluation metrics that go beyond F-score.

In summary, EmbedRank offers a practical approach to unsupervised keyphrase extraction for real-world applications and prompts further investigation into evaluation practices in NLP.