
LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents (2203.15349v2)

Published 29 Mar 2022 in cs.CL

Abstract: Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. The vast majority of benchmark datasets for this task are from the scientific domain and contain only the document title and abstract. This limits keyphrase extraction (KPE) and keyphrase generation (KPG) algorithms to identifying keyphrases from human-written summaries that are often very short (approximately 8 sentences). This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are found beyond the limited context of the title and abstract. Therefore, we release two extensive corpora mapping the KPs of ~1.3M and ~100K scientific articles to their fully extracted text and additional metadata, including publication venue, year, author, field of study, and citations, to facilitate research on this real-world problem.

Authors (8)
  1. Debanjan Mahata (25 papers)
  2. Navneet Agarwal (2 papers)
  3. Dibya Gautam (1 paper)
  4. Amardeep Kumar (2 papers)
  5. Swapnil Parekh (5 papers)
  6. Yaman Kumar Singla (12 papers)
  7. Anish Acharya (27 papers)
  8. Rajiv Ratn Shah (108 papers)
Citations (10)
