
A framework for constructing a huge name disambiguation dataset: algorithms, visualization and human collaboration (2007.02086v1)

Published 4 Jul 2020 in cs.SI

Abstract: We present a manually labeled author name disambiguation (AND) dataset called WhoisWho, which consists of 399,255 documents and 45,187 distinct authors with 421 ambiguous author names. To label this much AND data with high accuracy, we propose a novel annotation framework in which humans and computers collaborate efficiently and precisely. Within the framework, we also propose an inductive disambiguation model that classifies whether two documents belong to the same author. We evaluate the proposed method and other state-of-the-art disambiguation methods on WhoisWho. The experimental results show that: (1) our model outperforms other disambiguation algorithms on this challenging benchmark; (2) the AND problem remains largely unsolved and requires more in-depth research. We believe that such a large-scale benchmark will bring great value to the author name disambiguation task. We also conduct several experiments showing that our annotation framework helps annotators produce accurate results efficiently and effectively eliminates labeling errors made by human annotators.
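
To make the pairwise formulation in the abstract concrete, the following is a minimal sketch of how same-author decisions on document pairs can be turned into author clusters. It is not the paper's model: the `same_author` rule (Jaccard overlap of co-author sets), its `threshold`, the document fields, and the sample records are all hypothetical stand-ins for a learned classifier and real data, and the union-find grouping is just one simple way to aggregate pairwise decisions.

```python
from itertools import combinations


def same_author(doc_a, doc_b, threshold=0.25):
    """Hypothetical pairwise rule: Jaccard overlap of co-author sets.

    A learned classifier would replace this function in practice.
    """
    a, b = set(doc_a["coauthors"]), set(doc_b["coauthors"])
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def cluster_documents(docs, pair_fn=same_author):
    """Group documents by merging every pair that pair_fn accepts."""
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if pair_fn(docs[i], docs[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(docs[i]["id"])
    return sorted(clusters.values())


# Made-up records sharing one ambiguous author name.
docs = [
    {"id": "d1", "coauthors": ["A. Smith", "B. Lee"]},
    {"id": "d2", "coauthors": ["B. Lee", "C. Wu"]},
    {"id": "d3", "coauthors": ["X. Zhao"]},
]
clusters = cluster_documents(docs)
print(clusters)
```

Because merging is transitive, a single false positive from the pairwise rule can fuse two authors into one cluster, which is one reason pairwise accuracy alone does not settle the disambiguation problem.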

Citations (3)
