Duplicate Detection with Efficient Language Models for Automatic Bibliographic Heterogeneous Data Integration

Published 27 Apr 2015 in cs.DB | (1504.07597v1)

Abstract: We present a new duplicate-detection method for merging different bibliographic record corpora with the help of lexical and social information. As we show, no trivial key is available to delete useless documents. Merging heterogeneous document databases to get a maximum of information can be of interest. In our case we try to build a document corpus about the TOR molecule so as to extract relationships with other gene components from the PubMed and WebOfScience document databases. Our approach builds key fingerprints based on n-grams. We built two document gold standards from this corpus for evaluation. Compared with other well-known deduplication methods, our approach gives the best scores for recall (95%) and precision (100%).
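The abstract does not spell out how the n-gram fingerprints are built, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes character trigrams over a normalized title and a Jaccard-overlap threshold; the function names, the choice of `n = 3`, the `0.8` threshold, and the sample titles are all hypothetical.

```python
import re
import unicodedata


def normalize(title: str) -> str:
    """Lowercase, strip accents, and replace punctuation runs with one space."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def fingerprint(title: str, n: int = 3) -> frozenset:
    """Hypothetical fingerprint: the set of character n-grams of the normalized title."""
    text = normalize(title)
    return frozenset(text[i:i + n] for i in range(len(text) - n + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of n-grams common to both fingerprints (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_duplicate(title_a: str, title_b: str, threshold: float = 0.8) -> bool:
    """Flag two records as duplicates when their fingerprints overlap enough.

    The threshold is an assumed value, not one reported in the paper.
    """
    return jaccard(fingerprint(title_a), fingerprint(title_b)) >= threshold


# Made-up titles as two databases might export the same record.
record_a = "Target of rapamycin (TOR) signaling in growth and metabolism."
record_b = "TARGET OF RAPAMYCIN (TOR) SIGNALING IN GROWTH AND METABOLISM"
record_c = "Efficient language models for duplicate detection"

print(is_duplicate(record_a, record_b))
print(is_duplicate(record_a, record_c))
```

Normalizing before extracting n-grams makes the fingerprint insensitive to case, accents, and punctuation, which are typical sources of mismatch between exports of the same record from different databases. A real pipeline would likely combine several fields (authors, year, journal) rather than the title alone.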


Authors (1)
