- The paper presents a novel deep learning strategy for entity resolution that minimizes the need for extensive labeled datasets through transfer and active learning.
- It employs a transfer learning framework to adapt models from high-resource domains to scenarios with limited annotation, enhancing performance across benchmarks.
- Experimental results demonstrate that the method achieves comparable or superior accuracy while significantly reducing label requirements.
Low-resource Deep Entity Resolution with Transfer and Active Learning
The paper presents a deep learning method for Entity Resolution (ER) tailored to low-resource settings, combining transfer learning with active learning. The approach targets a common limitation of deep ER models: they need substantial labeled data to perform well. ER matters in practice because it reconciles different data representations of the same real-world entity, enabling consistent use of data across databases.
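To make the task concrete, here is a toy record pair and a naive string-similarity matcher. The records, attribute names, and threshold are invented for illustration and are not from the paper, whose models learn these comparisons from labeled pairs rather than hand-coding them.

```python
from difflib import SequenceMatcher

# Two representations of the same (fictional) restaurant, as they might
# appear in two different databases. All values are made up.
record_a = {"name": "Le Bernardin", "address": "155 W. 51st St., New York", "phone": "212-555-1515"}
record_b = {"name": "Bernardin, Le", "address": "155 West 51st Street, NYC", "phone": "212/555-1515"}

def attribute_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] for a single attribute value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def naive_match(r1: dict, r2: dict, threshold: float = 0.5) -> bool:
    """Declare a match if the average attribute similarity clears a threshold.

    Deep ER replaces this hand-set rule with learned attribute representations
    and a trained classifier, which is where the labeled data is needed.
    """
    shared = r1.keys() & r2.keys()
    score = sum(attribute_similarity(r1[k], r2[k]) for k in shared) / len(shared)
    return score >= threshold

print(naive_match(record_a, record_b))  # True: both records denote the same entity
```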
Core Contributions
- Deep Learning-based ER in Low-resource Settings: The authors introduce a method that reduces the need for extensive labeled datasets while matching or surpassing the accuracy of existing deep learning-based ER approaches, achieved by combining transfer and active learning.
- Transfer Learning Framework: Harnesses existing datasets with abundant labels (source data) to build models for scenarios with little labeled data (target data). A dataset-adaptable network architecture lets models transfer what they learn across datasets by exploiting shared attribute structure (see the first sketch after this list).
- Active Learning Strategy: Introduces an active learning scheme that selects a small subset of informative examples for labeling, targeting likely false positives and false negatives so the model adapts quickly to the new dataset (see the second sketch after this list).
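The architecture is only outlined above; one common way to realize this kind of dataset adaptation is adversarial training with a gradient reversal layer, sketched below in PyTorch. The encoder here is a placeholder MLP over pre-computed pair features, and the layer sizes, feature dimension, and loss weighting are assumptions for illustration rather than the paper's exact network.

```python
import torch
import torch.nn as nn

class GradReversal(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AdaptiveERModel(nn.Module):
    """Shared pair encoder + matching classifier + dataset discriminator.

    The discriminator tries to tell source pairs from target pairs; the
    reversed gradient pushes the encoder toward dataset-invariant features,
    so matching knowledge learned on the source transfers to the target.
    """
    def __init__(self, feat_dim=300, hidden=128, lambd=0.1):
        super().__init__()
        self.lambd = lambd
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.matcher = nn.Linear(hidden, 2)        # match / non-match
        self.discriminator = nn.Linear(hidden, 2)  # source / target dataset

    def forward(self, pair_features):
        h = self.encoder(pair_features)
        match_logits = self.matcher(h)
        domain_logits = self.discriminator(GradReversal.apply(h, self.lambd))
        return match_logits, domain_logits

# Training step (sketch): the matching loss uses source labels only;
# the dataset loss uses dataset identity for both source and target pairs.
model = AdaptiveERModel()
loss_fn = nn.CrossEntropyLoss()
src_x, src_y = torch.randn(8, 300), torch.randint(0, 2, (8,))
tgt_x = torch.randn(8, 300)

src_match, src_dom = model(src_x)
_, tgt_dom = model(tgt_x)
loss = (loss_fn(src_match, src_y)
        + loss_fn(src_dom, torch.zeros(8, dtype=torch.long))   # dataset id 0 = source
        + loss_fn(tgt_dom, torch.ones(8, dtype=torch.long)))   # dataset id 1 = target
loss.backward()
```

The single lambd hyperparameter trades off source matching accuracy against dataset invariance; in adversarial adaptation it is often ramped up gradually over training.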
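The bullet above names the selection criterion but not how it is computed; the sketch below shows one plausible instantiation using the model's match probabilities on unlabeled candidate pairs. The probability cutoff at 0.5, the function name select_for_labeling, and the batch size k are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_for_labeling(pair_ids, match_probs, k=10):
    """Pick the unlabeled pairs the current model is least sure about.

    Likely false positives: predicted "match" with probability barely above 0.5.
    Likely false negatives: predicted "non-match" with probability barely below 0.5.
    Labeling these corrects the model where it is most error-prone on the target data.
    """
    probs = np.asarray(match_probs)
    pos = [(p, i) for i, p in zip(pair_ids, probs) if p >= 0.5]
    neg = [(p, i) for i, p in zip(pair_ids, probs) if p < 0.5]
    likely_fp = [i for p, i in sorted(pos, key=lambda t: t[0])[:k]]   # lowest-confidence positives
    likely_fn = [i for p, i in sorted(neg, key=lambda t: -t[0])[:k]]  # highest-probability negatives
    return likely_fp + likely_fn

# Example: six candidate pairs with model match probabilities
ids = ["p1", "p2", "p3", "p4", "p5", "p6"]
probs = [0.98, 0.55, 0.51, 0.49, 0.30, 0.02]
print(select_for_labeling(ids, probs, k=2))  # ['p3', 'p2', 'p4', 'p5']
```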
Experimental Findings and Results
- Evaluation across diverse benchmark datasets shows the method reaches comparable or superior performance to existing deep ER models while using significantly fewer labels.
- Empirically, the combination of dataset adaptation and active learning identifies the most informative examples, cutting annotation effort while preserving high matching accuracy.
Implications and Future Developments
The research addresses a practical obstacle to applying ER in the real world: the limited availability of labeled datasets. With both theoretical and practical implications, this work opens avenues for further exploration. Possible next steps include:
- Scaling across Various Domains: The transfer learning approach can potentially be scaled and tested across varied data domains, not just citation, restaurant, and software categories.
- Integrating Additional Data Sources: Future research could aim at incorporating more diverse sources of unlabeled data to enhance model generalization.
- Cross-lingual Entity Resolution: Expanding this framework to handle multilingual databases could significantly broaden its applicability.
The integration of transfer and active learning showcased in this paper marks a clear step toward resource-efficient ER. Combined with ongoing advances in representation learning, such methods could become standard tools for data reconciliation tasks.