Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
129 tokens/sec
GPT-4o
28 tokens/sec
Gemini 2.5 Pro Pro
42 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

d-blink: Distributed End-to-End Bayesian Entity Resolution (1909.06039v3)

Published 13 Sep 2019 in stat.CO, cs.DB, cs.LG, and stat.ML

Abstract: Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing models are severely limited in practice, as standard inference algorithms scale quadratically in the number of records. While scaling can be managed by fitting the model on separate blocks of the data, such a na\"ive approach may induce significant error in the posterior. In this paper, we propose a principled model for scalable Bayesian ER, called "distributed Bayesian linkage" or d-blink, which jointly performs blocking and ER without compromising posterior correctness. Our approach relies on several key ideas, including: (i) an auxiliary variable representation that induces a partition of the entities and records into blocks; (ii) a method for constructing well-balanced blocks based on k-d trees; (iii) a distributed partially-collapsed Gibbs sampler with improved mixing; and (iv) fast algorithms for performing Gibbs updates. Empirical studies on six data sets---including a case study on the 2010 Decennial Census---demonstrate the scalability and effectiveness of our approach.

Citations (19)

Summary

We haven't generated a summary for this paper yet.