Information Leakage in Data Linkage (2505.08596v1)

Published 13 May 2025 in cs.CR and cs.DB

Abstract: The process of linking databases that contain sensitive information about individuals across organisations is an increasingly common requirement in the health and social science research domains, as well as with governments and businesses. To protect personal data, protocols have been developed to limit the leakage of sensitive information. Furthermore, privacy-preserving record linkage (PPRL) techniques have been proposed to conduct linkage on encoded data. While PPRL techniques are now being employed in real-world applications, the focus of PPRL research has been on the technical aspects of linking sensitive data (such as encoding methods and cryptanalysis attacks), but not on organisational challenges when employing such techniques in practice. We analyse what sensitive information can possibly leak, either unintentionally or intentionally, in traditional data linkage as well as PPRL protocols, and what a party that participates in such a protocol can learn from the data it obtains legitimately within the protocol. We also show that PPRL protocols can still result in the unintentional leakage of sensitive information. We provide recommendations to help data custodians and other parties involved in a data linkage project to identify and prevent vulnerabilities and make their project more secure.

Summary

The paper "Information Leakage in Data Linkage" examines the risks of linking databases that contain sensitive personal information. Data linkage is common in health, the social sciences, government, and business, and it must preserve individuals' privacy. Privacy-preserving record linkage (PPRL) protocols have been developed to address these concerns, but research on them has focused on technical aspects (such as encoding methods and cryptanalysis attacks) rather than on the organizational challenges of using them in practice, which are the paper's subject.

Overview of Data Linkage and Privacy-Preserving Techniques

Data linkage is a process used to identify records representing the same entities across different databases. Historically known as record linkage, entity resolution, or duplicate detection, it involves linking data, often without unique identifiers, by relying on quasi-identifiers (QIDs) such as names, addresses, and dates of birth. Privacy concerns have given rise to PPRL techniques, designed to enable linkage without exposing sensitive identifiers or data payloads to external parties.
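To make QID-based matching concrete, here is a minimal sketch that links two toy databases on unencoded QIDs. It is not the paper's method: the field names, the sample records, the 0.85 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def qid_similarity(rec_a, rec_b, fields=("name", "address", "dob")):
    """Average string similarity over the chosen quasi-identifier fields."""
    scores = [
        SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)

def link(db_a, db_b, threshold=0.85):
    """Return (id_a, id_b) pairs whose QID similarity reaches the threshold."""
    matches = []
    for id_a, rec_a in db_a.items():
        for id_b, rec_b in db_b.items():
            if qid_similarity(rec_a, rec_b) >= threshold:
                matches.append((id_a, id_b))
    return matches

# Hypothetical records; neither database holds a shared unique identifier.
hospital = {
    "h1": {"name": "Anna Smith", "address": "12 Hill Road", "dob": "1980-03-14"},
    "h2": {"name": "Raj Patel", "address": "7 Lake Street", "dob": "1975-11-02"},
}
registry = {
    "r1": {"name": "Ana Smith", "address": "12 Hill Rd", "dob": "1980-03-14"},
    "r2": {"name": "Maria Lopez", "address": "3 Park Avenue", "dob": "1990-06-21"},
}

matches = link(hospital, registry)
print(matches)
```

The approximate comparison tolerates the spelling variations in the first pair of records, but whoever runs `link` sees every name, address, and date of birth in plain text. That exposure is what PPRL is meant to avoid.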

The primary objective of PPRL is to conduct record linkage on encoded data so that unauthorized parties cannot access personal details. In practice, implementations still face information leakage, whether intentional or unintentional, and the paper examines what can leak and to whom. The authors show that even with PPRL, sensitive information can leak unintentionally through flaws in linkage protocols and operational practices.
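As an illustration of linking on encoded data, the sketch below uses Bloom-filter encoding of character q-grams, one widely used PPRL encoding. The summary does not say which encodings the paper considers, so treat the technique, the parameters (`num_bits`, `num_hashes`, `q`), and the shared key as assumptions.

```python
import hashlib
import hmac

def qgrams(value, q=2):
    """Split a padded, lowercased string into its set of character q-grams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def bloom_encode(value, secret_key, num_bits=256, num_hashes=4):
    """Return the set of bit positions that `value` sets in a Bloom filter.

    Each q-gram is hashed `num_hashes` times with a keyed hash (HMAC-SHA256),
    so a party without the key cannot recompute the encoding directly.
    """
    positions = set()
    for gram in qgrams(value):
        for k in range(num_hashes):
            digest = hmac.new(secret_key, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions

def dice(bits_a, bits_b):
    """Dice coefficient of two encodings, each a set of bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))

# Hypothetical key agreed between the database owners, not shared with the linker.
key = b"shared-secret-between-database-owners"
enc_a = bloom_encode("Anna Smith", key)
enc_b = bloom_encode("Ana Smith", key)
enc_c = bloom_encode("Maria Lopez", key)

print(round(dice(enc_a, enc_b), 2), round(dice(enc_a, enc_c), 2))
```

The linking party compares only bit positions, never the names. The encodings remain similarity-preserving, though: similar names yield similar bit patterns. That property makes approximate matching possible, and it is also one reason encoded data can still reveal information.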

Organisational and Protocol Analysis

The paper categorizes the parties involved in data linkage protocols into database owners (DOs), linkage units (LUs), data mergers (DMs), data anonymizers (DAs), data users (DUs), and data producers (DPs). It elucidates their roles and potential vulnerabilities in typical data linkage processes. The analysis emphasizes that despite encoded data being used in PPRL, sensitive information related to matched and unmatched records can still be exposed through various communication steps between these parties.

For instance, the paper discusses protocols based on the separation principle, in which DOs send encoded QIDs to the LU for linkage and receive back the identifiers of matched records; each DO then extracts the payload data (PD) of those records and sends it to the DM. A critical insight is that the DOs still learn which of their records were matched, which can itself be sensitive. Conversely, protocols with no data backflow reduce this leakage but increase risk at the DM, which receives the PD of all records, matched and non-matched.
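The trade-off between the two protocol styles can be sketched as a small simulation that records what each party ends up knowing. This is a simplified reading of the summary, not the paper's specification: the message flow in `without_backflow`, the exact-match SHA-256 stand-in for the encoding, and all record contents are assumptions.

```python
import hashlib

def encode(qid):
    """Stand-in encoding (unkeyed SHA-256, exact matching only); an assumption."""
    return hashlib.sha256(qid.lower().encode()).hexdigest()

def lu_link(enc_a, enc_b):
    """Linkage unit: pair record ids whose encoded QIDs are identical."""
    index_b = {code: rid for rid, code in enc_b.items()}
    return [(rid_a, index_b[code]) for rid_a, code in enc_a.items() if code in index_b]

def encoded_view(db):
    """What a database owner sends to the linkage unit: ids and encoded QIDs."""
    return {rid: encode(rec["qid"]) for rid, rec in db.items()}

def with_backflow(db_a, db_b):
    """Separation-principle style: matched ids flow back to the owners."""
    pairs = lu_link(encoded_view(db_a), encoded_view(db_b))
    return {
        "DO_A": {"matched_ids": sorted(a for a, _ in pairs)},
        "DO_B": {"matched_ids": sorted(b for _, b in pairs)},
        # Owners send payload data of matched records only.
        "DM": {"pds_seen": sorted([db_a[a]["pd"] for a, _ in pairs]
                                  + [db_b[b]["pd"] for _, b in pairs])},
    }

def without_backflow(db_a, db_b):
    """No backflow: owners learn nothing, the merger receives every payload."""
    pairs = lu_link(encoded_view(db_a), encoded_view(db_b))  # pairs go to the DM only
    all_pds = [rec["pd"] for rec in db_a.values()] + [rec["pd"] for rec in db_b.values()]
    return {
        "DO_A": {"matched_ids": []},
        "DO_B": {"matched_ids": []},
        "DM": {"pds_seen": sorted(all_pds), "pairs": pairs},
    }

# Hypothetical databases held by two different organisations.
db_a = {
    "a1": {"qid": "Anna Smith 1980-03-14", "pd": "diagnosis: asthma"},
    "a2": {"qid": "Raj Patel 1975-11-02", "pd": "diagnosis: diabetes"},
}
db_b = {
    "b1": {"qid": "anna smith 1980-03-14", "pd": "income: 52000"},
    "b2": {"qid": "Maria Lopez 1990-06-21", "pd": "income: 61000"},
}

backflow = with_backflow(db_a, db_b)
no_backflow = without_backflow(db_a, db_b)
print(backflow["DO_A"], len(no_backflow["DM"]["pds_seen"]))
```

With backflow, owner A learns that record `a1` also appears in owner B's database, which is sensitive whenever membership in that database is sensitive. Without backflow, neither owner learns anything about matches, but the data merger holds the payload of all four records instead of two.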

Implications and Directions for Future Research

The findings bear directly on practical implementations of data linkage protocols: privacy-preserving methods need to address organizational as well as technical vulnerabilities. The paper underscores the need for protocols that further restrict information flow, so that leakage is prevented at every stage of the linkage process.

Future research should focus on fortifying PPRL techniques against both honest-but-curious and potentially malicious parties, ensuring even encrypted and encoded data remains secure during linkage operations. Moreover, incorporating advanced monitoring, access controls, and secure communication channels can mitigate unintentional leakage risks.

Conclusion

In sum, while privacy-preserving techniques in data linkage have made significant strides, the authors highlight critical areas where further work is essential to prevent sensitive information leakage. By detailing the interactions and communication flows among parties in a linkage process, this paper effectively bridges the gap between technical research and actual organizational application, providing valuable insights for securing data linkage activities across administrative, health, social science, and commercial domains.
