Identifying Participants in the Personal Genome Project by Name (A Re-identification Experiment) (1304.7605v1)

Published 29 Apr 2013 in cs.CY

Abstract: We linked names and contact information to publicly available profiles in the Personal Genome Project. These profiles contain medical and genomic information, including details about medications, procedures and diseases, and demographic information, such as date of birth, gender, and postal code. By linking demographics to public records such as voter lists, and mining for names hidden in attached documents, we correctly identified 84 to 97 percent of the profiles for which we provided names. Our ability to learn their names is based on their demographics, not their DNA, thereby revisiting an old vulnerability that could be easily thwarted with minimal loss of research value. So, we propose technical remedies for people to learn about their demographics to make better decisions.

Citations (209)

Summary

The paper demonstrates that linking publicly available demographic details with external voter records can re-identify PGP participants with 84-97% accuracy.
It employs a systematic data matching methodology on 579 profiles to expose vulnerabilities in de-identified genomic databases.
The findings underscore the need for enhanced data protection protocols and revised consent frameworks in genomics research.

An Analysis of Re-Identification Risks in the Personal Genome Project

The paper by Sweeney, Abu, and Winn, "Identifying Participants in the Personal Genome Project by Name," investigates how readily public profiles in the Personal Genome Project (PGP) can be re-identified. The question matters given growing concern over the privacy of genotypic and phenotypic data shared publicly for research. The authors show that, despite de-identification measures, names can be re-associated with PGP profiles using publicly available demographic data. The work revisits a long-known privacy weakness and shows that sharing demographic data continues to expose participants.

Methods and Experiments

The method links the publicly available demographics of PGP participants (date of birth, gender, and ZIP code) to external public records such as voter lists in order to recover participants' names. The authors combined two techniques: matching profiles against a national sample of voter registrations, and mining identifying information embedded in documents attached to the profiles. The experiments covered 579 profiles that contained the demographic identifiers needed for the linkage.
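The linkage step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the field names (`dob`, `gender`, `zip`), the function `link_profiles`, and all records are hypothetical, and the sketch assumes a name is assigned only when exactly one voter record shares a profile's demographics.

```python
from collections import defaultdict

def link_profiles(profiles, voters):
    """Return {profile_id: name} for profiles matching exactly one voter.

    Hypothetical sketch: joins on (date of birth, gender, ZIP code) and
    skips profiles with zero or several candidate voters.
    """
    # Index voter names by their demographic triple.
    index = defaultdict(list)
    for v in voters:
        index[(v["dob"], v["gender"], v["zip"])].append(v["name"])

    linked = {}
    for p in profiles:
        candidates = index.get((p["dob"], p["gender"], p["zip"]), [])
        if len(candidates) == 1:  # ambiguous or missing matches are not linked
            linked[p["id"]] = candidates[0]
    return linked

# Toy data, invented for illustration only.
profiles = [
    {"id": "p1", "dob": "1970-01-02", "gender": "F", "zip": "02138"},
    {"id": "p2", "dob": "1981-03-04", "gender": "M", "zip": "02139"},
    {"id": "p3", "dob": "1990-07-08", "gender": "F", "zip": "02140"},
]
voters = [
    {"name": "Alice Example", "dob": "1970-01-02", "gender": "F", "zip": "02138"},
    {"name": "Bob Sample", "dob": "1981-03-04", "gender": "M", "zip": "02139"},
    {"name": "Rob Sample", "dob": "1981-03-04", "gender": "M", "zip": "02139"},
]

linked = link_profiles(profiles, voters)
print(linked)
```

In this toy run only `p1` is linked: `p2` has two candidate voters and `p3` has none, which shows why a unique demographic combination is what makes a profile re-identifiable.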

Findings

The findings indicate that the names the authors proposed were highly accurate: 84% to 97% were correct, with the range reflecting how potential nickname variations are treated. By applying different re-identification strategies, the researchers attached names to a substantial share of profiles, with public records and voter data accounting for most of the successful matches. In all, 42% of profiles were re-identified through a combination of methods. The authors also noted some discrepancies, which they attribute to data mismatches, a temporal gap, and possibly outsider access to the full datasets.
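To make concrete how one accuracy figure can become a range once nickname variations are considered, the following sketch scores a set of proposed names twice: once requiring an exact match and once after mapping nicknames to a canonical first name. The nickname table, the function names, and all data are hypothetical, and the resulting numbers are toy values, not the paper's.

```python
# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bob": "robert", "liz": "elizabeth"}

def normalize(name, use_nicknames):
    """Lower-case a 'First Last' name, optionally canonicalizing the first name."""
    first, last = name.lower().split()  # assumes exactly two tokens
    if use_nicknames:
        first = NICKNAMES.get(first, first)
    return first, last

def accuracy(proposed, truth, use_nicknames=False):
    """Fraction of proposed names that match the true name for the same profile."""
    correct = sum(
        normalize(name, use_nicknames) == normalize(truth[pid], use_nicknames)
        for pid, name in proposed.items()
    )
    return correct / len(proposed)

# Toy data, invented for illustration only.
proposed = {"p1": "Alice Example", "p2": "Bob Sample",
            "p3": "Carol Test", "p4": "Dan Wrong"}
truth = {"p1": "Alice Example", "p2": "Robert Sample",
         "p3": "Carol Test", "p4": "Erin Right"}

strict = accuracy(proposed, truth)
tolerant = accuracy(proposed, truth, use_nicknames=True)
print(strict, tolerant)
```

Here the strict score counts "Bob Sample" versus "Robert Sample" as a miss, while the nickname-tolerant score counts it as a hit, so the two scores bracket a range.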

Implications

The paper has significant implications for privacy in genomics research. The ability to re-identify individuals shows that current anonymization techniques are inadequate against advancing data linkage capabilities, and it raises concerns about misuse of sensitive genetic and medical information beyond the research uses to which participants consented. Because the attack relies on demographic data rather than on genomic data, the paper highlights a longstanding privacy vulnerability, echoing earlier work by Sweeney and others. Data custodians and policy makers need to reassess protocols and consent frameworks to mitigate the identified risks.

Practical Suggestions and Future Work

In light of these findings, the authors propose several practical measures that PGP participants can take to protect their identity, such as altering the demographic data they explicitly share. They suggest technical solutions such as a continuity of care record editor that limits specific identifiers, thereby reducing re-identification risk. They also developed a web tool that lets individuals assess how unique their demographics are.
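The idea of assessing demographic uniqueness, and of reducing it by sharing coarser values, can be sketched as follows. This is an illustrative assumption about how such a check might work, not a description of the authors' web tool: the function `unique_fraction`, the key functions, and the toy population are all hypothetical.

```python
from collections import Counter

def unique_fraction(records, key):
    """Fraction of records whose key value is shared by no other record."""
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)

def full_key(r):
    # Full date of birth, gender, and 5-digit ZIP code.
    return (r["dob"], r["gender"], r["zip"])

def coarse_key(r):
    # Generalized: birth year only and the first three ZIP digits.
    return (r["dob"][:4], r["gender"], r["zip"][:3])

# Toy population, invented for illustration only.
population = [
    {"dob": "1970-01-02", "gender": "F", "zip": "02138"},
    {"dob": "1970-05-09", "gender": "F", "zip": "02139"},
    {"dob": "1981-03-04", "gender": "M", "zip": "02138"},
    {"dob": "1981-03-04", "gender": "M", "zip": "02138"},
]

full_unique = unique_fraction(population, full_key)
coarse_unique = unique_fraction(population, coarse_key)
print(full_unique, coarse_unique)
```

In this toy population, half of the records are unique on the full triple, and none are unique once birth date is reduced to a year and ZIP code to three digits. Real populations would behave differently, so the numbers only illustrate the mechanism.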

Future research directions might explore more robust de-identification methodologies that consider the growing sophistication of data linkages enabled by technology. Additionally, the exploration of broader policy frameworks addressing potential privacy breaches could prove beneficial, ensuring genotype-phenotype research advances without compromising personal privacy.

Conclusion

The paper critically examines privacy in the Personal Genome Project by combining long-established and contemporary re-identification methods. Its results stress that data protection strategies must keep evolving alongside advances in data science. The balance between data utility for genomics research and individual privacy rights remains delicate and requires ongoing attention to protect participant data against emerging threats.