Mechanisms for Hiding Sensitive Genotypes with Information-Theoretic Privacy (2007.05139v4)

Published 10 Jul 2020 in cs.IT, cs.LG, and math.IT

Abstract: Motivated by the growing availability of personal genomics services, we study an information-theoretic privacy problem that arises when sharing genomic data: a user wants to share his or her genome sequence while keeping the genotypes at certain positions hidden, which could otherwise reveal critical health-related information. A straightforward solution of erasing (masking) the chosen genotypes does not ensure privacy, because the correlation between nearby positions can leak the masked genotypes. We introduce an erasure-based privacy mechanism with perfect information-theoretic privacy, whereby the released sequence is statistically independent of the sensitive genotypes. Our mechanism can be interpreted as a locally-optimal greedy algorithm for a given processing order of sequence positions, where utility is measured by the number of positions released without erasure. We show that finding an optimal order is NP-hard in general and provide an upper bound on the optimal utility. For sequences from hidden Markov models, a standard modeling approach in genetics, we propose an efficient algorithmic implementation of our mechanism with complexity polynomial in sequence length. Moreover, we illustrate the robustness of the mechanism by bounding the privacy leakage from erroneous prior distributions. Our work is a step towards more rigorous control of privacy in genomic data sharing.

Summary

The paper presents an erasure-based privacy mechanism that guarantees zero information leakage by decoupling released data from sensitive genotypes.
It employs a locally-optimal greedy algorithm, for a given processing order of sequence positions, to maximize the number of positions released without erasure; finding the optimal order is NP-hard in general.
Simulation results show lower erasure rates than window-based erasure while preserving perfect privacy, offering practical insights for secure genomic data sharing in personalized medicine.

In this paper, the authors address the critical issue of genomic privacy that arises with the increasing availability of personal genomics services. Specifically, they explore methods for allowing individuals to share their genomic data while ensuring the privacy of sensitive genotypes that could reveal critical health-related information. The presented work investigates the information-theoretic privacy problem through the development of an erasure-based privacy mechanism that offers perfect information-theoretic privacy by ensuring the released genomic sequence remains statistically independent of the sensitive genotypes.
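To make the leakage problem concrete, the following is a minimal sketch rather than anything taken from the paper. It assumes a toy model, a stationary binary symmetric Markov chain over three positions X1 -> X2 -> X3, in which X2 is the sensitive genotype and naive masking releases X1 and X3 unchanged. The function names and the flip probability are illustrative assumptions. Perfect privacy in the sense described above would require the mutual information between X2 and the released values to be zero.

```python
import itertools
import math


def mutual_information(joint):
    """Return I(A;B) in bits for a joint distribution {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def naive_masking_joint(flip):
    """Joint of (X2, (X1, X3)) for a toy chain X1 -> X2 -> X3.

    Assumption: binary states, uniform start, and each step flips the
    state with probability `flip`. Only X2 is masked.
    """
    joint = {}
    for x1, x2, x3 in itertools.product((0, 1), repeat=3):
        p = 0.5
        p *= flip if x2 != x1 else 1.0 - flip
        p *= flip if x3 != x2 else 1.0 - flip
        joint[(x2, (x1, x3))] = p
    return joint


# Strongly correlated neighbors: masking X2 alone still leaks information.
leak_correlated = mutual_information(naive_masking_joint(0.1))
# Independent positions (flip = 0.5): masking X2 leaks nothing.
leak_independent = mutual_information(naive_masking_joint(0.5))

print(f"leakage with correlated neighbors:  {leak_correlated:.3f} bits")
print(f"leakage with independent positions: {leak_independent:.3f} bits")
```

In this toy model the leakage vanishes only when the positions are independent, which is why masking the sensitive position alone is not enough once nearby positions are correlated.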

The proposed mechanism can be interpreted as a locally-optimal greedy algorithm that handles sequence positions one at a time in a given processing order. Utility is measured by the number of positions released without erasure. The authors show that finding an optimal processing order is NP-hard in general. Nonetheless, they provide an upper bound on the optimal utility and propose an efficient implementation of the mechanism for sequences modeled by hidden Markov models (HMMs), a standard modeling approach in genetics. The complexity of this implementation is polynomial in the sequence length, which makes it feasible for practical applications.

The authors also examine robustness by bounding the privacy leakage that results when the mechanism is run with an erroneous prior distribution. When the prior is correct, the released sequence is statistically independent of the sensitive genotypes, so the mutual information between the two is zero and the mechanism achieves perfect privacy within the information-theoretic model. This is a notable step for genomic privacy, particularly compared with prior methods that either guard privacy insufficiently or require excessive data erasure.

The findings have implications for both the theory and the practice of genomic data sharing. From a theoretical standpoint, the method helps lay a rigorous foundation for privacy-preserving frameworks in genomics. Practically, its efficient algorithm may enable secure data sharing in personalized medicine and population-level studies without compromising individual privacy.

In simulations, the mechanism maintains a lower erasure rate than window-based erasure while ensuring perfect privacy. Window-based methods, by contrast, are inadequate: they leak a nontrivial amount of information and also incur greater data loss.
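Why a fixed erasure window still leaks can be illustrated with a small calculation. This is a hedged sketch under assumptions that do not come from the paper's simulations: a stationary binary symmetric Markov chain with per-step flip probability `flip`, a sensitive position in the interior of the sequence, and a window that erases `radius` positions on each side. By the Markov property, the nearest released position on each side carries all the information the released sequence has about the sensitive position, so only those two need to be considered. The function names are hypothetical.

```python
import math


def k_step_flip(flip, k):
    """Probability that a binary symmetric chain differs after k steps."""
    return 0.5 * (1.0 - (1.0 - 2.0 * flip) ** k)


def window_leakage(flip, radius):
    """Mutual information (bits) between the sensitive position and the
    nearest released neighbors when `radius` positions are erased on
    each side of it (toy model, see the assumptions in the lead-in)."""
    fk = k_step_flip(flip, radius + 1)
    joint = {}
    p_neighbors = {}
    for xs in (0, 1):
        for xl in (0, 1):
            for xr in (0, 1):
                # Left and right neighbors are conditionally independent
                # given the sensitive position.
                p = 0.5
                p *= fk if xl != xs else 1.0 - fk
                p *= fk if xr != xs else 1.0 - fk
                joint[(xs, xl, xr)] = p
                p_neighbors[(xl, xr)] = p_neighbors.get((xl, xr), 0.0) + p
    return sum(
        p * math.log2(p / (0.5 * p_neighbors[(xl, xr)]))
        for (xs, xl, xr), p in joint.items()
        if p > 0
    )


# Leakage for window radii 0..4 with strongly correlated neighbors.
leakages = [window_leakage(0.1, radius) for radius in range(5)]
for radius, leak in enumerate(leakages):
    print(f"radius {radius}: erases {2 * radius + 1} positions, leaks {leak:.4f} bits")
```

In this toy chain, widening the window erases more positions and shrinks the leakage, but the leakage never reaches zero for any finite window. That matches the trade-off described above: nonzero information leakage combined with increased data loss.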

Looking forward, the challenge lies in extending the scope of this privacy mechanism to accommodate correlations more complex than those captured by standard HMMs, while keeping the computation scalable. Because finding the optimal processing order is NP-hard, approximation algorithms that balance optimality with computational tractability remain an open direction. Additionally, exploring models that provide differential privacy guarantees may offer more nuanced privacy-utility trade-offs for users.

In conclusion, this paper formulates and solves the genotype-hiding problem in genomics, applying information-theoretic principles to privacy challenges in a rapidly growing domain. Rigorous privacy mechanisms of this kind will become increasingly important as personal genomics continues to intersect with healthcare and research.
