A Submodularity-based Agglomerative Clustering Algorithm for the Privacy Funnel (1901.06629v2)
Abstract: For the privacy funnel (PF) problem, we propose an efficient iterative agglomerative clustering algorithm based on the minimization of the difference of submodular functions (IAC-MDSF). For a data curator that wants to share the data $X$ correlated with the sensitive information $S$, the PF problem is to generate the sanitized data $\hat{X}$ that maintains a specified utility/fidelity threshold on $I(X; \hat{X})$ while minimizing the privacy leakage $I(S; \hat{X})$. Our IAC-MDSF algorithm starts with the original alphabet $\hat{\mathcal{X}} := \mathcal{X}$ and iteratively merges the elements in the current alphabet $\hat{\mathcal{X}}$ that minimizes the Lagrangian function $ I(S;\hat{X}) - \lambda I(X;\hat{X}) $. We prove that the best merge in each iteration of IAC-MDSF can be searched efficiently over all subsets of $\hat{\mathcal{X}}$ by the existing MDSF algorithms. We show that the IAC-MDSF algorithm also applies to the information bottleneck (IB), a dual problem to PF. By varying the value of the Lagrangian multiplier $\lambda$, we obtain the experimental results on a heart disease data set in terms of the Pareto frontier: $ I(S;\hat{X})$ vs. $- I(X;\hat{X})$. We show that our IAC-MDSF algorithm outperforms the existing iterative pairwise merge approaches for both PF and IB and is computationally much less complex.