Label Differential Privacy via Aggregation (2310.10092v3)
Abstract: In many real-world applications, due to recent developments in the privacy landscape, training data may be aggregated to preserve the privacy of sensitive training labels. In the learning from label proportions (LLP) framework, the dataset is partitioned into bags of feature-vectors, which are available only along with the sum of the labels in each bag. A further restriction, which we call learning from bag aggregates (LBA), is where instead of the individual feature-vectors, only their (possibly weighted) sum per bag is available. We study whether such aggregation techniques can provide privacy guarantees under the notion of label differential privacy (label-DP), previously studied in, e.g., [Chaudhuri-Hsu'11, Ghazi et al.'21, Esfandiari et al.'22]. It is easy to see that naive LBA and LLP do not provide label-DP. Our main result, however, shows that weighted LBA using iid Gaussian weights with $m$ randomly sampled disjoint $k$-sized bags is in fact $(\varepsilon, \delta)$-label-DP for any $\varepsilon > 0$ with $\delta \approx \exp(-\Omega(\sqrt{k}))$, assuming a lower bound on the linear-mse regression loss. Further, the $\ell_2^2$-regressor which minimizes the loss on the aggregated dataset has a loss within a $\left(1 + o(1)\right)$-factor of the optimum on the original dataset with probability $\approx 1 - \exp(-\Omega(m))$. We emphasize that no additive label noise is required. The analogous weighted-LLP does not, however, admit label-DP. Nevertheless, we show that if additive $N(0, 1)$ noise can be added to any constant fraction of the instance labels, then noisy weighted-LLP admits similar label-DP guarantees without assumptions on the dataset, while preserving the utility of Lipschitz-bounded neural mse-regression tasks. Our work is the first to demonstrate that label-DP can be achieved by randomly weighted aggregation for regression tasks, using little or no additive noise.
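Since the abstract describes the two release mechanisms operationally, a minimal NumPy sketch may help fix ideas. Everything below (the function names `weighted_lba`, `fit_aggregated`, `noisy_weighted_llp`, and the `noise_frac` knob) is our illustration of what the abstract describes, not the paper's formal construction.

```python
import numpy as np

def weighted_lba(X, y, k, m, rng=None):
    """Weighted-LBA release, per the abstract: sample m disjoint k-sized
    bags uniformly at random, draw iid N(0, 1) weights, and release only
    the weighted sum of feature-vectors and of labels per bag.
    (Illustrative sketch; not the paper's formal mechanism.)"""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    assert m * k <= n, "need m disjoint bags of size k"
    bags = rng.permutation(n)[: m * k].reshape(m, k)  # disjoint random bags
    W = rng.standard_normal((m, k))                   # iid Gaussian weights
    X_agg = np.einsum("mk,mkd->md", W, X[bags])       # weighted feature sums
    y_agg = np.einsum("mk,mk->m", W, y[bags])         # weighted label sums
    return X_agg, y_agg

def fit_aggregated(X_agg, y_agg):
    """The l2^2-regressor on the aggregated dataset: ordinary least
    squares on the released (X_agg, y_agg) pairs."""
    theta, *_ = np.linalg.lstsq(X_agg, y_agg, rcond=None)
    return theta

def noisy_weighted_llp(X, y, k, m, noise_frac=0.5, rng=None):
    """Noisy weighted-LLP release: feature-vectors stay unaggregated, but
    each bag's labels are released only as a weighted sum, after N(0, 1)
    noise is added to a constant fraction of the labels. noise_frac is an
    assumed knob; the abstract only says 'any constant fraction'."""
    rng = np.random.default_rng(rng)
    n = len(y)
    y_noisy = y.astype(float).copy()
    noisy_idx = rng.permutation(n)[: int(noise_frac * n)]
    y_noisy[noisy_idx] += rng.standard_normal(len(noisy_idx))
    bags = rng.permutation(n)[: m * k].reshape(m, k)
    W = rng.standard_normal((m, k))
    y_agg = np.einsum("mk,mk->m", W, y_noisy[bags])
    return X[bags], W, y_agg
```

Per the abstract, fitting least squares on the released `(X_agg, y_agg)` pairs attains a loss within a $(1 + o(1))$-factor of the optimum on the original dataset with probability $\approx 1 - \exp(-\Omega(m))$.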
- Easy learning from label proportions. arXiv, 2023.
- K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 155–186. JMLR Workshop and Conference Proceedings, 2011.
- Learning from aggregated data: Curated bags versus random bags. arXiv, 2023.
- Cost-based labeling of groups of mass spectra. In Proc. ACM SIGMOD International Conference on Management of Data, pages 167–178, 2004.
- N. de Freitas and H. Kück. Learning about individuals from group statistics. In Proc. UAI, pages 332–339, 2005.
- Weakly supervised classification in high energy physics. Journal of High Energy Physics, 2017(5):1–11, 2017.
- Deep multi-class learning from label proportions. CoRR, abs/1905.12909, 2019.
- C. Dwork. Differential privacy. In Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10–14, 2006, Proceedings, Part II, pages 1–12. Springer, 2006.
- C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
- H. Esfandiari, V. Mirrokni, U. Syed, and S. Vassilvitskii. Label differential privacy via clustering. In International Conference on Artificial Intelligence and Statistics, pages 7055–7075. PMLR, 2022.
- B. Ghazi, N. Golowich, R. Kumar, P. Manurangsi, and C. Zhang. Deep learning with label differential privacy. Advances in Neural Information Processing Systems, 34:27131–27145, 2021.
- B. Ghazi, P. Kamath, R. Kumar, E. Leeman, P. Manurangsi, A. V. Varadarajan, and C. Zhang. Regression with label differential privacy. arXiv preprint arXiv:2212.06074, 2022.
- I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.
- D. Gross and V. Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010.
- Learning Bayesian network classifiers from label proportions. Pattern Recognit., 46(12):3425–3440, 2013.
- W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
- J. Zhang, Y. Wang, and C. Scott. Learning from label proportions by learning with label noise. In Proc. NeurIPS, 2022.
- From group to individual labels using deep features. In Proc. SIGKDD, pages 597–606, 2015.
- Learning from label proportions with generative adversarial networks. In Proc. NeurIPS, pages 7167–7177, 2019.
- Antipodes of label differential privacy: PATE and ALIBI. Advances in Neural Information Processing Systems, 34:6934–6945, 2021.
- Supervised learning by training on aggregate outputs. In Proc. ICDM, pages 252–261. IEEE Computer Society, 2007.
- Domain-agnostic contrastive representations for learning from label proportions. In Proc. CIKM, pages 1542–1551, 2022.
- Challenges and approaches to privacy preserving post-click conversion prediction. arXiv preprint arXiv:2201.12666, 2022.
- (Almost) no label no cry. In Proc. Advances in Neural Information Processing Systems, pages 190–198, 2014.
- Estimating labels from label proportions. J. Mach. Learn. Res., 10:2349–2374, 2009.
- S. Rueping. SVM classifier estimation from group probabilities. In Proc. ICML, pages 911–918, 2010.
- R. Saket. Learnability of linear thresholds from label proportions. In Proc. NeurIPS, 2021.
- R. Saket. Algorithms and hardness for learning linear thresholds from label proportions. In Proc. NeurIPS, 2022.
- On combining bags to better learn from label proportions. In AISTATS, volume 151 of Proceedings of Machine Learning Research, pages 5913–5927. PMLR, 2022.
- C. Scott and J. Zhang. Learning from label proportions: A mutual contamination framework. In Proc. NeurIPS, 2020.
- AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1161–1170, 2019.
- S. Ruggles, S. Flood, R. Goeken, J. Grover, E. Meyer, J. Pacas, and M. Sobek. IPUMS USA: Version 8.0 Extract of 1940 Census for U.S. Census Bureau Disclosure Avoidance Research [dataset]. Minneapolis, MN: IPUMS, 2018.
- M. Tallis and P. Yadav. Reacting to variations in product demand: An application for conversion rate (CR) prediction in sponsored search. In 2018 IEEE International Conference on Big Data (Big Data), pages 1856–1864. IEEE, 2018.
- J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12:389–434, 2012.
- R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
- Using published medical results and non-homogenous data in rule learning. In Proc. International Conference on Machine Learning and Applications and Workshops, volume 2, pages 84–89. IEEE, 2011.
- On learning from label proportions. CoRR, abs/1402.5902, 2014.
- ∝SVM for learning with label proportions. In Proc. ICML, volume 28, pages 504–512, 2013.