Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation (2308.15709v2)

Published 30 Aug 2023 in cs.LG, cs.CR, cs.GT, and stat.ML

Abstract: Data valuation aims to quantify the usefulness of individual data sources in training ML models, and is a critical aspect of data-centric ML research. Despite its importance, however, data valuation faces significant yet frequently overlooked privacy challenges. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods available today. We first emphasize the inherent privacy risks of KNN-Shapley, and demonstrate the significant technical difficulties in adapting KNN-Shapley to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly, allowing for straightforward modifications to incorporate a DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves performance comparable to KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.

Authors (5)
  1. Jiachen T. Wang (24 papers)
  2. Yuqing Zhu (34 papers)
  3. Yu-Xiang Wang (124 papers)
  4. Ruoxi Jia (88 papers)
  5. Prateek Mittal (129 papers)
Citations (10)
