On the Inflation of KNN-Shapley Value (2405.17489v1)

Published 25 May 2024 in cs.LG and cs.AI

Abstract: Shapley value-based data valuation methods, originating from cooperative game theory, quantify the usefulness of each individual sample by considering its contribution to all possible training subsets. Despite their extensive applications, these methods encounter the challenge of value inflation: while samples with negative Shapley values are detrimental, some with positive values can also be harmful. This challenge prompts two fundamental questions: whether zero is a suitable threshold for distinguishing detrimental from beneficial samples, and how to determine an appropriate threshold. To address these questions, we focus on KNN-Shapley and propose Calibrated KNN-Shapley (CKNN-Shapley), which calibrates zero as the threshold to distinguish detrimental samples from beneficial ones by mitigating the negative effects of small-sized training subsets. Through extensive experiments, we demonstrate the effectiveness of CKNN-Shapley in alleviating data valuation inflation, detecting detrimental samples, and assessing data quality. We also extend our approach beyond conventional classification settings, applying it to diverse and practical scenarios such as learning with mislabeled data, online learning with stream data, and active learning for label annotation.
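
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of the original (uncalibrated) KNN-Shapley recursion for an unweighted KNN classifier and a single test point, as introduced by Jia et al. (2019). It is not the CKNN-Shapley method proposed in this paper, whose calibration details are not given in the abstract. The function name `knn_shapley_single`, its argument names, and the use of Euclidean distance are illustrative assumptions.

```python
from math import dist


def knn_shapley_single(train_x, train_y, test_x, test_y, k):
    """Sketch of exact KNN-Shapley values for ONE test point.

    Assumptions (not taken from the paper): unweighted KNN utility,
    Euclidean distance, and the recursion of Jia et al. (2019).
    Returns one value per training point, in the original order.
    """
    n = len(train_x)
    # Rank training points from nearest to farthest from the test point.
    order = sorted(range(n), key=lambda i: dist(train_x[i], test_x))
    # 1.0 where the training label agrees with the test label, else 0.0.
    match = [1.0 if train_y[i] == test_y else 0.0 for i in order]

    values = [0.0] * n
    # Base case: the farthest training point.
    values[order[-1]] = match[-1] / n
    # Recurse inward, from the second-farthest point to the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of the current point
        step = (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        values[order[pos]] = values[order[pos + 1]] + step
    return values


# Two training points, k = 1: the nearest point carries the wrong label.
print(knn_shapley_single([(0.0,), (1.0,)], [0, 1], (0.0,), 1, k=1))
```

In this toy case the mislabeled nearest neighbor receives a negative value and the correctly labeled farther point a positive one, which is the zero-threshold behavior the abstract takes as its starting point. In practice the per-test-point values are typically averaged over a validation set.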
