
Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms (1908.08619v4)

Published 22 Aug 2019 in cs.LG and stat.ML

Abstract: Given a data set $\mathcal{D}$ containing millions of data points and a data consumer who is willing to pay \$$X$ to train an ML model over $\mathcal{D}$, how should we distribute this \$$X$ to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all $N$ data points, it requires $O(2^N)$ model evaluations for exact computation and $O(N\log N)$ for $(\epsilon, \delta)$-approximation. In this paper, we focus on one popular family of ML models relying on $K$-nearest neighbors ($K$NN). The most surprising result is that for unweighted $K$NN classifiers and regressors, the Shapley value of all $N$ data points can be computed, exactly, in $O(N\log N)$ time -- an exponential improvement on computational complexity! Moreover, for $(\epsilon, \delta)$-approximation, we are able to develop an algorithm based on Locality Sensitive Hashing (LSH) with only sublinear complexity $O(N^{h(\epsilon,K)}\log N)$ when $\epsilon$ is not too small and $K$ is not too large. We empirically evaluate our algorithms on up to $10$ million data points and even our exact algorithm is up to three orders of magnitude faster than the baseline approximation algorithm. The LSH-based approximation algorithm can accelerate the value calculation process even further. We then extend our algorithms to other scenarios such as (1) weighted $K$NN classifiers, (2) different data points are clustered by different data curators, and (3) there are data analysts providing computation who also require proper valuation.

Authors (9)
  1. Ruoxi Jia (88 papers)
  2. David Dao (13 papers)
  3. Boxin Wang (28 papers)
  4. Frances Ann Hubis (3 papers)
  5. Bo Li (1107 papers)
  6. Ce Zhang (215 papers)
  7. Costas J. Spanos (28 papers)
  8. Dawn Song (229 papers)
  9. Nezihe Merve Gurel (2 papers)
Citations (177)

Summary

Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms

The paper "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms" addresses the computational challenges of quantifying the relative value of data via the Shapley value, focusing specifically on $K$-nearest neighbor ($K$NN) models. The authors propose a series of efficient algorithms to compute the Shapley value, a well-regarded method for the fair distribution of gains in cooperative game theory that has traditionally been computationally infeasible for large datasets due to its exponential complexity.

Key Contributions

Exact Shapley Value Computation for $K$NN Classifiers

The authors present a novel approach to exactly calculate Shapley values for unweighted $K$NN classifiers with a computational complexity of $O(N\log N)$, a significant improvement over the $O(2^N)$ model evaluations required by general methods. The method sorts the training data by distance to a test point and exploits that ordering to compute all $N$ Shapley values in a single recursive pass.
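As we read it, the paper's recursion works from the farthest training point toward the nearest: the farthest point's value is its label agreement divided by $N$, and each closer point's value is obtained from its successor by a constant-time update. The sketch below illustrates this for a single test point; the function name, use of Euclidean distance, and NumPy data layout are our illustrative choices, not prescribed by the paper.

```python
import numpy as np

def knn_shapley(X_train, y_train, x_test, y_test, K):
    """Exact Shapley values for an unweighted KNN classifier, one test point.

    Sketch of the O(N log N) recursion described in the paper: sort training
    points by distance to the test point, then sweep from the farthest point
    to the nearest, updating the value in constant time per point.
    """
    N = len(X_train)
    # Sort training points by ascending distance to the test point.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    order = np.argsort(dists)                 # order[0] = nearest neighbor
    match = (y_train[order] == y_test).astype(float)

    s = np.zeros(N)
    # Farthest point: s = 1[label matches] / N
    s[N - 1] = match[N - 1] / N
    # Recurrence from the (i+1)-th nearest neighbor to the i-th:
    # s_i = s_{i+1} + (1[y_i = y] - 1[y_{i+1} = y]) / K * min(K, i) / i
    for i in range(N - 2, -1, -1):            # i is 0-based; distance rank is i+1
        rank = i + 1
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, rank) / rank

    # Undo the sort so values align with the original data order.
    shapley = np.zeros(N)
    shapley[order] = s
    return shapley
```

By efficiency of the Shapley value, the returned values sum to the $K$NN utility of the full dataset (the fraction of the $K$ nearest neighbors whose label matches the test label), which gives a quick sanity check on any implementation.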

Sublinear Approximation via Locality Sensitive Hashing (LSH)

Beyond exact computation, the paper introduces an algorithm that approximates Shapley values for $K$NN classifiers with sublinear complexity using Locality Sensitive Hashing (LSH). Because a training point's Shapley value shrinks with its distance rank from the test point, only a limited number of (approximate) nearest neighbors need to be retrieved, offering substantial computational savings while retaining accurate value estimates under specified error constraints.
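To make the LSH idea concrete, here is a toy random-projection LSH candidate retrieval: points whose sign-pattern under random hyperplanes matches the query's in at least one hash table are returned as neighbor candidates, and only those candidates would then be assigned nontrivial values. This is a generic illustration of the technique, not the paper's tuned construction; the function name and hyperparameters are assumptions.

```python
import numpy as np

def lsh_candidates(X, x_query, n_tables=8, n_bits=6, seed=0):
    """Toy random-projection LSH: return indices of training points that
    collide with the query in at least one of n_tables hash tables.
    Each table hashes a point to n_bits sign bits of random projections."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    candidates = set()
    for _ in range(n_tables):
        planes = rng.standard_normal((n_bits, d))  # one table's hyperplanes
        codes = X @ planes.T > 0                   # (N, n_bits) sign bits
        qcode = planes @ x_query > 0               # query's hash code
        hits = np.all(codes == qcode, axis=1)      # full-code collisions
        candidates.update(np.flatnonzero(hits).tolist())
    return sorted(candidates)
```

Increasing `n_bits` makes each table more selective (fewer, closer candidates), while increasing `n_tables` raises the recall of true near neighbors; the paper's sublinear bound comes from balancing such parameters against the allowed $(\epsilon, \delta)$ error.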

Extensions to Other Scenarios

The paper extends these valuation methodologies to a variety of scenarios:

  1. Weighted $K$NN Models: For weighted $K$NN, the exact computation remains feasible but less practical for large $K$, prompting the need for efficient approximation algorithms.
  2. Multiple Data Points Per Contributor: The algorithms are adapted to consider multiple data points contributed by individual players in the data marketplace, maintaining efficiency while ensuring fairness.
  3. Valuing Computation Contributions: Integration of data analytics and computational contributions into the Shapley value framework allows for equitable valuation of computation resources within this cooperative setup.

Implications and Future Directions

The proposed methods present a breakthrough in scalable cooperative game-theoretic data valuation, particularly applicable in large-scale AI systems where nearest neighbor approaches are prevalent. The ability to accurately compute Shapley values with reduced computational requirements opens up possibilities for more dynamic and fair data marketplaces, motivating further exploration into efficient data valuation for other machine learning models beyond $K$NN.

The authors demonstrate that these methods not only align with theoretical expectations but also coincide with intuitive notions of data value seen in empirical studies. This sets the stage for practical implementations in privacy-preserving data markets and potentially broadens the scope to deep learning models, using $K$NN approaches for surrogate evaluation. Future work will integrate these algorithms into real-world data markets, enhancing transparency and fairness in data-driven economic exchanges.