
Data Shapley: Equitable Valuation of Data for Machine Learning (1904.02868v2)

Published 5 Apr 2019 in stat.ML, cs.AI, and cs.LG

Abstract: As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on $n$ data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley value uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor.

Citations (688)

Summary

  • The paper introduces a principled, axiomatic framework that quantifies individual data contributions using the Shapley value method to ensure equitable data valuation.
  • It develops efficient approximation techniques, including Truncated Monte Carlo Shapley and Gradient-Based Shapley, that notably reduce computation while maintaining unbiased estimates.
  • Extensive experiments validate the approach by detecting low-quality data and improving model performance in tasks such as domain adaptation and image classification.

The paper presents a rigorous framework for assigning an equitable value to individual training data points in the context of supervised machine learning. It formulates the data valuation problem as one of quantifying the marginal contribution of each datum to the overall performance of a trained predictor. In this formulation, the training set, the learning algorithm, and the performance metric are the three pillars that determine the data’s value. The framework defines the value of a datum via a set of axioms analogous to those used in cooperative game theory, and shows that any valuation satisfying these axioms must take the form of the well‐known Shapley value.

The key aspects are summarized as follows:

  • Equitability Conditions:
    • Null Contribution: If a datum does not improve performance when added to any subset, its value must be zero.
    • Symmetry: If two data points contribute equally to any possible subset, they should receive identical value.
    • Additivity: For performance metrics that decompose additively (or are sums over individual losses), the data’s value should decompose accordingly, i.e., $\phi_i(V+W) = \phi_i(V) + \phi_i(W)$, where $V$ and $W$ are performance scores. Here, $V$ denotes the performance function evaluating the predictor, and $\phi_i$ is the Shapley value of the $i$-th data point.
  • Data Shapley Value Definition:

By mapping the supervised learning problem to a cooperative game, the paper shows that the unique solution compliant with the properties above is

$$\phi_i \;=\; C \sum_{S \subseteq D \setminus \{i\}} \frac{V(S \cup \{i\}) - V(S)}{\binom{n-1}{|S|}},$$

where $C$ is an arbitrary constant, $D$ is the set of $n$ training points, and $V(S)$ is the performance of the predictor trained on subset $S$. This is equivalent to the classical Shapley value formulation (Shapley, 1953, 1988). In this expression, the marginal contribution of data point $i$ across all subsets $S$ of the remaining training data is weighted by combinatorial coefficients.
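To make the formula concrete, here is a minimal, illustrative sketch that computes exact Data Shapley values by enumerating every subset of a tiny training set; this brute-force evaluation is feasible only for very small $n$. The specific choices, a 1-nearest-neighbor classifier, test accuracy as the metric $V$, the random-guess convention for the empty set, and the $C = 1/n$ normalization, are assumptions for illustration rather than details taken from the paper.

```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def performance(train_idx, X, y, X_test, y_test):
    """V(S): held-out accuracy of a model trained on the subset indexed by train_idx."""
    idx = list(train_idx)
    if not idx:
        return 0.5  # convention for the empty set: random-guess baseline on a balanced binary task
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[idx], y[idx])
    return model.score(X_test, y_test)


def exact_data_shapley(X, y, X_test, y_test):
    """Exact Data Shapley values by enumerating every subset (O(2^n); tiny n only)."""
    n = len(X)
    values = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                gain = (performance(S + (i,), X, y, X_test, y_test)
                        - performance(S, X, y, X_test, y_test))
                values[i] += gain / comb(n - 1, size)
    return values / n  # C = 1/n, so the values sum to V(D) - V(empty set)


# Tiny synthetic example: 8 points, 2 features, linearly separable labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_test = rng.normal(size=(200, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print(exact_data_shapley(X, y, X_test, y_test))
```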

  • Computational Considerations and Approximations:
    • Truncated Monte Carlo Shapley (TMC-Shapley): This algorithm randomly samples permutations of the data and computes the incremental performance gain as each datum is added. When the marginal gain falls below a preset performance tolerance (reflecting the intrinsic noise in test-set evaluations), the scan of that permutation is truncated. The approach yields an unbiased estimator that converges to the true data value while drastically reducing computation; a minimal sketch appears immediately after this list.
    • Gradient-Based Shapley (G-Shapley): For differentiable loss functions and training algorithms based on stochastic gradient descent, the method approximates the marginal gain by performing a single gradient update per added datum. This approximation correlates well with the TMC-Shapley estimates, especially in settings where one training epoch suffices to capture the key learning dynamics; a second sketch, following the implications paragraph below, illustrates the idea.
  • Empirical Applications:
    • Detecting Low-Quality or Mislabeled Data: In multiple experiments, including spam detection, image classification (e.g., flower data and Fashion-MNIST), and biomedical applications, data points with negative or negligible Shapley values reliably indicated mislabeled or noisy data. Compared with the leave-one-out (LOO) method, ranking data points by Shapley value allowed corrupted entries to be identified more quickly.
    • Domain Adaptation and Data Acquisition: By computing data value with respect to a target performance metric, the method extends to domain adaptation scenarios. For instance, in adapting an image classifier from a noisy, cheaply acquired dataset to high-quality dermoscopic images, reweighting or removing low-value training data led to substantial performance gains (e.g., increasing accuracy on a skin-lesion task from 29.6% to 37.8%). Similarly, in a gender-detection task, the reweighting strategy improved accuracy from 84.1% to 91.5% on balanced evaluation sets.
    • Group Valuation: The approach is also applied at the group level (e.g., patient cohorts or data collected from different centers in the UK Biobank). Notably, certain centers exhibited negative Shapley values on specific disease-prediction tasks, offering insight into distributional shifts and data-quality differences in a multi-center study.
  • Theoretical and Practical Implications:
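
Before expanding on those implications, here is a minimal sketch of the truncated Monte Carlo estimator summarized in the list above: it samples random permutations, adds points one at a time, and stops scanning a permutation once the running score is within a tolerance of the full-dataset score. The concrete choices, a scikit-learn logistic regression as the model, accuracy as $V$, and a fixed number of permutations instead of a convergence check, are illustrative assumptions and not the authors' reference implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def v_score(idx, X, y, X_test, y_test):
    """V(S): test accuracy of a model trained on the points indexed by idx."""
    idx = np.asarray(idx, dtype=int)
    if len(idx) < 2 or len(np.unique(y[idx])) < 2:
        return 0.5  # random-guess baseline for degenerate training subsets
    model = LogisticRegression(max_iter=200)
    model.fit(X[idx], y[idx])
    return model.score(X_test, y_test)


def tmc_shapley(X, y, X_test, y_test, n_perms=200, tolerance=0.01, seed=0):
    """Truncated Monte Carlo estimate of Data Shapley values.

    Each sampled permutation contributes one marginal gain per data point; the
    scan of a permutation is truncated once the running score is within
    `tolerance` of the score obtained on the full training set.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    full_score = v_score(np.arange(n), X, y, X_test, y_test)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev_score = v_score([], X, y, X_test, y_test)
        for pos, i in enumerate(perm):
            if abs(full_score - prev_score) < tolerance:
                new_score = prev_score  # truncation: remaining marginal gains treated as zero
            else:
                new_score = v_score(perm[: pos + 1], X, y, X_test, y_test)
            values[i] += new_score - prev_score
            prev_score = new_score
    return values / n_perms  # average marginal contribution per permutation
```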

The framework demonstrates that data valuation is inherently context-dependent—relying on the choice of learning algorithm and performance metric—and that data points are not uniformly valuable. This has significant implications in settings where individual compensation for data contribution is under consideration as well as in optimizing model performance by targeted data acquisition. The paper also discusses limitations, such as the inability of this metric to capture intrinsic qualities like privacy or personal association, and the dependency on the supervised learning paradigm.
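
The gradient-based variant described earlier can be sketched in the same permutation framework: instead of retraining on every growing subset, the marginal gain of a point is taken to be the change in test accuracy after a single gradient step on that point alone. The from-scratch logistic-regression model, learning rate, and zero initialization below are illustrative assumptions, not the configuration used in the paper's experiments.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def accuracy(w, X, y):
    """Test accuracy of a linear logistic model with weights w."""
    return float(np.mean((sigmoid(X @ w) > 0.5).astype(int) == y))


def g_shapley(X, y, X_test, y_test, n_perms=200, lr=0.1, seed=0):
    """Gradient-based (G-Shapley-style) estimate of data values.

    For each sampled permutation the model is re-initialized, and the marginal
    gain of a point is the change in test accuracy after one gradient step on
    that point alone (one pass over a permutation is one training epoch).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        w = np.zeros(d)                                # fresh model for every permutation
        prev_score = accuracy(w, X_test, y_test)
        for i in perm:
            grad = (sigmoid(X[i] @ w) - y[i]) * X[i]   # logistic-loss gradient on one point
            w = w - lr * grad                          # single SGD update for the added datum
            new_score = accuracy(w, X_test, y_test)
            values[i] += new_score - prev_score
            prev_score = new_score
    return values / n_perms
```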

Overall, the paper establishes a principled method for quantifying the value of data using the Shapley value concept, provides efficient algorithms for practical computation, and validates the approach with extensive experiments that report strong numerical correlations (e.g., high Pearson correlations between approximate and true Shapley values in synthetic settings, and notable improvements in domain adaptation tasks). The methodology and experimental results offer a robust, interpretable tool for data valuation in machine learning contexts.
