Marginal Value of Data Quality

Updated 9 November 2025

Marginal Value of Data Quality is defined as the incremental performance gain—such as accuracy or cost reduction—achieved by improving specific data quality metrics.
Key dimensions like completeness, label accuracy, and dataset size are rigorously quantified using derivatives and dual multipliers to guide optimization and decision-making.
Empirical studies reveal diminishing returns and domain-specific thresholds, providing actionable insights for prioritizing investments in data cleaning and quality enhancement.

The marginal value of data quality quantifies the instantaneous improvement in system performance—be it statistical accuracy, economic cost, or operational robustness—resulting from a small increase in one or more aspects of data quality. This concept formalizes how incremental investments or variations along defined data-quality axes (such as label accuracy, completeness, or distributional fidelity) translate into measurable gains for downstream objectives. Recent literature provides rigorous mathematical definitions, sensitivity formulas, and empirical studies across machine learning, optimization, and power systems that reveal sharply diminishing returns, domain-specific priorities, and fundamental limits to the benefit a marginal unit of improved data can deliver.

1. Mathematical Formalization of Data-Quality Marginal Value

Modern frameworks define precise, normalized metrics for each data-quality dimension and model the marginal value as the derivative of the downstream utility with respect to these metrics. For classification datasets, four core dimensions are commonly formalized (He et al., 2019):

Dataset Equilibrium ( $Q_{\mathrm{eq}}$ ): Quantifies label distribution balance, e.g., $Q_{\mathrm{eq}} = 1 - \frac{1}{2N}\sum_{i=1}^C |n_i - \mu|$ with $n_i$ samples per class, $\mu = N/C$ .
Dataset Size ( $Q_{\mathrm{size}}$ ): The fraction of maximal provided samples, e.g., $Q_{\mathrm{size}} = N/N_{\mathrm{max}}$ .
Label Quality ( $Q_{\mathrm{lbl}}$ ): Fraction of accurately labeled examples, $Q_{\mathrm{lbl}} = 1 - p$ , where $p$ is the mislabeling probability.
Contamination ( $Q_{\mathrm{cont}}$ ): One minus the normalized strength of noise or corruption.

The marginal value of data quality (MVQ) for system performance $\alpha$ , e.g., test accuracy, along dimension $Q$ is then given by the partial derivative $\frac{\partial \alpha}{\partial Q}$ evaluated at the current quality level.

Extensions to decision-theoretic and economic settings, such as power system optimization, rigorously embed data-quality (e.g., via Wasserstein-metric ambiguity balls of radius $\epsilon_f$ for each data provider) into the loss/objective and derive closed-form shadow prices $\mu_f = \frac{\partial J^*}{\partial \epsilon_f}$ via dual multipliers (Ghazanfariharandi et al., 19 Jun 2024, Mieth et al., 2023). For pointwise marginal value, influence-function analysis provides approximations for a data point's contribution to model loss: $s_i = g_i^\top H^{-1}g_i$ (Regneri et al., 2019).

2. Marginal Value Functions: Empirical and Theoretical Insights

Empirical studies consistently show that the marginal value of data quality is highly dimension- and regime-dependent, typically following a law of diminishing returns. Representative results for image classification (CIFAR-10) (He et al., 2019):

Dimension	$\partial\alpha/\partial Q$	Comments
Label Quality	+0.07	Steep accuracy cliff near $Q_{\mathrm{lbl}} \approx 0.8$
Dataset Equil.	+0.14	Deleting any class is costly
Dataset Size	+1.35 (at $Q_{\mathrm{size}}\approx0.2$ )	Most valuable at small $N$
Contamination	+0.01	Only marginal benefit for low $\tau$

For end-to-end ML pipelines using tabular data (2207.14529), the average marginal gains to performance when improving data quality (test-time “serving” data) are, for classification:

Data-Quality	Marginal Value $\Delta$ F1/ $\Delta Q$
Completeness	0.82
Feature Accuracy	0.80
Target Accuracy	0.85
Consistency	0.04
Uniqueness	0.03
Class Balance	0.10

These findings direct practitioners to prioritize completeness and accuracy improvements in test data to maximize F1 or regression $R^2$ .

In optimization under distributional ambiguity (multi-source DRO-OPF), the marginal value of improved data quality from provider $f$ is given by $\mu_f = \lambda_f^{\rm co} + \varphi^{\rm vol} \lambda_f^{\rm vol} + \sum_{n \in \text{cluster }f} \varphi_n^{\rm inv} \lambda_n^{\rm inv}$ , which exactly quantifies the cost savings per incremental reduction in ambiguity radius $\epsilon_f$ (Ghazanfariharandi et al., 19 Jun 2024, Mieth et al., 2023).

3. Data-Quality Dimensions and Practical Measurement

Recent work provides explicit definitions, pollution methods, and normalization for major data-quality axes:

Completeness: Fraction of non-missing data, typically induced by random masking.
Feature/Target Accuracy: Proportion of correct or noise-free feature/label entries, e.g., $FAcc = 1 - \text{(incorrect features)}/n$ .
Consistency: Degree of uniquely standardized categorical representation.
Class Balance and Equilibrium: Normalized measures of label skew or imbalance.
Uniqueness: Degree of row duplication.
Contamination: Synthetic noise, e.g., additive Gaussian or salt-and-pepper, quantified by normalized strength.
Distributional Fidelity: Distance (e.g., Wasserstein) between empirical and true distributions, parameterizing ambiguity in DRO formulations.

Pollution and cleaning protocols are carefully designed to allow controlled experiments tracking marginal return as $Q$ is varied from $0$ to $1$, thus facilitating empirical estimation of sensitivity functions $P(Q)$ .

4. Analytical and Algorithmic Frameworks for Marginal Valuation

Marginal value quantification relies on domain-specific methodologies:

Influence Functions and Pointwise Valuation: For a parametric loss $L(\theta; D)$ , removal of data point $x_i$ leads to an influence score $s_i = g_i^\top H^{-1}g_i$ , where $g_i=\nabla_\theta \ell(\theta;x_i)$ and $H$ is the empirical Hessian. $s_i$ directly ranks data for curation or pruning; negative or near-zero $s_i$ indicates redundancy or harm (Regneri et al., 2019).
Distributionally Robust Optimization (DRO): Marginal value of data quality in optimization is recovered as duality-based shadow prices on the Wasserstein-ball radii or ambiguity sets representing data uncertainty (Mieth et al., 2023, Ghazanfariharandi et al., 19 Jun 2024). Dual multipliers provide immediate quantification of welfare or cost gains per unit improvement in $\epsilon_j$ .
Expected Diameter for Data Quality: Data quality can be formalized via the expected diameter $E_D$ —the expected disagreement between hypotheses consistent with the data. Adding high-uncertainty points produces the maximal marginal drop in $E_D$ ; diminishing returns are precisely characterized as $O(1/(k+1))$ per new point (Raviv et al., 2020).
Temporal Decay: When data perish over time, valuation aligns with recency-weighted stock models. The marginal value for increasing data flow (adding "fresh" data) is $-\frac{\partial G_0}{\partial n}$ , where $G_0(n)$ is the test loss function. Adding old or drifted data can become harmful, with negative marginal value once its distribution diverges from the current target (Valavi et al., 2022).

5. Domain-Specific Case Studies and Quantitative Findings

Image Recognition and Classification

Experiments on MNIST and CIFAR-10 (He et al., 2019) show that:

Dataset Size: For $Q_{\mathrm{size}} < 0.3$ , accuracy drops precipitously; marginal gain is highest in low-data regimes ( $\partial\alpha/\partial Q_{\mathrm{size}} \approx 1.35$ at $s=0.2$ ), dropping to negligible levels as $s \to 1$ .
Label Quality: A threshold phenomenon at $Q_{\mathrm{lbl}} \approx 0.8$ produces a "cliff" in accuracy—further improvements in label quality beyond this yield diminishing returns, while dropping below it causes catastrophic failure.

Multi-Task Machine Learning

Analysis of 15 ML algorithms across 9 tabular datasets (2207.14529) quantifies marginal gains. For regression, completeness offers $\Delta R^2/\Delta \text{Completeness} \approx 1.6$ (serving data), while feature accuracy follows at $0.9$. Other axes, such as uniqueness and consistency, are an order of magnitude less impactful.

Data-Driven Optimization

In distributionally robust optimal power flow with multiple heterogeneous data providers (Ghazanfariharandi et al., 19 Jun 2024, Mieth et al., 2023):

Marginal cost savings per unit improvement in source $f$ 's data quality is exactly $\mu_f$ .
Empirical case studies show that as $\epsilon_f$ decreases (i.e., higher quality), cost decreases sharply up to a threshold and then plateaus. Clusters with high PV capacity or electrically remote nodes have the largest $\mu_f$ , indicating where investments in data quality are most effective.

State Estimation and Energy Markets

Grid and market robustness against adversarial data corruption is parameterized by an energy threshold $\epsilon$ for undetectable bad-data vectors (Jia et al., 2012). The marginal value of tightening $\epsilon$ is the local sensitivity of worst-case price perturbation, $MV = \frac{d}{d\epsilon} \Delta\lambda^*(\epsilon)$ .

Temporal Data Perishability

In lived business scenarios, older data's value decays exponentially in drift distance to the current distribution (Valavi et al., 2022). After seven years, the effective value of 100MB of text data drops to approximately 50MB of current data for language modeling. The optimal data stock is where marginal accuracy gain equals marginal cost of data flow; retaining old/outdated data beyond this point may even harm performance.

6. Prioritization, Diminishing Returns, and Operational Guidelines

Unified findings from empirical and theoretical studies produce clear operational principles:

Prioritize completeness and accuracy—marginal value per unit investment is highest for completeness and (feature/target) accuracy, especially in serving data, with up to $1.6$ $R^2$ points per 0.1 improvement (2207.14529).
Focus on small, poorly performing $Q$ —the steepest marginal gains are at the low end of data size and label quality; focus on the most deficient metric for maximal effect (He et al., 2019).
Balance stock and flow for nonstationary data—maximize the flow of recent, relevant data rather than accruing a large, outdated archive (Valavi et al., 2022).
Leverage dual sensitivities—use dual multipliers from DRO formulations to guide investment in data cleaning, acquisition, or privacy relaxation (Mieth et al., 2023, Ghazanfariharandi et al., 19 Jun 2024).
Defer low-priority improvements—uniqueness, representation standardization, and moderate imbalances have minimal marginal effect relative to completeness and accuracy (2207.14529).

Table: Representative Marginal Value Sensitivities (Exemplars)

Domain	Quality Dimension	Marginal Value (MVQ)	Source
Classification	Completeness	$+0.82$ F1 per $\Delta Q$	(2207.14529)
Regression	Completeness	$+1.60$ $R^2$ per $\Delta Q$	(2207.14529)
Classification	Size (CIFAR-10)	$+1.35$ acc. at $s=0.2$	(He et al., 2019)
Power Systems	$\epsilon_f$ (Wass.)	$\mu_f=\lambda_f^{co}+\ldots$	(Ghazanfariharandi et al., 19 Jun 2024)
Data Perishability	Age (text data)	$-0.10$ effective/year	(Valavi et al., 2022)

7. Limitations and Future Directions

Current methodologies assume that quality axes are independent or can be orthogonalized; in practice, interaction effects may exist (e.g., imputed incompleteness and label noise). Most empirical studies focus on tabular or image data; generalization to modalities such as language, graphs, or streaming data remains an active topic (2207.14529). The precise marginal utility may also depend on the ML model's regularization, pipeline stochasticity, and even domain-specific deployment costs.

This suggests further development of adaptive data-quality investment tools, finer-grained quality metrics, and broader cross-domain validation to robustly operationalize marginal value calculations in production systems.