Influence-Based Data Valuation
- Influence-based data valuation is a framework that assigns utility scores to individual training examples by measuring their marginal impact on model predictions and loss.
- It combines statistical influence functions, gradient approximations, and cooperative game theory to enable tasks like data pruning, noise detection, and market pricing.
- Recent advances introduce scalable projection methods and privacy-preserving techniques, enhancing its applicability in federated learning and large-scale optimization.
Influence-based data valuation quantifies the utility or impact of each data instance in supervised learning, typically by measuring the marginal effect of individual data points on a model’s predictions, generalization, parameter trajectory, or other outcomes of interest. This paradigm synthesizes tools from statistical influence functions, cooperative game theory (Shapley value), information geometry, and modern large-scale optimization, aiming to assign per-example (or per-batch/client) value scores. These scores underlie core tasks such as pruning, noise detection, data market pricing, federated aggregation, debugging, and privacy-preserving data exchange.
1. Conceptual Foundations of Influence-Based Data Valuation
Central to influence-based data valuation is the principle that datasets are not homogeneous in their contribution to model training or downstream utility. Data instances vary in their informativeness, redundancy, noisiness, or even adversarial potential. Influence quantifies this heterogeneity using a variety of mathematical and operational definitions.
Classic influence functions, grounded in robust statistics, measure the effect of upweighting or removing a data point on the learned parameters or the downstream test loss. For empirical risk minimization $\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, \theta)$, the influence of training point $z_i$ on a held-out point $z_{\text{test}}$ is

$$\mathcal{I}(z_i, z_{\text{test}}) = -\nabla_{\theta} \ell(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{\theta} \ell(z_i, \hat{\theta}),$$

where $H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} \ell(z_i, \hat{\theta})$ is the Hessian of the empirical loss at $\hat{\theta}$. This “first-order” influence can be seen as a local linearization of the leave-one-out retraining effect, and is central to much of the recent literature (Choe et al., 22 May 2024, Yang et al., 4 Dec 2025, Pan et al., 2 Mar 2025, Agarwal et al., 14 Feb 2025, Xu et al., 2020).
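As a concrete illustration, the first-order influence above can be computed directly for a small convex model. The sketch below uses ridge-regularized logistic regression on synthetic data; all names and data are illustrative, not drawn from any cited paper's pipeline.

```python
# Minimal sketch: first-order influence scores for ridge-regularized logistic
# regression on synthetic data. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit the regularized empirical risk by plain gradient descent.
w = np.zeros(d)
for _ in range(2000):
    w -= 0.5 * (X.T @ (sigmoid(X @ w) - y) / n + lam * w)

# Hessian of the regularized empirical loss at the optimum.
p = sigmoid(X @ w)
H = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)

def grad_loss(x, label):
    """Per-example loss gradient (regularizer treated as shared)."""
    return (sigmoid(x @ w) - label) * x

# I(z_i, z_test) = -grad_test^T H^{-1} grad_i; one linear solve serves all i.
x_test, y_test = rng.normal(size=d), 1.0
v = np.linalg.solve(H, grad_loss(x_test, y_test))
influences = np.array([-v @ grad_loss(X[i], y[i]) for i in range(n)])
```

Note the standard trick: the inverse-Hessian-vector product is computed once for the test point and reused across all training examples.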
Extensions and alternatives include leave-one-out complexity measures (e.g., the complexity-gap score (Ki et al., 2023)), geometric measures such as leverage scores (Mendoza-Smith, 3 Nov 2025), linearized future influence kernels (Pan et al., 2 Mar 2025), and distributional influence via MMD-based functionals (Zhu et al., 30 Jun 2025).
2. Core Methodological Variants and Mathematical Formulation
Influence-based data valuation frameworks diverge in their choice of utility function, level of model access/assumptions, and computational tractability. Key methods include:
- Classic Influence Functions: Rely on the Taylor expansion of the parameter optimum and test loss under infinitesimal data perturbation (Choe et al., 22 May 2024, Xu et al., 2020, Agarwal et al., 14 Feb 2025). These typically require (approximate) Hessian inversion.
- Gradient-dot-product Variants: Replace costly second-order terms with first-order approximations; e.g., TracIn, LinFiK, and the LoGra projection use

$$\mathcal{I}(z_i, z_{\text{test}}) \approx \nabla_{\theta} \ell(z_{\text{test}}, \theta)^{\top} \nabla_{\theta} \ell(z_i, \theta),$$

with low-dimensional projections of the per-example gradients to reduce computational overhead (Pan et al., 2 Mar 2025, Deng et al., 13 Aug 2025, Choe et al., 22 May 2024).
- Complexity-based Scores: The complexity-gap (CG) score compares leave-one-out NTK-based complexity (Ki et al., 2023): $\mathrm{CG}(z_i) = C(D) - C(D \setminus \{z_i\})$ with $C(D) = \mathbf{y}^{\top} \Theta^{-1} \mathbf{y}$, where $\Theta$ is the NTK Gram matrix.
- Kernel-based Distributional Influence: KAIROS computes a distributional influence functional for each training point using the MMD between the noisy training distribution and a clean reference set, scoring each point by its first-order effect on this discrepancy (Zhu et al., 30 Jun 2025).
- Federated/Client Influence: In cross-silo or cross-device FL, FedIF and related methods quantify the contribution of each client’s update by its alignment with a public validation gradient (Tang et al., 29 Sep 2025).
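For the gradient-dot-product family, the key computational move is projecting per-example gradients to a small dimension before taking inner products. The hedged numpy sketch below uses synthetic, approximately low-rank "gradients" as stand-ins for real ones, mimicking the structure these methods exploit; it is not any cited method's exact pipeline.

```python
# Sketch of projected gradient-dot-product scoring in the spirit of
# TracIn/LoGra, on synthetic approximately low-rank gradients.
import numpy as np

rng = np.random.default_rng(1)
p, k, n, r = 10_000, 64, 500, 8      # params, projection dim, train size, rank

B = rng.normal(size=(r, p))          # shared low-rank gradient basis
train_grads = rng.normal(size=(n, r)) @ B
test_grad = rng.normal(size=r) @ B

P = rng.normal(size=(k, p)) / np.sqrt(k)   # JL-style random projection

proj_train = train_grads @ P.T       # store (n, k) instead of (n, p)
proj_test = P @ test_grad

scores = proj_train @ proj_test      # approximates g_i . g_test in k dims
exact = train_grads @ test_grad
corr = np.corrcoef(scores, exact)[0, 1]
```

Because the projection preserves inner products up to Johnson-Lindenstrauss distortion, the projected scores track the exact gradient dot products closely while storing only a $k$-dimensional sketch per example.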
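The distributional (MMD-based) view can also be sketched cheaply: the first-order effect of upweighting a training point on $\mathrm{MMD}^2$ between the training distribution and a clean reference set is, up to ranking-irrelevant constants, the difference between its mean kernel similarity to the training set and to the reference set. This is an illustrative reading, not KAIROS's exact estimator.

```python
# Hedged sketch: first-order effect of upweighting a training point z on
# MMD^2(P_train, P_ref), kept only up to ranking-irrelevant constants.
import numpy as np

rng = np.random.default_rng(3)
ref = rng.normal(size=(100, 2))                          # clean reference set
train = np.vstack([rng.normal(size=(90, 2)),
                   rng.normal(loc=5.0, size=(10, 2))])   # last 10 are outliers

def rbf(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Larger score => the point pushes the train distribution away from the
# reference, so outliers/corrupted points rank highest.
scores = rbf(train, train).mean(1) - rbf(train, ref).mean(1)
```

On this toy data the ten shifted points receive clearly higher scores than the in-distribution points, matching the noise-detection use case.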
A summary table of core paradigms:
| Method | Valuation Principle | Main Computation |
|---|---|---|
| Influence Function | First-order test loss change | Inverse-Hessian-vector products (iHVP) |
| LoGra/TracIn/LinFiK | Gradient inner product, projected space | Projected per-example gradient inner products |
| Complexity-Gap (CG) | Leave-one-out NTK complexity increase | Closed-form Schur complement formula |
| KAIROS | Kernel-mean discrepancy (distributional) | Kernel mean differences (no models needed) |
| Leverage Score | Span increase in feature space | Ridge leverage scores via $(X^{\top}X + \lambda I)^{-1}$ |
| For-Value | Hidden-representation + error alignment | Forward pass only, inner products over embeddings |
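The "closed-form Schur complement formula" in the CG row can be checked numerically: under the reading $C(D) = \mathbf{y}^{\top}\Theta^{-1}\mathbf{y}$, the leave-one-out gap equals $(\Theta^{-1}\mathbf{y})_i^2 / (\Theta^{-1})_{ii}$, so no retraining loop is needed. The sketch below uses an RBF Gram matrix as a stand-in for the NTK; these are illustrative assumptions, not Ki et al.'s exact construction.

```python
# Leave-one-out complexity gap with an RBF Gram matrix standing in for the
# NTK; verifies the closed-form Schur-complement identity numerically.
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = rng.normal(size=(n, 3))
y = np.sign(X[:, 0])

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq) + 1e-3 * np.eye(n)   # jitter keeps K well-conditioned

def complexity(idx):
    return y[idx] @ np.linalg.solve(K[np.ix_(idx, idx)], y[idx])

# Naive leave-one-out loop: n kernel solves.
C_full = complexity(np.arange(n))
cg = np.array([C_full - complexity(np.delete(np.arange(n), i))
               for i in range(n)])

# Closed form: CG_i = (K^{-1} y)_i^2 / (K^{-1})_{ii}, one inversion total.
Kinv = np.linalg.inv(K)
alpha = Kinv @ y
cg_closed = alpha ** 2 / np.diag(Kinv)
```

The two computations agree to numerical precision, which is exactly what makes the CG score tractable at dataset scale.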
3. Computational Strategies and Scaling Solutions
Influence-based approaches historically suffered from prohibitive cost: either full inversion of the $p \times p$ Hessian, at $O(p^3)$ time and $O(p^2)$ memory in the parameter count $p$, or repeated retraining (leave-one-out, Shapley value) over combinatorially many subsets of the $n$ training points.
Recent solutions address these bottlenecks:
- Low-rank/Structured Projection: LoGra (Choe et al., 22 May 2024), TIP (Yang et al., 4 Dec 2025), and other scalable variants exploit model structure, storing only projected per-example gradients (e.g., via LoRA adapters). This reduces both memory footprint and compute from the full parameter dimension $p$ to a small projection dimension $k \ll p$ per example, with empirical throughput gains of up to 6,500× at LLM scale.
- Forward-only Algorithms: For-Value (Deng et al., 13 Aug 2025) eliminates backprop entirely by re-expressing influence as weighted hidden-state inner products and prediction-error covariation, broadening feasibility to models with restricted access or frozen weights.
- Neural Approximation/Distillation: Tiny neural networks (NN-CIFT (Agarwal et al., 14 Feb 2025), ALinFiK (Pan et al., 2 Mar 2025)) are trained to regress or classify influence values over large pools, using a small sample of true influence values for supervision, yielding orders-of-magnitude speedups at minimal accuracy cost.
- Closed-form/Streaming Updates: KAIROS (Zhu et al., 30 Jun 2025) admits closed-form batch computation and also supports online incremental updates as new data arrive, with provably bounded ranking error.
- Encrypted/Private Valuation: The Trustworthy Influence Protocol (TIP) (Yang et al., 4 Dec 2025), as well as MPC-based forward influence (Xu et al., 2020), implement the full first-order pipeline under homomorphic encryption or secret sharing, enabling secure third-party appraisal and data sale.
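The forward-only trick has a simple linear-head intuition: for a frozen encoder with a linear classifier, the per-example loss gradient with respect to the head is the outer product of the softmax error and the hidden state, so gradient dot products factor into two cheap inner products computed from forward passes alone. A hedged sketch with synthetic stand-ins (not For-Value's exact score):

```python
# Forward-only influence sketch: with a linear head, grad = err (outer) hidden,
# so g_i . g_test = (h_i . h_test) * (err_i . err_test) -- no backprop needed.
# Hidden states and softmax errors here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(4)
n, h, c = 300, 32, 5            # train size, hidden dim, num classes

H_train = rng.normal(size=(n, h))   # hidden states from a forward pass
E_train = rng.normal(size=(n, c))   # probs - one_hot(label), stand-in
h_test = rng.normal(size=h)
e_test = rng.normal(size=c)

scores = (H_train @ h_test) * (E_train @ e_test)

# Sanity check against explicit outer-product gradients for one example.
g0 = np.outer(E_train[0], H_train[0])
gt = np.outer(e_test, h_test)
check = (g0 * gt).sum()
```

The factored form never materializes the $c \times h$ gradient matrices, which is what makes the method feasible for models with restricted access or frozen weights.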
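To make the secret-sharing idea concrete, the toy below computes a buyer-seller gradient dot product under additive secret sharing with Beaver triples from a trusted dealer, so neither party's vector is revealed in the clear. This is a textbook MPC primitive shown for illustration; TIP's actual protocol and the cited MPC pipeline differ in scope and detail.

```python
# Toy additive secret sharing with Beaver triples (trusted dealer) for a
# private dot product of fixed-point-encoded gradients. Illustrative only.
import numpy as np

Q = 2**31 - 1                    # modulus for additive shares
rng = np.random.default_rng(7)

def share(v):
    s0 = rng.integers(0, Q, size=v.shape)
    return s0, (v - s0) % Q

dim = 8
g_seller = rng.integers(0, 1000, size=dim)   # fixed-point encoded gradient
g_buyer = rng.integers(0, 1000, size=dim)

x0, x1 = share(g_seller)
y0, y1 = share(g_buyer)

# Dealer: random a, b and shares of c = a*b (elementwise).
a = rng.integers(0, Q, size=dim)
b = rng.integers(0, Q, size=dim)
a0, a1 = share(a); b0, b1 = share(b); c0, c1 = share((a * b) % Q)

# Parties open d = x - a and e = y - b; these reveal nothing about x, y alone.
d_open = (x0 - a0 + x1 - a1) % Q
e_open = (y0 - b0 + y1 - b1) % Q

# Local shares of x*y = de + d*b + e*a + c, then sum to dot-product shares.
z0 = ((d_open * e_open) % Q + (d_open * b0) % Q + (e_open * a0) % Q + c0) % Q
z1 = ((d_open * b1) % Q + (e_open * a1) % Q + c1) % Q
result = (z0.sum() % Q + z1.sum() % Q) % Q   # equals the true dot product mod Q
```

Since the true dot product here is far below the modulus, the reconstructed value equals it exactly.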
4. Empirical Performance, Use Cases, and Limitations
A diverse set of empirical results uniformly demonstrates that influence-based scores strongly correlate with the true impact of data on test loss, robustness, or adversarial outcomes:
- Pruning and Compression: CG and influence-based methods enable removal of up to 40% of training samples (CIFAR-10) with negligible performance drop; inverse pruning of high-influence samples severely degrades performance (Ki et al., 2023, Choe et al., 22 May 2024).
- Noise/Mislabeled Data Detection: CG, For-Value, Diff-In, and KAIROS excel at identifying and ranking corrupted or anomalous points; precision of mislabeled point detection approaches 100% in several vision and NLP benchmarks (Ki et al., 2023, Deng et al., 13 Aug 2025, Tan et al., 20 Aug 2025, Zhu et al., 30 Jun 2025).
- Data Market and Pricing: Influence scores reveal heavy-tailed distributions of value (e.g., top 5% of books contributing most positive utility in GPT-2 pretraining (Yang et al., 4 Dec 2025)), challenging flat-rate compensation in data markets.
- Federated Aggregation: Influence-based scores enable robust aggregation in FL under noise/adversary conditions, outperforming Shapley-based schemes (FedIF achieves a 450× aggregation speedup (Tang et al., 29 Sep 2025)).
- Active Learning and Subset Selection: Leverage scores, KAIROS, and influence-based coreset methods consistently exceed random or heuristic batch selection in downstream accuracy, both in convex and neural settings (Mendoza-Smith, 3 Nov 2025, Zhu et al., 30 Jun 2025).
Limitations arise in non-convexity (some methods assume convex losses or two-layer NTK structure, e.g., (Ki et al., 2023)), approximation error under high-order or long-horizon dynamics (addressed by Diff-In (Tan et al., 20 Aug 2025)), and cost of full-gradient or kernel computation in extremely high-dimensional or massive data regimes.
5. Model-Agnostic and Geometric Alternatives
A growing body of work emphasizes valuation without model gradients or even explicit model fitting:
- Geometric Leverage Scores: These assign each sample a share of the dataset’s span or effective dimension in feature space, satisfying dummy, efficiency, and symmetry axioms of Shapley valuation. Ridge-leverage sampling provides guarantees on parameter and risk closeness to full retraining (Mendoza-Smith, 3 Nov 2025).
- Kernel-MMD and Conditional MMD: KAIROS leverages the MMD (and MCMD) functional between a noisy distribution and a clean reference set, yielding closed-form influence scores without model training or gradients. These admit strong theoretical guarantees (symmetry, density separation) and extend naturally to label-conditional shifts (Zhu et al., 30 Jun 2025).
- Nearest-Neighbor Shapley: For KNN models, soft-label KNN-SV computes data Shapley values efficiently with closed-form recursions, and locality-sensitive hashing enables sublinear approximate solutions at scale (Wang et al., 2023).
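Ridge leverage scores from the geometric view are nearly a one-liner: each score is nonnegative, at most one, and the scores sum to the ridge effective dimension of the data, a standard identity. The sketch below shows this identity numerically, as a hedged illustration rather than the cited paper's full pipeline.

```python
# Ridge leverage scores tau_i = x_i^T (X^T X + lam I)^{-1} x_i as per-example
# values; their sum equals the effective dimension sum_j s_j^2/(s_j^2 + lam).
import numpy as np

rng = np.random.default_rng(8)
n, d, lam = 200, 10, 1.0
X = rng.normal(size=(n, d))

Ginv = np.linalg.inv(X.T @ X + lam * np.eye(d))
tau = np.einsum('ij,jk,ik->i', X, Ginv, X)   # one score per row of X

s = np.linalg.svd(X, compute_uv=False)
d_eff = (s ** 2 / (s ** 2 + lam)).sum()      # ridge effective dimension
```

The efficiency-style identity `tau.sum() == d_eff` is what lets these scores distribute a dataset-level quantity fairly across individual examples.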
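For the KNN case, a closed-form recursion makes exact Shapley values linear-time after sorting by distance to the test point. The sketch below implements the hard-label recursion in the style of earlier KNN-SV work; the soft-label variant cited above differs in detail.

```python
# Hard-label KNN Shapley via the closed-form recursion over points sorted by
# distance to the test example (a sketch; the soft-label variant differs).
import numpy as np

def knn_shapley(X_train, y_train, x_test, y_test, K=5):
    N = len(X_train)
    order = np.argsort(((X_train - x_test) ** 2).sum(1))    # nearest first
    match = (y_train[order] == y_test).astype(float)
    s = np.zeros(N)
    s[N - 1] = match[N - 1] / N                             # farthest point
    for j in range(N - 2, -1, -1):                          # recurse inward
        s[j] = s[j + 1] + (match[j] - match[j + 1]) / K * min(K, j + 1) / (j + 1)
    out = np.zeros(N)
    out[order] = s                                          # undo the sort
    return out
```

When every label matches the test label, all examples are interchangeable and each receives exactly $1/N$, as the symmetry and efficiency axioms require.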
6. Privacy-Preserving and Federated Settings
Influence-based valuation methods are increasingly deployed under privacy constraints:
- Homomorphic Encryption and MPC: TIP (Yang et al., 4 Dec 2025) and forward-influence MPC (Xu et al., 2020) show that high-fidelity influence scores can be computed across buyer-seller boundaries in encrypted form, maintaining privacy while enabling trusted transactions.
- Federated Influence: In federated learning, client contributions are measured by the alignment of local update directions and a public reference gradient, combined with normalization and smoothing to resist adversarial and noisy participant updates (Tang et al., 29 Sep 2025).
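The federated alignment idea above reduces to a few lines: score each client's update by its cosine alignment with the descent direction of a public validation gradient, clip negative scores, and normalize into aggregation weights. This is a hedged sketch of the general recipe, not FedIF's exact normalization and smoothing.

```python
# Sketch of influence-weighted federated aggregation: clients whose updates
# ascend the validation loss (negative alignment) get zero weight.
import numpy as np

rng = np.random.default_rng(9)
d, n_clients = 100, 5
val_grad = rng.normal(size=d)               # public validation-loss gradient

# Four honest clients (noisy descent directions) and one adversarial client
# whose update ascends the validation loss.
updates = [-val_grad + 0.3 * rng.normal(size=d) for _ in range(n_clients - 1)]
updates.append(val_grad.copy())

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

infl = np.array([max(cosine(u, -val_grad), 0.0) for u in updates])
weights = infl / infl.sum()
agg = sum(w * u for w, u in zip(weights, updates))   # weighted global update
```

The clipping step is what gives robustness: the adversarial client is assigned zero weight, so the aggregated update stays aligned with the validation descent direction.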
7. Theoretical Guarantees, Axiomatic Properties, and Open Directions
Recent research grounds influence-based scores in formal bounds or axiomatic guarantees:
- Approximation to Leave-One-Out Utility: KAIROS and geometric leverage scores approximate the true leave-one-out ranking within provable error bounds, and achieve fair data pricing via axioms inherited from Shapley theory (Zhu et al., 30 Jun 2025, Mendoza-Smith, 3 Nov 2025).
- Tight Loss Change Bounds: Influence-weighted federated averaging yields provably tighter upper bounds on one-step global loss change compared to uniform aggregation (FedAvg), under Lipschitz and bounded-dissimilarity assumptions (Tang et al., 29 Sep 2025).
- Consistency and Robustness: Diff-In provides a second-order influence approximation with bounded, polynomial error in the number of steps, outperforming first-order dynamic approximations under non-convex training dynamics (Tan et al., 20 Aug 2025).
- Open Challenges: Strong theoretical guarantees in deep, highly non-convex regimes (beyond the two-layer NTK or locally-linear settings) remain a frontier, as do extensions to batch/group influence, continual model updates, robust valuation under distribution shift, and complex privacy/adversary models (Ki et al., 2023, Yang et al., 4 Dec 2025).
In summary, influence-based data valuation now encompasses a spectrum of theoretically grounded, computationally efficient, and empirically validated methods tailored for the demands of large-scale, privacy-sensitive, and heterogeneous ML and data markets.