
Neural Dynamic Data Valuation

Updated 16 November 2025
  • NDDV is a dynamic framework that assigns scalable, per-sample values using optimal control principles integrated into neural network training.
  • It employs neural surrogate models and adjoint-based methods to efficiently approximate influence scores and capture data utility.
  • The framework enhances model performance and fairness through bilevel optimization and re-weighting strategies, addressing computational scaling and data heterogeneity.

Neural Dynamic Data Valuation (NDDV) is a methodological framework for quantifying the importance of individual data points to the performance and generalization of machine learning models, particularly neural networks. Distinct from classical retraining-based or marginal-contribution methods, NDDV frames data valuation as a dynamic process embedded within model training, formulated either as an optimal control problem or through neural surrogate models, allowing scalable, efficient, and theoretically grounded assignment of per-sample values. This approach leverages the sensitivity of model states, adjoint dynamics, or learned surrogate functions for data valuation, and has been instantiated in various neural architectures, learning pipelines, and data selection tasks.

1. Mathematical Foundations and Optimal Control Formulation

NDDV recasts data valuation as an optimal control problem in continuous or discrete time, establishing a dynamics-based framework for assigning value to each data point. The evolution of the state of each data point $i$ is governed by a controlled stochastic differential equation:

$$dX_{i,t} = b(X_{i,t}, \mu_t, \psi_{i,t})\,dt + \sigma\,dW_{i,t},$$

where $X_{i,t}$ is the state, $\psi_{i,t}$ is a control parameter (interpreted as a data weight), $\mu_t$ is the empirical population mean, and $W_{i,t}$ denotes Brownian motion. The typical linear-quadratic drift is $b(X_{i,t}, \mu_t, \psi_{i,t}) = a(\mu_t - X_{i,t}) + \psi_{i,t}$ with $a > 0$.

The NDDV objective seeks to minimize the population-averaged cost

$$L(\psi) = \mathbb{E}\left[ \int_0^T R(X_t, \mu_t, \psi_t)\,dt + \mathcal{V}(\Phi(X_T,\mu_T);\theta)\,\Phi(X_T, \mu_T) \right],$$

where $R$ is the running cost, $\Phi$ the terminal cost, and $\mathcal{V}(\cdot;\theta)$ is a learnable meta-weighting network used to promote fairness and handle data heterogeneity.

The stochastic maximum principle yields a coupled forward-backward SDE system:

  • Forward: integrates state trajectories.
  • Backward: computes adjoints $Y_{i,t}$ via

$$dY_{i,t} = -\nabla_x \mathcal{H}_i\,dt + Z_{i,t}\,dW_{i,t},$$

where the Hamiltonian $\mathcal{H}_i$ captures control, cost, and adjoint interactions.

For each sample, the dynamic utility is defined as $U_i = -X_{i,T} \cdot Y_{i,T}$. The marginal value of data point $i$ is

$$\phi(x_i, y_i; U) = U_i - \frac{1}{N-1} \sum_{j\neq i} U_j,$$

which satisfies properties such as efficiency, symmetry, dummy, additivity, and marginalism (Liang et al., 30 Apr 2024, Liang et al., 9 Nov 2025).
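
As a concrete illustration, the following minimal NumPy sketch computes these quantities, assuming the terminal states $X_{i,T}$ and adjoints $Y_{i,T}$ have already been produced by a forward-backward sweep (the array names are placeholders):

```python
import numpy as np

def dynamic_values(X_T, Y_T):
    """Per-sample utilities U_i = -<X_{i,T}, Y_{i,T}> and marginal values
    phi_i = U_i - mean_{j != i} U_j, given (N, d) terminal states and adjoints."""
    U = -np.sum(X_T * Y_T, axis=1)        # dynamic utilities, shape (N,)
    N = U.shape[0]
    loo_mean = (U.sum() - U) / (N - 1)    # mean utility over the other N-1 samples
    return U - loo_mean                   # marginal values phi(x_i, y_i; U)

# Toy usage with random terminal states and adjoints
rng = np.random.default_rng(0)
print(dynamic_values(rng.normal(size=(5, 3)), rng.normal(size=(5, 3))))
```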

2. Neural Surrogate Architectures and Dynamic Inference

NDDV has been realized as neural surrogate models that learn to approximate expensive influence function computations or importance scores. The NN-CIFT approach ("Neural Networks for effiCient Instruction Fine-Tuning") (Agarwal et al., 14 Feb 2025) employs a compact "InfluenceNetwork" to regress pairwise influence metrics.

InfluenceNetwork Architecture

  • Embedding: Precompute $\mathrm{emb}(z) \in \mathbb{R}^{1024}$ using BGE embeddings.
  • Input: For a pair $(i,j)$, concatenate the two embeddings, yielding $h_0 \in \mathbb{R}^{2048}$.
  • Hidden: Two dense layers of width 100 with ReLU activations.
  • Output: Scalar influence score $\hat\phi = IN_\theta(i,j) \in [0,1]$.
  • Parameters: $\approx 204{,}900$ ($\sim 0.0027\%$ of the size of a 7B LLM).
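
A minimal PyTorch sketch consistent with this description follows; the exact layer arrangement and the sigmoid used to squash the output into $[0,1]$ are assumptions of the sketch, not details taken from the paper:

```python
import torch
import torch.nn as nn

class InfluenceNetwork(nn.Module):
    """Small MLP mapping a concatenated embedding pair to an influence score in [0, 1]."""
    def __init__(self, emb_dim: int = 1024, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),  # 2048 -> 100
            nn.ReLU(),
            nn.Linear(hidden, hidden),       # 100 -> 100
            nn.ReLU(),
            nn.Linear(hidden, 1),            # 100 -> 1
            nn.Sigmoid(),                    # squash to [0, 1] (assumed)
        )

    def forward(self, emb_i: torch.Tensor, emb_j: torch.Tensor) -> torch.Tensor:
        h0 = torch.cat([emb_i, emb_j], dim=-1)   # h_0 in R^{2048}
        return self.net(h0).squeeze(-1)
```

The first $2048 \times 100$ layer accounts for essentially all of the quoted $\approx 204{,}900$ parameters; the exact total depends on how the hidden layers and output head are arranged.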

Training

  • Train on a small fraction $u\%$ of data pairs with "ground-truth" labels from an expensive influence metric (e.g., DELIFT).
  • Minimize mean squared error:

$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{T}|} \sum_{(i,j)\in\mathcal{T}} \left( IN_\theta(i,j) - \phi(i,j) \right)^2 + \lambda \|\theta\|_2^2,$$

with the Adam optimizer, 20 epochs, and a learning rate of $10^{-4}$.

  • Once trained, the network infers influence values for all pairs via efficient forward passes, supporting dynamic, on-the-fly data valuation as new samples arrive (Agarwal et al., 14 Feb 2025).
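
A sketch of this training loop under the same assumptions, with randomly generated placeholder tensors standing in for the labelled pair embeddings and their ground-truth scores, and the $\lambda\|\theta\|_2^2$ term implemented via Adam's weight decay:

```python
import torch

# Placeholders for the u% labelled pairs and their ground-truth scores (e.g., from DELIFT).
M = 256
emb_i, emb_j = torch.randn(M, 1024), torch.randn(M, 1024)
phi_true = torch.rand(M)

model = InfluenceNetwork()                  # as sketched above
optim = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)  # L2 penalty via weight decay
mse = torch.nn.MSELoss()

for epoch in range(20):
    optim.zero_grad()
    pred = model(emb_i, emb_j)              # predicted scores IN_theta(i, j)
    loss = mse(pred, phi_true)              # mean squared error against the expensive metric
    loss.backward()
    optim.step()
```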

3. Data Re-Weighting, Fairness, and Meta-Optimization

To handle heterogeneous data and encourage fairness, NDDV employs a meta-weighting network $\mathcal{V}(\cdot;\theta)$ that re-weights the terminal cost for each sample, adapting the effect of each point on the global objective (Liang et al., 30 Apr 2024, Liang et al., 9 Nov 2025). This forms a bilevel optimization:

  • Inner level: optimize $\psi^*(\theta) = \arg\min_\psi L(\psi;\theta)$.
  • Outer level: update $\theta^* = \arg\min_\theta \ell_{\text{meta}}(\psi^*(\theta))$, where $\ell_{\text{meta}}$ is a holdout or validation loss.

Optimization alternates gradient steps on $\psi$ (data weights/controls) and $\theta$ (meta-network), using forward/backward SDE integrations and backpropagation through the Hamiltonian. This yields adaptive, context-sensitive data values and ensures data points with harmful contributions are down-weighted.
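
A schematic of one such alternating update is sketched below, using a single differentiable inner gradient step in place of the full forward-backward SDE integration; the one-step approximation, learning rates, and function signatures are assumptions:

```python
import torch

def bilevel_step(psi, theta, train_loss_fn, meta_loss_fn, lr_psi=1e-2, lr_theta=1e-3):
    """One alternating update of the controls psi (inner) and meta-network parameters theta (outer)."""
    # Inner level: one gradient step on L(psi; theta), kept differentiable w.r.t. theta.
    inner_loss = train_loss_fn(psi, theta)
    (g_psi,) = torch.autograd.grad(inner_loss, psi, create_graph=True)
    psi_updated = psi - lr_psi * g_psi

    # Outer level: descend the meta/validation loss evaluated at the updated controls.
    meta_loss = meta_loss_fn(psi_updated)
    (g_theta,) = torch.autograd.grad(meta_loss, theta)
    theta_updated = (theta - lr_theta * g_theta).detach().requires_grad_()

    return psi_updated.detach().requires_grad_(), theta_updated
```

In the full method, the inner step corresponds to a forward/backward SDE sweep with a gradient step through the Hamiltonian rather than a single plain gradient step on $\psi$.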

4. Algorithmic Pipelines and Computational Efficiency

NDDV frameworks provide single-pass, unified data valuation, avoiding repeated model retraining common in classical Shapley value or leave-one-out schemes (Liang et al., 30 Apr 2024, Wibiral et al., 5 Dec 2024):

  • Forward step: Integrate state SDE/ODE for minibatches.
  • Backward step: Integrate adjoint updates via reverse-mode autodiff.
  • Gradient updates: Perform SGD/Adam steps on controls and meta-parameters.
  • Valuation: After training, compute $U_i = -X_{i,T} \cdot Y_{i,T}$ to obtain sample values.
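
The sketch below condenses this pipeline for the linear-quadratic dynamics of Section 1, with reverse-mode autodiff playing the role of the backward adjoint integration; the quadratic running and terminal costs, step size, and noise level are assumptions:

```python
import torch

def nddv_sweep(X0, psi, a=1.0, sigma=0.1, dt=0.05):
    """One forward Euler-Maruyama pass plus reverse-mode adjoints for a discretized NDDV objective.

    X0: (N, d) initial states; psi: (steps, N, d) controls with requires_grad=True.
    """
    X = X0
    for t in range(psi.shape[0]):                           # forward step: integrate the state SDE
        mu = X.mean(dim=0, keepdim=True)
        X = X + (a * (mu - X) + psi[t]) * dt + sigma * dt ** 0.5 * torch.randn_like(X)
    mu_T = X.mean(dim=0, keepdim=True)
    running = 0.5 * dt * (psi ** 2).sum(dim=(0, 2)).mean()  # assumed running cost R
    terminal = 0.5 * ((X - mu_T) ** 2).sum(dim=1).mean()    # assumed terminal cost Phi
    loss = running + terminal
    # Backward step: reverse-mode autodiff yields terminal adjoints and control gradients.
    Y_T, g_psi = torch.autograd.grad(loss, [X, psi])
    U = -(X * Y_T).sum(dim=1)                               # valuation: U_i = -<X_{i,T}, Y_{i,T}>
    return U.detach(), g_psi                                # g_psi feeds the SGD/Adam step on psi

# Toy usage
values, grad_psi = nddv_sweep(torch.randn(8, 2), torch.zeros(20, 8, 2, requires_grad=True))
```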

Complexity per epoch is $O(Nd/b)$ for batch size $b$ and feature dimension $d$, with total scaling $O(kN/b)$ over $k$ epochs. NN-CIFT achieves further gains: for a 7–8B LLM, pairwise DELIFT requires $\sim 67{,}000$ s while NN-CIFT needs 215 s, yielding a 77–99% wall-clock speedup for instruction fine-tuning subset selection (Agarwal et al., 14 Feb 2025).

LossVal (Wibiral et al., 5 Dec 2024) embeds per-sample weights $w_i$ directly into the loss:

$$\mathrm{LossVal}(\theta,w) = L_w(\theta,w) \cdot (\mathrm{OT}_w)^2,$$

where $L_w$ is a weighted training loss and $\mathrm{OT}_w$ is a Sinkhorn-regularized optimal transport distance to a validation set. LossVal performs joint gradient descent on both the network parameters and the weights, updating $w$ dynamically to reflect per-sample importance, without retraining loops.
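
A simplified sketch of this objective, using a hand-rolled entropic Sinkhorn iteration as a stand-in for the regularized OT term, a softmax normalization of the weights, and cross-entropy as the base loss; these choices are assumptions rather than the reference implementation:

```python
import torch
import torch.nn.functional as F

def sinkhorn_ot(x_train, x_val, w, eps=0.1, iters=50):
    """Entropy-regularized OT distance between the w-weighted training batch and a uniform validation batch."""
    C = torch.cdist(x_train, x_val, p=2) ** 2               # pairwise squared-Euclidean cost
    a = F.softmax(w, dim=0)                                  # training marginal from learnable sample weights
    b = torch.full((x_val.shape[0],), 1.0 / x_val.shape[0])
    K = torch.exp(-C / eps)
    u = torch.ones_like(a)
    for _ in range(iters):                                   # Sinkhorn fixed-point updates
        v = b / (K.t() @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    P = u.unsqueeze(1) * K * v.unsqueeze(0)                  # transport plan
    return (P * C).sum()

def lossval(logits, targets, w, x_train, x_val):
    """LossVal(theta, w) = L_w(theta, w) * OT_w^2 (sketch only)."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    L_w = (F.softmax(w, dim=0) * per_sample).sum()           # weighted training loss
    return L_w * sinkhorn_ot(x_train, x_val, w) ** 2
```

Both factors depend on $w$, so a single backward pass through this loss updates the model parameters and the per-sample weights jointly.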

5. Theoretical Guarantees: Stability, Error Bounds, and Convergence

Under routine convexity, Lipschitz, and smoothness assumptions, NDDV admits a quadratic loss bound:

$$|L(\psi) - L(\psi')| \leq C\,\frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \|\psi_{i,t} - \psi'_{i,t}\|^2,$$

where $C$ is a universal constant, $N$ the sample size, and $T$ the number of time-discretization steps. This ensures that small weight/control perturbations have only controlled, quadratic effects on the global loss, underpinning the stability of the forward-backward SDE routines (Liang et al., 9 Nov 2025).

Convergence analysis covers both the inner control and meta-optimization:

  • For the training-loss gradients: $\lim_{k\to\infty} \mathbb{E}\|\nabla_\psi L(\psi^k;\theta^{k+1})\|^2 = 0$.
  • For the meta-loss: $O(1/\sqrt{K})$ sublinear convergence to a stationary point, so $K = O(1/\epsilon^2)$ meta-iterations suffice to reach stationarity within accuracy $\epsilon$ (Liang et al., 9 Nov 2025).

The dynamic character—single unified forward–backward sweeps, closed-form sensitivity-based value assignment—enables all data-point values to be computed in one run, a substantial improvement over classical influence or Shapley approaches (Liang et al., 30 Apr 2024, Liang et al., 9 Nov 2025).

6. Empirical Performance and Applications

Comprehensive experiments on tabular, text, and image benchmarks demonstrate that NDDV and its variants surpass classical methods in efficiency, accuracy, and robustness:

  • Corrupted data detection: NDDV achieves 10–20% higher F1 in noisy-label settings compared to KNNShapley, AME, Data-OOB, and influence-based methods (Liang et al., 30 Apr 2024).
  • Subset selection: In instruction fine-tuning for LLMs, NN-CIFT yields only 1.4% average drop in performance metrics (ROUGE, BGE, LAJ) relative to original influence functions, despite dramatic computational speedup (Agarwal et al., 14 Feb 2025).
  • Computational scaling: NDDV scales linearly with dataset size ($N$ up to $10^6$); LossVal is 2–10x faster than Data-OOB and dramatically outpaces retraining-based approaches (Wibiral et al., 5 Dec 2024).
  • Data addition/removal: Removal of high-value points degrades generalization fastest, while addition of high-value points yields the largest performance boost—demonstrating effective capture of data utility (Liang et al., 30 Apr 2024).

7. Limitations, Open Problems, and Future Directions

NDDV's efficacy rests partly on the quality of "ground-truth" influence signals (for surrogate networks) and meta-losses (for re-weighting), as surrogate models will inherit any noise from these sources (Agarwal et al., 14 Feb 2025). Although the method achieves quadratic cost $O(|D_F||D_T|)$ for pairwise scoring, whether fully linear NDDV is possible remains open. LossVal requires access to a clean validation set for the optimal transport computation, and the OT step can become costly for very large $N$ or $J$ (Wibiral et al., 5 Dec 2024).

Future research aims at:

  • Meta-learning influence networks capable of few-shot adaptation to new influence metrics (Agarwal et al., 14 Feb 2025).
  • Continual/online updates to NDDV models as new data arrives.
  • Extending NDDV approaches to other loss families, scalable optimal-transport layers, and task-specific continual learning setups (Wibiral et al., 5 Dec 2024).
  • Investigation of standardized benchmarks for dynamic data valuation.

The convergence guarantees, empirical scalability, and dynamic responsiveness mark NDDV as a prominent class of data valuation methods, with ongoing advances expected in both theoretical and applied aspects across data-centric machine learning.
