
Hierarchical Contrastive Data Valuation (HCDV)

Updated 29 December 2025
  • The paper introduces HCDV, a scalable method that leverages contrastive representation, balanced hierarchical clustering, and Monte Carlo games to approximate Shapley values with controllable error.
  • It overcomes the factorial computational bottleneck by reducing complexity from O(n!) to O(K log n) and supports diverse applications such as data filtering, streaming updates, and fair marketplace payouts.
  • Empirical benchmarks on datasets ranging from synthetic examples to Criteo-1B validate HCDV’s superior runtime, stability, and predictive improvements with strong theoretical guarantees.

Hierarchical Contrastive Data Valuation (HCDV) is a scalable, Shapley-inspired framework for quantifying the value of individual training examples in large, heterogeneous, and geometrically structured datasets. HCDV is designed to address the factorial computational bottleneck and the geometric insensitivity of classical data-Shapley methods by leveraging a contrastively trained representation, a balanced hierarchical clustering, and local Shapley-style valuation via budget-propagated Monte Carlo games. The method provides theoretical guarantees of approximate Shapley axioms with controllable error, achieves superior runtime and stability, and directly supports diverse applications such as data filtering, streaming updates, and marketplace payouts (Xiao et al., 22 Dec 2025).

1. Foundations and Motivation

Classical Data-Shapley assigns a payoff $\phi_i$ to each data point via the formula

$$\phi_i(\mathcal D) = \sum_{S\subseteq\mathcal D\setminus\{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\left[v(S\cup\{i\})-v(S)\right]$$

where $v(\cdot)$ is typically a model performance metric (e.g., accuracy, AUC). This approach is intractable for large $n$ (its cost grows as $O(n!)$) and treats data points independently, ignoring latent geometric structure. Modern datasets often exceed $n > 10^5$, are heterogeneous, and lie on complex manifolds, making classical approaches unsuitable.
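To make the factorial bottleneck concrete, here is a minimal sketch of the exact computation; the utility `v` is a hypothetical stand-in for a model performance metric, and the enumeration is feasible only for tiny $n$:

```python
from itertools import combinations
from math import factorial

def exact_shapley(points, v):
    """Exact Data-Shapley: enumerates every subset of the remaining
    points for each player, so cost grows factorially in n."""
    n = len(points)
    phi = {}
    for i in points:
        rest = [p for p in points if p != i]
        phi[i] = sum(
            factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            * (v(set(S) | {i}) - v(set(S)))
            for k in range(n)
            for S in combinations(rest, k)
        )
    return phi

# Hypothetical utility with diminishing returns in coalition size,
# standing in for a metric such as accuracy or AUC.
v = lambda S: len(S) ** 0.5
print(exact_shapley([0, 1, 2, 3], v))  # -> 0.5 per point, by symmetry
```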

HCDV replaces the pointwise "player" assumption with a hierarchy of cluster coalitions, allocating Shapley-style payoffs in a top-down manner. This design aims to:

  • Scale to large $n$ by reducing factorial complexity.
  • Respect multi-scale geometric structure.
  • Regularize outliers through curvature-based constraints.

2. Three-Stage Methodology

HCDV consists of three main stages:

Stage I: Contrastive, Geometry-Preserving Representation

An encoder $f_\theta: \mathcal X \rightarrow \mathbb R^d$ (e.g., ResNet/CNN for images, MLP for tabular data) is trained to both maximize predictive performance and amplify inter-class geometry via augmentations and a contrastive objective:

$$\max_\theta\; \mathbb E_{S\sim\mathcal P_{\rm batch}} \left[\mathcal M(S) + \lambda\,\bar\Delta_c(S)\right] - \alpha\,\Omega(\theta)$$

with

  • $\mathcal M(S)$: model performance on $S$.
  • $\bar\Delta_c(S)$: normalized dispersion over cross-label pairs in $S$.
  • $\Omega(\theta)$: curvature-based smoothness regularizer,

$$\Omega(\theta)\approx \frac{1}{|\mathcal Q|} \sum_{(p,q)\in\mathcal Q} \frac{1}{\epsilon}\, \mathbb E_r \left[d(f_\theta(x_p+\epsilon r), f_\theta(x_q)) - d(f_\theta(x_p), f_\theta(x_q))\right]^2

where $r \sim \mathcal N(0,I)$.
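A minimal PyTorch sketch of this curvature penalty, assuming a Euclidean $d$ and a small number of random directions per pair; the batching and names are illustrative, not the paper's implementation:

```python
import torch

def curvature_penalty(f, xp, xq, eps=1e-2, n_dirs=4):
    """Monte Carlo estimate of the smoothness regularizer Omega(theta):
    perturb x_p along directions r ~ N(0, I) and penalise the squared
    change in embedding distance, scaled by 1/eps."""
    dist = lambda a, b: (a - b).norm(dim=-1)     # Euclidean metric in R^d
    base = dist(f(xp), f(xq))                    # d(f(x_p), f(x_q))
    penalty = torch.zeros(())
    for _ in range(n_dirs):
        r = torch.randn_like(xp)                 # random Gaussian direction
        pert = dist(f(xp + eps * r), f(xq))      # d(f(x_p + eps*r), f(x_q))
        penalty = penalty + (pert - base).pow(2).mean()
    return penalty / (n_dirs * eps)              # average over r, 1/eps scale

# Toy usage with an encoder f_theta: R^8 -> R^4.
f = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.Tanh())
xp, xq = torch.randn(32, 8), torch.randn(32, 8)
omega = curvature_penalty(f, xp, xq)
omega.backward()  # differentiable, so it can join the training objective
```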

Stage II: Balanced Hierarchical Clustering

The embedded data are hierarchically partitioned via balanced recursive $k$-means, producing nested cluster sets $C_0 \to C_1 \to \cdots \to C_L$; at each level $\ell$, each cluster $G_k^{(\ell)}$ is subdivided until all leaves have size $\le M$. Capacity constraints,

$$s_\ell^{\min} \le |G_k^{(\ell+1)}| \le s_\ell^{\max},\qquad s_\ell^{\pm} = (1\pm\gamma)\,\frac{|G|}{K_{\ell+1}}$$

prevent imbalanced splits. The total number of coalitions is $O(K\log n) \ll n$; the hierarchy depth is $L \approx \lceil \log_K(n/M)\rceil$ for constant $K$.
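The following sketch shows one way to realize such a balanced recursive $k$-means, using a greedy per-child capacity cap of $(1+\gamma)|G|/K$; the paper's exact balancing procedure may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def balanced_split(X, idx, K, gamma=0.2):
    """Split points into K children whose sizes respect the capacity
    cap s_max = (1 + gamma) * |G| / K, filling seats greedily."""
    km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(X[idx])
    dist = km.transform(X[idx])                     # point-to-centroid distances
    cap = int(np.ceil((1 + gamma) * len(idx) / K))  # s_max per child
    children, sizes = [[] for _ in range(K)], [0] * K
    # Points with the largest margin between best and second-best
    # centroid claim their preferred cluster first.
    margin = dist.min(axis=1) - np.partition(dist, 1, axis=1)[:, 1]
    for p in np.argsort(margin):
        for c in np.argsort(dist[p]):               # nearest feasible centroid
            if sizes[c] < cap:
                children[c].append(idx[p])
                sizes[c] += 1
                break
    return children

def build_hierarchy(X, idx, K=4, M=64, gamma=0.2):
    """Recursively split until every leaf holds at most M points."""
    if len(idx) <= M:
        return {"leaf": idx}
    return {"children": [build_hierarchy(X, np.array(c), K, M, gamma)
                         for c in balanced_split(X, idx, K, gamma)]}

X = np.random.randn(1000, 16)                       # stand-in embeddings
tree = build_hierarchy(X, np.arange(len(X)), K=4, M=64)
```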

Stage III: Local Monte Carlo Shapley-Style Payoffs

At each level $\ell$, clusters are treated as players, and a Monte Carlo approximation is used to estimate local Shapley values:

$$\widehat\psi_G^{(\ell)} = \frac{1}{T} \sum_{t=1}^T \left[v_\ell\big(\mathrm{Pre}_{\pi_t}(G)\cup\{G\}\big) - v_\ell\big(\mathrm{Pre}_{\pi_t}(G)\big)\right]$$

where each $\pi_t$ is a random permutation, $\mathrm{Pre}_{\pi_t}(G)$ is the set of clusters preceding $G$ in $\pi_t$, and $v_\ell(S)$ is the characteristic function incorporating both performance and dispersion:

$$v_\ell(S) = \mathcal M\left(\bigcup_{G\in S} G\right) + \lambda\,\bar\Delta_c\left(\bigcup_{G\in S} G\right)$$

Estimated payoffs are normalized and propagated downward:

$$\omega_H^{(\ell+1)} = \frac{\max\{v_\ell(\{H\}), 0\}}{\sum_{H'} \max\{v_\ell(\{H'\}), 0\}}, \qquad \widetilde\psi_H^{(\ell+1)} = \omega_H^{(\ell+1)}\, \widehat\psi_P^{(\ell)}$$

where $P$ is the parent of cluster $H$. At the leaf level, payoffs are either computed exactly (if $|G|\leq M$) or uniformly split among members.

The total computational cost is $O(\tau\,T\,K\log n)$, with $\tau$ the cost of a forward pass, eliminating the factorial growth of the original method.
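A minimal sketch of the permutation-sampling estimator and the downward budget propagation, assuming a caller-supplied characteristic function `v` over coalitions of cluster ids (the paper's $v_\ell$ combines model performance and dispersion, which is abstracted away here):

```python
import random

def mc_cluster_shapley(clusters, v, T=128, seed=0):
    """Permutation-sampling Shapley estimate at one hierarchy level.
    `clusters` maps cluster id -> members; `v` maps a tuple of cluster
    ids to a real-valued utility."""
    rng = random.Random(seed)
    ids = list(clusters)
    psi = {g: 0.0 for g in ids}
    for _ in range(T):
        rng.shuffle(ids)
        prev = v(())
        for j, g in enumerate(ids):
            cur = v(tuple(sorted(ids[: j + 1])))
            psi[g] += (cur - prev) / T   # marginal contribution of g
            prev = cur
    return psi

def propagate(parent_payoff, child_values):
    """Split a parent's payoff over its children in proportion to
    max(v({H}), 0), as in the budget-propagation rule above."""
    w = {h: max(val, 0.0) for h, val in child_values.items()}
    z = sum(w.values()) or 1.0
    return {h: parent_payoff * w[h] / z for h in w}

# Toy usage: three clusters, utility = sqrt of total coalition size.
clusters = {"A": range(50), "B": range(30), "C": range(20)}
v = lambda S: sum(len(clusters[g]) for g in S) ** 0.5
psi = mc_cluster_shapley(clusters, v, T=256)
shares = propagate(psi["A"], {"A1": 1.2, "A2": 0.4, "A3": -0.1})
```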

3. Theoretical Guarantees

HCDV approximately preserves the four classical Shapley axioms under mild assumptions:

  • Approximate Efficiency: The deviation in total value mass satisfies

$$\left|\sum_{i=1}^n \phi_i - \big[v_0(C_0) - v_0(\emptyset)\big]\right| \le O(\eta \log n) + \varepsilon_{\rm leaf}$$

where $\eta$ controls the Monte Carlo approximation error and $\varepsilon_{\rm leaf}$ bounds the leaf-level splitting error.

  • Coalition Deviation: With $T$ permutations,

$$\Pr\left\{ \left|\widehat\psi_G^{(\ell)} - \psi_G^{(\ell)}\right| \ge \eta \right\} \le 2 \exp\left(-\frac{T\eta^2}{8B^2}\right)$$

and $T = O(B^2 \log n/\eta^2)$ ensures all errors are $\le \eta$ with high probability (a worked budget calculation follows this list).

  • Top-$k$ Surrogate Regret: For top-$k$ selection, regret is bounded by $2k\varepsilon_\infty$, where $\|\phi^{\rm H} - \phi^{\rm Sh}\|_\infty \le \varepsilon_\infty = O(\eta\log n)$.
  • Approximate Symmetry, Dummy, and Additivity: Monte Carlo Shapley at the coalition level is unbiased and additive, with errors $O(\eta)$; at the point level, symmetry/dummy deviations are at most $O(\varepsilon_\infty)$.
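As a quick check on the coalition-deviation bound, the following sketch solves the concentration inequality for the permutation budget $T$ needed so that, by a union bound over $m$ coalitions, every estimate is within $\eta$ with probability at least $1-\delta$; the parameter values are illustrative:

```python
import math

def permutation_budget(B, eta, n_coalitions, delta=0.05):
    """Smallest T with 2*exp(-T*eta^2 / (8*B^2)) <= delta / n_coalitions,
    so all coalition estimates are eta-accurate w.p. >= 1 - delta."""
    return math.ceil(8 * B**2 / eta**2 * math.log(2 * n_coalitions / delta))

# Illustrative numbers: bounded utility B = 1, target eta = 0.05,
# 500 coalitions across the hierarchy.
print(permutation_budget(B=1.0, eta=0.05, n_coalitions=500))  # 31692
```

The worst-case budget is far larger than the practical $T$ of 128–512 reported in Section 6, reflecting how conservative the Hoeffding-style bound is relative to observed behavior.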

4. Empirical Evaluation and Benchmarks

Extensive evaluation across four tasks demonstrates HCDV's accuracy, efficiency, and stability (Xiao et al., 22 Dec 2025):

| Dataset | $n$ | Application | HCDV Gains |
|---|---|---|---|
| Synthetic Two-Gaussians | 3,000 | Toy, geometric sensitivity | Predictive lift up to +5 pp |
| UCI Adult | 48,842 | Tabular classification | Up to $100\times$ faster valuation |
| Fashion-MNIST | 70,000 | Image/augmentation | 25–40% lower CV (stability) |
| Criteo-1B* (CTR) | 45M | Click-through prediction | +1–5 pp AUC; sub-2 s latency |

Baseline comparisons include Monte Carlo Data-Shapley (MCDS), Group Shapley (GS), Data Banzhaf, Random, DU-Shapley, and KNN-Shapley. Evaluation on the OpenDataVal suite shows that HCDV consistently improves predictive utility after training on the top 30% of samples, reduces the coefficient of variation, and sharply cuts valuation time.

5. Downstream Applications

HCDV directly enables several data-driven tasks:

  • Augmentation Filtering: On Fashion-MNIST with 10k augmentations, HCDV selects top-1k samples that yield +2.8 pp accuracy, 42% cluster overlap, and full class coverage.
  • Low-Latency Streaming Updates: For simulated click-streams (10 steps of 1.5k samples), incremental HCDV achieves 99.6% of the AUC of a full recompute, with a $2.5\times$ speedup and sub-2 s latency.
  • Fair Marketplace Payouts: On UCI Adult with 5 sellers, HCDV's seller payoffs correlate at $\rho=0.94$ with leave-one-out marginals and achieve a low Gini index, at 3.8 min runtime compared to 32.5 min for MCDS.

6. Hyperparameters, Implementation, and Practical Considerations

Practical instantiation of HCDV requires careful selection of hyperparameters:

  • Permutation budget ($T$): 128–512 is typically sufficient for $\varepsilon\le0.01$.
  • Cluster counts: $K_1\approx 16$–$32$, $K_2\approx 64$–$128$, leaf size $M\approx 64$–$256$.
  • Contrastive/smoothness weights: $\lambda\approx 10$, $\alpha\approx 10^{-3}$.
  • Compute: One GPU for embedding; GPU plus multi-core CPU for clustering and payoff computation.
  • Implementation Pseudocode: Algorithm 1 in (Xiao et al., 22 Dec 2025) trains the contrastive encoder, embeds points, builds the balanced hierarchy, propagates budgets, and estimates local Shapley values at each level. Streaming updates are handled via Algorithm 2: assign new data to the nearest leaves, update the affected subtrees, and rebalance as needed.

Embedding quality and smoothness critically impact valuation accuracy. Cluster depth $L$ and leaf size $M$ control the compute/granularity trade-off. Early stopping on the validation metric $\mathcal M$ is advised. A consolidated configuration sketch is given below.
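Collecting the ranges above, a hypothetical configuration might look as follows; the field names are illustrative rather than the paper's API, and the capacity slack `gamma` is an assumed example value:

```python
# Hypothetical HCDV configuration collecting the ranges above.
hcdv_config = {
    "T": 256,              # permutation budget (128-512 for eps <= 0.01)
    "K1": 16,              # level-1 cluster count (16-32)
    "K2": 64,              # level-2 cluster count (64-128)
    "M": 128,              # leaf size (64-256)
    "lambda_disp": 10.0,   # contrastive dispersion weight lambda
    "alpha_smooth": 1e-3,  # curvature-smoothness weight alpha
    "gamma": 0.2,          # capacity slack for balanced splits (assumed)
}
```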

7. Extensions, Limitations, and Future Directions

HCDV's generality supports extensions to federated and active learning environments (Editor's term). Its requirements—balanced trees, bounded characteristic functions, and accessible model/dispersion metrics—are compatible with standard supervised learning pipelines.

A plausible implication is that HCDV can serve as a template for efficient, geometry-aware valuation in privacy-preserving or distributed settings. Modulating hierarchy granularity and representation regularization may further enhance both interpretability and downstream performance.

Current limitations include sensitivity to the clustering quality and the assumed adequacy of the contrastive embedding for all downstream tasks. The uniform split heuristic in large leaves introduces approximation error, though bounded in practice.

HCDV represents an advance in efficient, scalable, and theoretically substantiated data valuation, achieving practical applicability across multiple domains, including streaming and fair exchange, while maintaining approximate compliance with key Shapley principles (Xiao et al., 22 Dec 2025).

References

Xiao et al. (22 Dec 2025). Hierarchical Contrastive Data Valuation (HCDV).
