Hierarchical Contrastive Data Valuation (HCDV)
- The paper introduces HCDV, a scalable method that leverages contrastive representation, balanced hierarchical clustering, and Monte Carlo games to approximate Shapley values with controllable error.
- It overcomes the factorial computational bottleneck by reducing complexity from $O(n!)$ to $O(K \log n)$ and supports diverse applications such as data filtering, streaming updates, and fair marketplace payouts.
- Empirical benchmarks on datasets ranging from synthetic examples to Criteo-1B validate HCDV’s superior runtime, stability, and predictive improvements with strong theoretical guarantees.
Hierarchical Contrastive Data Valuation (HCDV) is a scalable, Shapley-inspired framework for quantifying the value of individual training examples in large, heterogeneous, and geometrically structured datasets. HCDV is designed to address the factorial computational bottleneck and the geometric insensitivity of classical data-Shapley methods by leveraging a contrastively trained representation, a balanced hierarchical clustering, and local Shapley-style valuation via budget-propagated Monte Carlo games. The method provides theoretical guarantees of approximate Shapley axioms with controllable error, achieves superior runtime and stability, and directly supports diverse applications such as data filtering, streaming updates, and marketplace payouts (Xiao et al., 22 Dec 2025).
1. Foundations and Motivation
Classical Data-Shapley assigns a payoff to each data point $i$ via the formula

$$\phi_i \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr],$$

where $v$ is typically a model performance metric (e.g., accuracy, AUC). This approach is intractable for large $n$ and treats data points independently, ignoring latent geometric structure. Modern datasets often run to millions of examples, are heterogeneous, and lie on complex manifolds, making classical approaches unsuitable.
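To make the cost concrete, here is a minimal Python sketch of exact Data-Shapley by subset enumeration; the toy characteristic function `v` is a hypothetical stand-in for a model performance metric, and the exponential inner loop illustrates why the method breaks down beyond a few dozen points.

```python
# Minimal sketch of classical Data-Shapley by exact enumeration.
# The characteristic function v() is a hypothetical stand-in: in
# practice it would be a model performance metric over a subset.
from itertools import combinations
from math import factorial

def exact_shapley(n, v):
    """Exact Shapley values; cost grows as O(2^n), infeasible beyond ~20 points."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

# Toy utility: a coalition is worth the number of "useful" points it contains.
useful = {0, 2}
v = lambda S: len(S & useful)
print(exact_shapley(4, v))  # points 0 and 2 each get value 1.0, the rest 0
```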
HCDV replaces the pointwise "player" assumption with a hierarchy of cluster coalitions, allocating Shapley-style payoffs in a top-down manner. This design aims to:
- Scale to large $n$ by eliminating factorial complexity.
- Respect multi-scale geometric structure.
- Regularize outliers through curvature-based constraints.
2. Three-Stage Methodology
HCDV consists of three main stages:
Stage I: Contrastive, Geometry-Preserving Representation
An encoder (e.g., ResNet/CNN for images, MLP for tabular data) is trained to both maximize predictive performance and amplify inter-class geometry via augmentations and a contrastive objective of the form

$$\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{perf}}(\theta) \;-\; \lambda_1\,\mathrm{Disp}(\theta) \;+\; \lambda_2\,\Omega(\theta),$$

with
- $\mathcal{L}_{\mathrm{perf}}$: model performance on the training task.
- $\mathrm{Disp}$: normalized dispersion over cross-label pairs in the embedding space.
- $\Omega$: curvature-based smoothness regularizer on the encoder.
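A hedged sketch of how such an objective might be assembled in PyTorch follows; the weights `lambda1`/`lambda2` and the concrete dispersion and smoothness terms are illustrative assumptions, not the paper's verbatim definitions.

```python
# Illustrative Stage I loss: task loss, minus a dispersion bonus on
# cross-label pairs, plus a smoothness penalty. Functional forms and
# weights are assumptions chosen to match the text's description.
import torch
import torch.nn.functional as F

def stage1_loss(logits, embeddings, labels, lambda1=0.1, lambda2=0.01):
    task = F.cross_entropy(logits, labels)          # predictive performance
    # Normalized dispersion over cross-label pairs: push apart
    # embeddings whose labels differ.
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T                                   # pairwise cosine similarities
    cross = (labels[:, None] != labels[None, :]).float()
    dispersion = (cross * (1.0 - sim)).sum() / cross.sum().clamp(min=1.0)
    # Crude smoothness proxy standing in for the paper's curvature
    # regularizer: penalize large embedding norms.
    smooth = embeddings.pow(2).sum(dim=1).mean()
    return task - lambda1 * dispersion + lambda2 * smooth
```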
Stage II: Balanced Hierarchical Clustering
The embedded data are hierarchically partitioned via a balanced recursive $k$-means, producing cluster sets $\{C_j^{(\ell)}\}$ at each level $\ell$, where each cluster is subdivided until all leaves fall below a target size $m$. Capacity constraints of the form

$$\Bigl\lfloor \tfrac{n_\ell}{K_\ell} \Bigr\rfloor \;\le\; \bigl|C_j^{(\ell)}\bigr| \;\le\; \Bigl\lceil \tfrac{n_\ell}{K_\ell} \Bigr\rceil$$

prevent imbalanced splits. The total number of coalitions is $O(K \log n)$, and the hierarchy depth is $O(\log n)$ for a constant branching factor $K$.
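The following sketch shows one way to realize balanced recursive $k$-means under such capacity constraints; the greedy capacity-respecting assignment is an assumption for illustration, not necessarily the paper's exact balancing procedure.

```python
# Hedged sketch of balanced recursive k-means (Stage II). The capacity
# rule ceil(n/K) and the greedy reassignment are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def balanced_split(X, idx, K):
    """Split the points in idx into K children of near-equal size."""
    km = KMeans(n_clusters=K, n_init=10).fit(X[idx])
    cap = int(np.ceil(len(idx) / K))
    dists = km.transform(X[idx])                 # (n, K) distances to centroids
    children = [[] for _ in range(K)]
    # Greedy assignment by distance, respecting per-cluster capacity.
    for i in np.argsort(dists.min(axis=1)):
        for c in np.argsort(dists[i]):
            if len(children[c]) < cap:
                children[c].append(idx[i])
                break
    return [np.array(ch) for ch in children]

def build_hierarchy(X, idx, K=8, leaf_size=256):
    if len(idx) <= leaf_size:
        return {"points": idx}                   # leaf node
    return {"children": [build_hierarchy(X, ch, K, leaf_size)
                         for ch in balanced_split(X, idx, K)]}
```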
Stage III: Local Monte Carlo Shapley-Style Payoffs
At each level $\ell$, clusters are treated as players, and a Monte Carlo approximation over $T$ random permutations is used to estimate local Shapley values:

$$\hat{\phi}_j^{(\ell)} \;=\; \frac{1}{T} \sum_{t=1}^{T} \Bigl[ v\bigl(P_j^{\pi_t} \cup \{j\}\bigr) - v\bigl(P_j^{\pi_t}\bigr) \Bigr],$$

where each $\pi_t$ is a random permutation of the level-$\ell$ clusters, $P_j^{\pi_t}$ is the set of clusters preceding $j$ in $\pi_t$, and $v$ is the characteristic function incorporating both performance and dispersion. Estimated payoffs are normalized and propagated downward as budgets:

$$B_c \;=\; B_{\mathrm{parent}(c)} \cdot \frac{\hat{\phi}_c}{\sum_{c' \in \mathrm{sib}(c)} \hat{\phi}_{c'}}.$$

At the leaf level, payoffs are either computed exactly (when the leaf is small enough for exact Shapley computation) or split uniformly among members.
Total computational cost is $O(T K \log n)$ evaluations of the characteristic function, each dominated by the cost of a model forward pass, eliminating the factorial growth of the original method.
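A compact sketch of the per-level estimator and the downward budget propagation, assuming a caller-supplied characteristic function `v` over frozensets of cluster ids and flooring negative payoff estimates at zero before normalization (both illustrative assumptions):

```python
# Hedged sketch of Stage III: Monte Carlo Shapley over the clusters at
# one level, then proportional budget propagation to children.
import numpy as np

def mc_shapley(clusters, v, T=256, rng=np.random.default_rng(0)):
    """Estimate Shapley values for a list of cluster ids with T permutations."""
    K = len(clusters)
    phi = np.zeros(K)
    for _ in range(T):
        perm = rng.permutation(K)
        coalition, prev = [], v(frozenset())
        for j in perm:                    # walk the permutation, accumulating
            coalition.append(clusters[j])  # marginal contributions v(P+{j})-v(P)
            cur = v(frozenset(coalition))
            phi[j] += cur - prev
            prev = cur
    return phi / T

def propagate_budget(phi, parent_budget):
    """Normalize child payoffs and split the parent's budget proportionally."""
    w = np.clip(phi, 0, None)             # assumption: floor negatives at zero
    w = w / w.sum() if w.sum() > 0 else np.full_like(w, 1 / len(w))
    return parent_budget * w
```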
3. Theoretical Guarantees
HCDV approximately preserves the four classical Shapley axioms under mild assumptions:
- Approximate Efficiency: The deviation in total value mass satisfies
$$\Bigl|\sum_i \hat{\phi}_i - v(N)\Bigr| \;\le\; \varepsilon,$$
where $\varepsilon$ controls the Monte Carlo approximation error.
- Coalition Deviation: With $T$ permutations, each coalition-level estimate deviates from its exact Shapley value by $O\bigl(\sqrt{\log(1/\delta)/T}\bigr)$, and a budget of $T = O\bigl(\varepsilon^{-2}\log(K/\delta)\bigr)$ keeps all errors below $\varepsilon$ with high probability (see the budget sketch after this list).
- Top-$k$ Surrogate Regret: For top-$k$ selection, regret is bounded by a term proportional to the coalition-level error $\varepsilon$.
- Approximate Symmetry, Dummy, and Additivity: Monte Carlo Shapley at the coalition level is unbiased and additive, with errors of order $O(1/\sqrt{T})$; at the point level, symmetry and dummy deviations are at most the propagated coalition-level error.
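For intuition, the permutation budget implied by a Hoeffding-style bound can be computed directly; the constant factor and the range bound `R` on marginal contributions below are illustrative assumptions, not the paper's exact constants.

```python
# Back-of-envelope permutation budget from a Hoeffding-style bound:
# T >= (R^2 / (2 eps^2)) * log(2K / delta), where R bounds the
# marginal contributions. Constants here are assumptions.
import math

def permutation_budget(K, eps=0.05, delta=0.05, R=1.0):
    return math.ceil((R**2 / (2 * eps**2)) * math.log(2 * K / delta))

print(permutation_budget(K=32))  # ~1431 permutations for eps = delta = 0.05
```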
4. Empirical Evaluation and Benchmarks
Extensive evaluation across four tasks demonstrates HCDV's accuracy, efficiency, and stability (Xiao et al., 22 Dec 2025):
| Dataset | $n$ | Application | HCDV Gains |
|---|---|---|---|
| Synthetic Two-Gaussians | $3,000$ | Toy, geometric sensitivity | Predictive lift of several pp |
| UCI Adult | $48,842$ | Tabular/classification | Markedly faster valuation (see Section 5) |
| Fashion-MNIST | $70,000$ | Image/augmentation | $25$--$40$\% lower CV (stability) |
| Criteo-1B CTR | $45$M | Click-through prediction | Up to $5$ pp AUC lift; sub-$2$s latency |
Baseline comparisons include Monte Carlo Data-Shapley (MCDS), Group Shapley (GS), Data Banzhaf, Random, DU-Shapley, and KNN-Shapley. Evaluation on the OpenDataVal suite shows that HCDV consistently improves predictive utility after training on the top-ranked fraction of samples, reduces the coefficient of variation, and sharply cuts valuation time.
5. Downstream Applications
HCDV directly enables several data-driven tasks:
- Augmentation Filtering: On Fashion-MNIST with $10k$ augmentations, HCDV selects top-$1k$ samples that yield an accuracy gain of several pp, high cluster overlap, and full class coverage.
- Low-Latency Streaming Updates: For simulated click-streams ($10$ steps of $1.5k$ samples), incremental HCDV recovers nearly the full-recompute AUC, with a substantial speedup and sub-$2$s latency.
- Fair Marketplace Payouts: On UCI Adult with $5$ sellers, HCDV's seller payoff correlates strongly with leave-one-out marginals and achieves a low Gini index, at $3.8$ min runtime compared to $32.5$ min for MCDS.
6. Hyperparameters, Implementation, and Practical Considerations
Practical instantiation of HCDV requires careful selection of hyperparameters:
- Permutation budget ($T$): $128$--$512$ is typically sufficient.
- Cluster counts: up to $32$ branches per split at upper levels and up to $128$ at lower levels; leaf sizes up to $256$.
- Contrastive/smoothness weights ($\lambda_1$, $\lambda_2$): tuned against a validation metric.
- Compute: one GPU for embedding; GPU plus multi-core CPU for clustering and payoff computation.
- Implementation Pseudocode: Algorithm 1 in (Xiao et al., 22 Dec 2025) trains the contrastive encoder, embeds the points, builds the balanced hierarchy, propagates budgets, and estimates local Shapley values at each level. Streaming updates are handled via Algorithm 2: assign new data to the nearest leaves, update the affected subtrees, and rebalance as needed (a minimal sketch follows below).
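A minimal sketch in the spirit of Algorithm 2, assuming each internal node stores child centroids and each leaf stores a point list (both hypothetical structural choices, not the paper's data layout):

```python
# Streaming-update sketch: route new points to their nearest leaf and
# flag the affected leaves for local re-valuation. Node fields
# ("children", "centroid", "points") are illustrative assumptions.
import numpy as np

def nearest_leaf(node, x):
    """Descend the hierarchy by nearest child centroid until a leaf."""
    while "children" in node:
        d = [np.linalg.norm(x - c["centroid"]) for c in node["children"]]
        node = node["children"][int(np.argmin(d))]
    return node

def stream_update(root, batch, max_leaf=256):
    dirty = set()
    for x in batch:
        leaf = nearest_leaf(root, x)
        leaf["points"].append(x)
        dirty.add(id(leaf))
        if len(leaf["points"]) > max_leaf:
            pass  # a local split/rebalance of this leaf would trigger here
    return dirty     # leaves whose local Shapley estimates need refreshing
```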
Embedding quality and smoothness critically impact valuation accuracy. Cluster depth and leaf size control the compute/granularity trade-off. Early stopping on a validation metric is advised.
7. Extensions, Limitations, and Future Directions
HCDV's generality supports extensions to federated and active learning environments (Editor's term). Its requirements—balanced trees, bounded characteristic functions, and accessible model/dispersion metrics—are compatible with standard supervised learning pipelines.
A plausible implication is that HCDV can serve as a template for efficient, geometry-aware valuation in privacy-preserving or distributed settings. Modulating hierarchy granularity and representation regularization may further enhance both interpretability and downstream performance.
Current limitations include sensitivity to the clustering quality and the assumed adequacy of the contrastive embedding for all downstream tasks. The uniform split heuristic in large leaves introduces approximation error, though bounded in practice.
HCDV represents an advance in efficient, scalable, and theoretically substantiated data valuation, achieving practical applicability across multiple domains, including streaming and fair exchange, while maintaining approximate compliance with key Shapley principles (Xiao et al., 22 Dec 2025).