Hierarchical Contrastive Data Valuation (HCDV)
- The paper introduces HCDV, a scalable method that leverages contrastive representation, balanced hierarchical clustering, and Monte Carlo games to approximate Shapley values with controllable error.
- It overcomes the factorial computational bottleneck by reducing complexity from $O(n!)$ to $O(K \log n)$ and supports diverse applications such as data filtering, streaming updates, and fair marketplace payouts.
- Empirical benchmarks on datasets ranging from synthetic examples to Criteo-1B validate HCDV’s superior runtime, stability, and predictive improvements with strong theoretical guarantees.
Hierarchical Contrastive Data Valuation (HCDV) is a scalable, Shapley-inspired framework for quantifying the value of individual training examples in large, heterogeneous, and geometrically structured datasets. HCDV is designed to address the factorial computational bottleneck and the geometric insensitivity of classical data-Shapley methods by leveraging a contrastively trained representation, a balanced hierarchical clustering, and local Shapley-style valuation via budget-propagated Monte Carlo games. The method provides theoretical guarantees of approximate Shapley axioms with controllable error, achieves superior runtime and stability, and directly supports diverse applications such as data filtering, streaming updates, and marketplace payouts (Xiao et al., 22 Dec 2025).
1. Foundations and Motivation
Classical Data-Shapley assigns a payoff to each data point $i$ via the formula

$$\phi_i \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr],$$

where $v$ is typically a model performance metric (e.g., accuracy, AUC). This approach is intractable for large $n$ and treats data points independently, ignoring latent geometric structure. Modern datasets often run to millions of examples, are heterogeneous, and lie on complex manifolds, making classical approaches unsuitable.
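To make the cost concrete, here is a minimal Python sketch of exact Data-Shapley by subset enumeration; the toy characteristic function `v` is a hypothetical stand-in for a model performance metric, and the exponential inner loop illustrates why the method breaks down beyond a few dozen points.

```python
# Minimal sketch of classical Data-Shapley by exact enumeration.
# The characteristic function v() is a hypothetical stand-in: in
# practice it would be a model performance metric over a subset.
from itertools import combinations
from math import factorial

def exact_shapley(n, v):
    """Exact Shapley values; cost grows as O(2^n), infeasible beyond ~20 points."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

# Toy utility: a coalition is worth the number of "useful" points it contains.
useful = {0, 2}
v = lambda S: len(S & useful)
print(exact_shapley(4, v))  # points 0 and 2 each get value 1.0, the rest 0
```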
HCDV replaces the pointwise "player" assumption with a hierarchy of cluster coalitions, allocating Shapley-style payoffs in a top-down manner. This design aims to:
- Scale to large $n$ by eliminating factorial complexity.
- Respect multi-scale geometric structure.
- Regularize outliers through curvature-based constraints.
2. Three-Stage Methodology
HCDV consists of three main stages:
Stage I: Contrastive, Geometry-Preserving Representation
An encoder (e.g., ResNet/CNN for images, MLP for tabular data) is trained to both maximize predictive performance and amplify inter-class geometry via augmentations and a contrastive objective of the form

$$\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{perf}}(\theta) \;-\; \lambda_1\,\mathrm{Disp}(\theta) \;+\; \lambda_2\,\Omega(\theta),$$

with
- $\mathcal{L}_{\mathrm{perf}}$: model performance on the training task.
- $\mathrm{Disp}$: normalized dispersion over cross-label pairs in the embedding space.
- $\Omega$: curvature-based smoothness regularizer on the encoder.
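A hedged sketch of how such an objective might be assembled in PyTorch follows; the weights `lambda1`/`lambda2` and the concrete dispersion and smoothness terms are illustrative assumptions, not the paper's verbatim definitions.

```python
# Illustrative Stage I loss: task loss, minus a dispersion bonus on
# cross-label pairs, plus a smoothness penalty. Functional forms and
# weights are assumptions chosen to match the text's description.
import torch
import torch.nn.functional as F

def stage1_loss(logits, embeddings, labels, lambda1=0.1, lambda2=0.01):
    task = F.cross_entropy(logits, labels)          # predictive performance
    # Normalized dispersion over cross-label pairs: push apart
    # embeddings whose labels differ.
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T                                   # pairwise cosine similarities
    cross = (labels[:, None] != labels[None, :]).float()
    dispersion = (cross * (1.0 - sim)).sum() / cross.sum().clamp(min=1.0)
    # Crude smoothness proxy standing in for the paper's curvature
    # regularizer: penalize large embedding norms.
    smooth = embeddings.pow(2).sum(dim=1).mean()
    return task - lambda1 * dispersion + lambda2 * smooth
```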
Stage II: Balanced Hierarchical Clustering
The embedded data are hierarchically partitioned via a balanced recursive $k$-means, producing cluster sets $\{C_j^{(\ell)}\}$ at each level $\ell$, where each cluster is subdivided until all leaves fall below a target size $m$. Capacity constraints of the form

$$\Bigl\lfloor \tfrac{n_\ell}{K_\ell} \Bigr\rfloor \;\le\; \bigl|C_j^{(\ell)}\bigr| \;\le\; \Bigl\lceil \tfrac{n_\ell}{K_\ell} \Bigr\rceil$$

prevent imbalanced splits. The total number of coalitions is $O(K \log n)$, and the hierarchy depth is $O(\log n)$ for a constant branching factor $K$.
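The following sketch shows one way to realize balanced recursive $k$-means under such capacity constraints; the greedy capacity-respecting assignment is an assumption for illustration, not necessarily the paper's exact balancing procedure.

```python
# Hedged sketch of balanced recursive k-means (Stage II). The capacity
# rule ceil(n/K) and the greedy reassignment are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def balanced_split(X, idx, K):
    """Split the points in idx into K children of near-equal size."""
    km = KMeans(n_clusters=K, n_init=10).fit(X[idx])
    cap = int(np.ceil(len(idx) / K))
    dists = km.transform(X[idx])                 # (n, K) distances to centroids
    children = [[] for _ in range(K)]
    # Greedy assignment by distance, respecting per-cluster capacity.
    for i in np.argsort(dists.min(axis=1)):
        for c in np.argsort(dists[i]):
            if len(children[c]) < cap:
                children[c].append(idx[i])
                break
    return [np.array(ch) for ch in children]

def build_hierarchy(X, idx, K=8, leaf_size=256):
    if len(idx) <= leaf_size:
        return {"points": idx}                   # leaf node
    return {"children": [build_hierarchy(X, ch, K, leaf_size)
                         for ch in balanced_split(X, idx, K)]}
```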
Stage III: Local Monte Carlo Shapley-Style Payoffs
At each level $\ell$, clusters are treated as players, and a Monte Carlo approximation over $T$ random permutations is used to estimate local Shapley values:

$$\hat{\phi}_j^{(\ell)} \;=\; \frac{1}{T} \sum_{t=1}^{T} \Bigl[ v\bigl(P_j^{\pi_t} \cup \{j\}\bigr) - v\bigl(P_j^{\pi_t}\bigr) \Bigr],$$

where each $\pi_t$ is a random permutation of the level-$\ell$ clusters, $P_j^{\pi_t}$ is the set of clusters preceding $j$ in $\pi_t$, and $v$ is the characteristic function incorporating both performance and dispersion. Estimated payoffs are normalized and propagated downward as budgets:

$$B_c \;=\; B_{\mathrm{parent}(c)} \cdot \frac{\hat{\phi}_c}{\sum_{c' \in \mathrm{sib}(c)} \hat{\phi}_{c'}}.$$

At the leaf level, payoffs are either computed exactly (when the leaf is small enough for exact Shapley computation) or split uniformly among members.
Total computational cost is $O(T K \log n)$ evaluations of the characteristic function, each dominated by the cost of a model forward pass, eliminating the factorial growth of the original method.
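A compact sketch of the per-level estimator and the downward budget propagation, assuming a caller-supplied characteristic function `v` over frozensets of cluster ids and flooring negative payoff estimates at zero before normalization (both illustrative assumptions):

```python
# Hedged sketch of Stage III: Monte Carlo Shapley over the clusters at
# one level, then proportional budget propagation to children.
import numpy as np

def mc_shapley(clusters, v, T=256, rng=np.random.default_rng(0)):
    """Estimate Shapley values for a list of cluster ids with T permutations."""
    K = len(clusters)
    phi = np.zeros(K)
    for _ in range(T):
        perm = rng.permutation(K)
        coalition, prev = [], v(frozenset())
        for j in perm:                    # walk the permutation, accumulating
            coalition.append(clusters[j])  # marginal contributions v(P+{j})-v(P)
            cur = v(frozenset(coalition))
            phi[j] += cur - prev
            prev = cur
    return phi / T

def propagate_budget(phi, parent_budget):
    """Normalize child payoffs and split the parent's budget proportionally."""
    w = np.clip(phi, 0, None)             # assumption: floor negatives at zero
    w = w / w.sum() if w.sum() > 0 else np.full_like(w, 1 / len(w))
    return parent_budget * w
```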
3. Theoretical Guarantees
HCDV approximately preserves the four classical Shapley axioms under mild assumptions:
- Approximate Efficiency: The deviation in total value mass satisfies
$$\Bigl|\sum_i \hat{\phi}_i - v(N)\Bigr| \;\le\; \varepsilon,$$
where $\varepsilon$ controls the Monte Carlo approximation error.
- Coalition Deviation: With $T$ permutations, each coalition-level estimate deviates from its exact Shapley value by $O\bigl(\sqrt{\log(1/\delta)/T}\bigr)$, and a budget of $T = O\bigl(\varepsilon^{-2}\log(K/\delta)\bigr)$ keeps all errors below $\varepsilon$ with high probability (see the budget sketch after this list).
- Top-$k$ Surrogate Regret: For top-$k$ selection, regret is bounded by a term proportional to the coalition-level error $\varepsilon$.
- Approximate Symmetry, Dummy, and Additivity: Monte Carlo Shapley at the coalition level is unbiased and additive, with errors of order $O(1/\sqrt{T})$; at the point level, symmetry and dummy deviations are at most the propagated coalition-level error.
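For intuition, the permutation budget implied by a Hoeffding-style bound can be computed directly; the constant factor and the range bound `R` on marginal contributions below are illustrative assumptions, not the paper's exact constants.

```python
# Back-of-envelope permutation budget from a Hoeffding-style bound:
# T >= (R^2 / (2 eps^2)) * log(2K / delta), where R bounds the
# marginal contributions. Constants here are assumptions.
import math

def permutation_budget(K, eps=0.05, delta=0.05, R=1.0):
    return math.ceil((R**2 / (2 * eps**2)) * math.log(2 * K / delta))

print(permutation_budget(K=32))  # ~1431 permutations for eps = delta = 0.05
```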
4. Empirical Evaluation and Benchmarks
Extensive evaluation across four tasks demonstrates HCDV's accuracy, efficiency, and stability (Xiao et al., 22 Dec 2025):
| Dataset | $n$ | Application | HCDV Gains |
|---|---|---|---|
| Synthetic Two-Gaussians | $3,000$ | Toy, geometric sensitivity | Predictive lift of several pp |
| UCI Adult | $48,842$ | Tabular/classification | Markedly faster valuation (see Section 5) |
| Fashion-MNIST | $70,000$ | Image/augmentation | $25$--$40$\% lower CV (stability) |
| Criteo-1B CTR | $45$M | Click-through prediction | Up to $5$ pp AUC lift; sub-$2$s latency |
Baseline comparisons include Monte Carlo Data-Shapley (MCDS), Group Shapley (GS), Data Banzhaf, Random, DU-Shapley, and KNN-Shapley. Evaluation on the OpenDataVal suite shows that HCDV consistently improves predictive utility after training on the top-ranked fraction of samples, reduces the coefficient of variation, and sharply cuts valuation time.
5. Downstream Applications
HCDV directly enables several data-driven tasks:
- Augmentation Filtering: On Fashion-MNIST with $10k$ augmentations, HCDV selects top-$1k$ samples that yield an accuracy gain of several pp, high cluster overlap, and full class coverage.
- Low-Latency Streaming Updates: For simulated click-streams ($10$ steps of $1.5k$ samples), incremental HCDV recovers nearly the full-recompute AUC, with a substantial speedup and sub-$2$s latency.
- Fair Marketplace Payouts: On UCI Adult with $5$ sellers, HCDV's seller payoff correlates strongly with leave-one-out marginals and achieves a low Gini index, at $3.8$ min runtime compared to $32.5$ min for MCDS.
6. Hyperparameters, Implementation, and Practical Considerations
Practical instantiation of HCDV requires careful selection of hyperparameters:
- Permutation budget ($T$): $128$--$512$ is typically sufficient.
- Cluster counts: up to $32$ branches per split at upper levels and up to $128$ at lower levels; leaf sizes up to $256$.
- Contrastive/smoothness weights ($\lambda_1$, $\lambda_2$): tuned against a validation metric.
- Compute: one GPU for embedding; GPU plus multi-core CPU for clustering and payoff computation.
- Implementation Pseudocode: Algorithm 1 in (Xiao et al., 22 Dec 2025) trains the contrastive encoder, embeds the points, builds the balanced hierarchy, propagates budgets, and estimates local Shapley values at each level. Streaming updates are handled via Algorithm 2: assign new data to the nearest leaves, update the affected subtrees, and rebalance as needed (a minimal sketch follows below).
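A minimal sketch in the spirit of Algorithm 2, assuming each internal node stores child centroids and each leaf stores a point list (both hypothetical structural choices, not the paper's data layout):

```python
# Streaming-update sketch: route new points to their nearest leaf and
# flag the affected leaves for local re-valuation. Node fields
# ("children", "centroid", "points") are illustrative assumptions.
import numpy as np

def nearest_leaf(node, x):
    """Descend the hierarchy by nearest child centroid until a leaf."""
    while "children" in node:
        d = [np.linalg.norm(x - c["centroid"]) for c in node["children"]]
        node = node["children"][int(np.argmin(d))]
    return node

def stream_update(root, batch, max_leaf=256):
    dirty = set()
    for x in batch:
        leaf = nearest_leaf(root, x)
        leaf["points"].append(x)
        dirty.add(id(leaf))
        if len(leaf["points"]) > max_leaf:
            pass  # a local split/rebalance of this leaf would trigger here
    return dirty     # leaves whose local Shapley estimates need refreshing
```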
Embedding quality and smoothness critically impact valuation accuracy. Cluster depth and leaf size control the compute/granularity trade-off. Early stopping on a validation metric is advised.
7. Extensions, Limitations, and Future Directions
HCDV's generality supports extensions to federated and active learning environments (Editor's term). Its requirements—balanced trees, bounded characteristic functions, and accessible model/dispersion metrics—are compatible with standard supervised learning pipelines.
A plausible implication is that HCDV can serve as a template for efficient, geometry-aware valuation in privacy-preserving or distributed settings. Modulating hierarchy granularity and representation regularization may further enhance both interpretability and downstream performance.
Current limitations include sensitivity to the clustering quality and the assumed adequacy of the contrastive embedding for all downstream tasks. The uniform split heuristic in large leaves introduces approximation error, though bounded in practice.
HCDV represents an advance in efficient, scalable, and theoretically substantiated data valuation, achieving practical applicability across multiple domains, including streaming and fair exchange, while maintaining approximate compliance with key Shapley principles (Xiao et al., 22 Dec 2025).