HyperSHAP: Game-Theoretic HPO Analysis
- The paper introduces HyperSHAP as a game-theoretic framework that uses Shapley values to decompose hyperparameter optimization performance into main effects and interactions.
- It employs specific explanation games—ablation, tunability, and sensitivity—to measure contributions and detect optimizer biases with both local and global analysis.
- Empirical evaluations demonstrate that lower-order (pairwise or third-order) interactions robustly capture HPO dynamics, enabling improved hyperparameter space reduction and optimizer selection.
HyperSHAP is a game-theoretic framework for quantifying and explaining hyperparameter importance and interaction structure in hyperparameter optimization (HPO). It leverages Shapley values—originally formulated for cooperative game theory—to provide both local (per-configuration) and global (dataset-averaged) decompositions of HPO performance into additive main effects and interactions. The framework supports diverse analyses, including ablation, tunability, optimizer bias, and dynamic adaptation of the configuration space in multi-objective optimization (Wever et al., 3 Feb 2025, Theodorakopoulos et al., 6 Jan 2026).
1. Shapley Values and Interaction Indices in HPO
HyperSHAP models hyperparameters as players in a cooperative game, with the "payoff" defined via an explanation game $\nu : 2^N \to \mathbb{R}$ over coalitions $S \subseteq N$ (where $N = \{1, \dots, n\}$ indexes the set of hyperparameters). The Shapley value $\phi_i$ uniquely allocates additive credit to each hyperparameter $i$ for the overall gain and satisfies the efficiency, symmetry, linearity, and dummy axioms. Alternative expressions for the Shapley value include:
- Permutation form:
$$\phi_i = \frac{1}{n!} \sum_{\pi \in \Pi(N)} \big[\nu(S_\pi^i \cup \{i\}) - \nu(S_\pi^i)\big],$$
where $S_\pi^i$ denotes the set of hyperparameters preceding $i$ in permutation $\pi$.
- Subset form:
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \big[\nu(S \cup \{i\}) - \nu(S)\big].$$
Pairwise Shapley interaction indices capture non-additive effects:
$$\phi_{i,j} = \sum_{S \subseteq N \setminus \{i,j\}} \frac{|S|!\,(n - |S| - 2)!}{(n - 1)!} \big[\nu(S \cup \{i,j\}) - \nu(S \cup \{i\}) - \nu(S \cup \{j\}) + \nu(S)\big].$$
Higher-order interactions up to order $k$ can be defined, with $k = n$ recovering the full Möbius transform across pure interactions.
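As a concrete illustration, the following minimal Python sketch computes exact Shapley values via the subset form for a hypothetical three-hyperparameter value function; the hyperparameter names, payoffs, and interaction term are illustrative stand-ins, not values from the paper.

```python
# Minimal sketch: exact Shapley values via the subset form for a toy value
# function over hyperparameters. `nu` is a hypothetical stand-in for any
# HyperSHAP explanation game (ablation, tunability, ...).
from itertools import combinations
from math import factorial

players = ["learning_rate", "batch_size", "weight_decay"]  # hypothetical hyperparameters
n = len(players)

def nu(coalition: frozenset) -> float:
    """Toy payoff: main effects plus a synergy between learning_rate and weight_decay."""
    gain = {"learning_rate": 0.05, "batch_size": 0.01, "weight_decay": 0.02}
    value = sum(gain[p] for p in coalition)
    if {"learning_rate", "weight_decay"} <= coalition:
        value += 0.03  # pairwise interaction
    return value

def shapley(i: str) -> float:
    """Subset-form Shapley value of hyperparameter i."""
    others = [p for p in players if p != i]
    phi = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            S = frozenset(S)
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (nu(S | {i}) - nu(S))
    return phi

for p in players:
    print(f"phi_{p} = {shapley(p):.4f}")
# Efficiency check: the phi_i sum to nu(N) - nu(emptyset) = 0.11 in this toy game.
```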
2. HyperSHAP Explanation Games: Ablation, Tunability, and Sensitivity
Several specific explanation games are constructed:
- Ablation Game ($\nu_A$): Measures the contribution of a coalition $S \subseteq N$ by evaluating the performance of a target configuration $\lambda^*$ relative to a baseline $\lambda^0$, with only the parameters in $S$ set to their values in $\lambda^*$. This quantifies ablation-style, post-hoc attributions:
$$\nu_A(S) = \mathrm{Val}\big(\lambda^*_S \oplus \lambda^0_{N \setminus S}\big),$$
where $\lambda^*_S \oplus \lambda^0_{N \setminus S}$ denotes the configuration taking values from $\lambda^*$ on $S$ and from $\lambda^0$ elsewhere.
- Tunability Game ($\nu_T$): Captures the maximal benefit of tuning the hyperparameters in $S$, holding all other parameters at the baseline:
$$\nu_T(S) = \max_{\lambda_S \in \Lambda_S} \mathrm{Val}\big(\lambda_S \oplus \lambda^0_{N \setminus S}\big).$$
This game is monotone in $S$, and its Shapley values decompose the total tunability gain from $\lambda^0$ to the global optimum.
- Sensitivity Game ($\nu_S$): Based on variance decomposition of performance over the configuration space, often yielding different attributions than tunability, especially when hyperparameter domains differ.
The Shapley decomposition satisfies the efficiency property:
$$\sum_{i \in N} \phi_i = \nu(N) - \nu(\emptyset),$$
where, for the tunability game, $\nu_T(N) = \max_{\lambda \in \Lambda} \mathrm{Val}(\lambda)$ and $\nu_T(\emptyset) = \mathrm{Val}(\lambda^0)$.
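The two game definitions translate directly into code. The sketch below is a hedged illustration: the performance oracle `val`, the baseline `baseline` ($\lambda^0$), the target `target` ($\lambda^*$), and the small grid `grid` standing in for $\Lambda$ are all assumed placeholders, and grid search substitutes for the restricted optimizer runs used in practice.

```python
# Hedged sketch of the ablation and tunability games over a toy search space.
from itertools import product

baseline = {"lr": 1e-3, "dropout": 0.0}                                  # lambda^0 (assumed)
target   = {"lr": 1e-2, "dropout": 0.3}                                  # lambda^* (assumed)
grid     = {"lr": [1e-4, 1e-3, 1e-2], "dropout": [0.0, 0.3, 0.5]}        # Lambda (assumed)

def val(config: dict) -> float:
    """Stand-in for the true (or surrogate) validation performance."""
    return -(config["lr"] - 5e-3) ** 2 - (config["dropout"] - 0.3) ** 2

def nu_ablation(S: set) -> float:
    # Parameters in S take their target values; the rest stay at the baseline.
    config = {k: (target[k] if k in S else baseline[k]) for k in baseline}
    return val(config)

def nu_tunability(S: set) -> float:
    # Parameters in S are optimized over their domains (here by grid search);
    # the rest stay at the baseline. Monotone in S by construction.
    free = sorted(S)
    best = float("-inf")
    for values in product(*(grid[k] for k in free)):
        config = dict(baseline)
        config.update(dict(zip(free, values)))
        best = max(best, val(config))
    return best

print(nu_ablation({"lr"}), nu_tunability({"lr", "dropout"}))
```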
3. Global and Local Explanations: Algorithms and Computational Aspects
HyperSHAP distinguishes between local explanations (per specific configuration or trial) and global explanations (averaged over datasets or configurations):
- Local: Fixes a specific target configuration $\lambda^*$ and applies the ablation game.
- Global: Aggregates results from ablation or tunability games across datasets or multiple configurations.
For global tunability explanations, the following algorithmic outline is used:
```
Input: default configuration λ⁰, search space Λ, dataset D, performance metric Val,
       explanation order k, optimizers O₁ … O_m
1. For each coalition S ⊆ N with |S| ≤ k: approximate ν_T(S) via restricted optimizer runs
2. Compute φ_i and φ_{i,j} from {ν_T(S)} using the Shapley (interaction) formulas
3. Optionally, average ν_T^D(S) across datasets before computing attributions
```
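A self-contained Python sketch of this outline follows. The search space, baseline, dataset identifiers, and surrogate `val` are illustrative placeholders, random search stands in for the restricted optimizer runs, and the pairwise Shapley interaction index is computed from the cached coalition values.

```python
# Hedged sketch of the global tunability workflow (all names are assumptions).
import random
from itertools import combinations
from math import factorial

space = {"lr": [1e-4, 1e-3, 1e-2], "depth": [2, 4, 8], "dropout": [0.0, 0.3]}  # assumed Λ
lam0 = {"lr": 1e-4, "depth": 2, "dropout": 0.0}                                # assumed λ⁰
datasets = ["d1", "d2"]                                                        # assumed datasets

def val(cfg, dataset):
    """Stand-in for a surrogate performance predictor Val(λ, D)."""
    return -abs(cfg["lr"] - 1e-3) - 0.1 * abs(cfg["depth"] - 4) - abs(cfg["dropout"] - 0.3)

def nu_T(S, dataset, budget=32):
    """Tunability value: random search over parameters in S, rest fixed to the baseline."""
    best = val(lam0, dataset)
    for _ in range(budget):
        cfg = dict(lam0)
        for p in S:
            cfg[p] = random.choice(space[p])
        best = max(best, val(cfg, dataset))
    return best

def nu_global(S):
    """Global tunability: average ν_T^D(S) across datasets (step 3 of the outline)."""
    return sum(nu_T(S, d) for d in datasets) / len(datasets)

names = list(space)
n = len(names)
# Step 1: evaluate coalition values (all of them here; in practice restrict to |S| ≤ k).
cache = {S: nu_global(S) for size in range(n + 1)
         for S in map(frozenset, combinations(names, size))}

def interaction(i, j):
    """Step 2: pairwise Shapley interaction index φ_{i,j} from cached coalition values."""
    rest = [p for p in names if p not in (i, j)]
    total = 0.0
    for size in range(len(rest) + 1):
        for T in map(frozenset, combinations(rest, size)):
            w = factorial(len(T)) * factorial(n - len(T) - 2) / factorial(n - 1)
            total += w * (cache[T | {i, j}] - cache[T | {i}] - cache[T | {j}] + cache[T])
    return total

print({(i, j): round(interaction(i, j), 4) for i, j in combinations(names, 2)})
```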
4. Empirical Evaluation: Benchmarks, Interactions, and Optimizer Insights
HyperSHAP has demonstrated utility across benchmarks such as:
- lcbench (Auto-PyTorch neural-network tuning on 34 OpenML datasets)
- rbv2_ranger (random-forest tuning on 119 tasks)
- PD1 (transformers and image-classifier tuning)
- JAHS-Bench-201 (joint architecture+HPO tasks)
Key empirical findings include:
- Higher-order Möbius interactions exist but are robustly summarized by pairwise or third-order interactions (explanation order $k = 2$ or $k = 3$), with lower-order effects typically dominating.
- Negative pairwise interaction values $\phi_{i,j}$ often identify redundancy, indicating that two hyperparameters both provide improvements but not in a fully additive manner.
- The "optimizer bias" game, comparing surrogate-based tunability to actual optimizer performance, identifies synergistic effects missed by specific optimizers (e.g., independent tuning yields zero but nonzero ).
- Downstream, restricting HPO runs to the top-ranked hyperparameters identified by HyperSHAP outperforms selections made by functional-ANOVA or variance-based methods in anytime performance (Wever et al., 3 Feb 2025).
5. Scalability, Approximation, and Computational Complexity
Computing exact Shapley attributions for $n$ hyperparameters requires $2^n$ evaluations of the value function, which quickly becomes intractable as $n$ grows. HyperSHAP addresses this with:
- Surrogate-based HPO (e.g., YAHPO-Gym) for efficient performance simulation.
- Random search approximations of the restricted maxima $\nu_T(S)$ in tunability games.
- Monte-Carlo estimators for the Shapley value, sampling coalitions or permutations instead of enumerating all subsets (see the sketch below).
- Restricting to $k$-additive interactions (typically $k = 2$ or $3$) for practical computation; for $n$ up to roughly $15$, full enumeration remains feasible.
Typical computation times, assuming single-CPU operation and sampling-based estimators, range from 5–125 s for ablation games and 350–30,000 s for tunability games, with up to 10,000 s for multi-dataset tunability (Wever et al., 3 Feb 2025).
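The following sketch illustrates one such Monte-Carlo estimator based on permutation sampling; `nu` can be any explanation game exposed as a callable from coalitions to values, and the sampling budget and seed are arbitrary illustrative choices.

```python
# Hedged sketch: Monte-Carlo Shapley estimation via permutation sampling,
# one way to avoid enumerating all 2^n coalitions.
import random

def shapley_mc(players, nu, num_permutations=200, seed=0):
    """Average marginal contributions of each player along sampled permutations."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        perm = list(players)
        rng.shuffle(perm)
        coalition = set()
        prev = nu(frozenset(coalition))
        for p in perm:
            coalition.add(p)
            curr = nu(frozenset(coalition))
            phi[p] += curr - prev  # marginal contribution of p in this permutation
            prev = curr
    return {p: v / num_permutations for p, v in phi.items()}

# Usage (e.g., with the toy `nu` from the earlier sketch):
# print(shapley_mc(["learning_rate", "batch_size", "weight_decay"], nu))
```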
6. Dynamic Importance in Multi-Objective Optimization
HyperSHAP has been adapted for dynamic importance estimation in multi-objective optimization (MOO), as in HPI-ParEGO (Theodorakopoulos et al., 6 Jan 2026). In this context:
- Each hyperparameter is a player in a cooperative game whose payoff is the improvement in a surrogate-predicted scalarized objective (via ParEGO scalarization).
- At each ParEGO iteration, HyperSHAP samples subsets $S \subseteq N$, estimates the marginal contribution of each hyperparameter $i$ to the predicted improvement, and aggregates these into an importance vector $\phi = (\phi_1, \dots, \phi_n)$.
- The configuration space is dynamically reduced by fixing hyperparameters with low $\phi_i$, determined via a "Symmetric-0.8" schedule (aggressive reduction during the middle third of trials).
- Empirical evaluation on PyMOO (ZDT1–4, ZDT6) and YAHPO-Gym (LCBench and rbv2_ranger) demonstrates faster convergence and improved Pareto front quality relative to standard ParEGO, MO-TPE, NSGA-II, and DE baselines.
The computational overhead of HyperSHAP within HPI-ParEGO scales with the Monte-Carlo sample count $M$ and the number of hyperparameters $n$; since only the surrogate is queried, this cost is substantially lower than that of real function evaluations (Theodorakopoulos et al., 6 Jan 2026).
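A hedged sketch of the dynamic reduction step is given below. The helper names (`importance_vector`, `reduce_space`, `sample_candidate`, the stand-in `surrogate`) and the exact reduction rule are assumptions made for illustration, not the paper's implementation; they follow the description above (sample subsets, accumulate marginal contributions into $\phi$, then fix low-importance hyperparameters).

```python
# Hedged sketch of dynamic importance estimation and space reduction.
import random

def importance_vector(names, incumbent, sample_candidate, predicted_improvement,
                      num_samples=64, seed=0):
    """Estimate phi_i as the average marginal change in predicted improvement
    when hyperparameter i is re-sampled within a randomly drawn coalition."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in names}
    counts = {p: 0 for p in names}
    for _ in range(num_samples):
        subset = [p for p in names if rng.random() < 0.5]  # random coalition S ⊆ N
        cfg = dict(incumbent)
        base = predicted_improvement(cfg)
        for p in subset:
            cfg[p] = sample_candidate(p)                   # perturb player p
            new = predicted_improvement(cfg)
            phi[p] += new - base
            counts[p] += 1
            base = new
    return {p: (phi[p] / counts[p] if counts[p] else 0.0) for p in names}

def reduce_space(space, incumbent, phi, keep_fraction=0.8):
    """Fix the lowest-importance hyperparameters to their incumbent values,
    keeping only the top fraction of players tunable."""
    ranked = sorted(space, key=lambda p: phi[p], reverse=True)
    keep = set(ranked[: max(1, int(keep_fraction * len(ranked)))])
    return {p: (space[p] if p in keep else [incumbent[p]]) for p in space}

# Toy usage with a stand-in for the ParEGO-scalarized surrogate prediction.
space = {"lr": [1e-4, 1e-3, 1e-2], "depth": [2, 4, 8], "dropout": [0.0, 0.3]}  # assumed
incumbent = {"lr": 1e-3, "depth": 4, "dropout": 0.0}                           # assumed
surrogate = lambda c: -abs(c["lr"] - 1e-2) - 0.01 * abs(c["depth"] - 8) - abs(c["dropout"] - 0.3)
phi = importance_vector(list(space), incumbent,
                        lambda p: random.choice(space[p]), surrogate)
print(phi)
print(reduce_space(space, incumbent, phi))
```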
7. Assumptions, Limitations, and Generalization
HyperSHAP's validity depends on several assumptions:
- The baseline configuration $\lambda^0$ must be representative; alternatives using marginal or conditional baselines are possible.
- The surrogate model for performance prediction should approximate the true learner’s performance with sufficient fidelity.
- The monotonicity of the tunability game is required for nonnegativity and interpretability of the attributions $\phi_i$.
- Approximate attributions rely on adequate sampling; very high-dimensional settings may require further truncation or alternative approaches.
A plausible implication is that the Faithful Shapley Interaction Index (FSII) approach, combined with aggressive configuration space adaptation, enables scalable application of HyperSHAP even in the presence of many objectives or large parameter sets.
References:
- "HyperSHAP: Shapley Values and Interactions for Hyperparameter Importance" (Wever et al., 3 Feb 2025)
- "Dynamic Hyperparameter Importance for Efficient Multi-Objective Optimization" (Theodorakopoulos et al., 6 Jan 2026)