
HyperSHAP: Game-Theoretic HPO Analysis

Updated 13 January 2026
  • The paper introduces HyperSHAP as a game-theoretic framework that uses Shapley values to decompose hyperparameter optimization performance into main effects and interactions.
  • It employs specific explanation games—ablation, tunability, and sensitivity—to measure contributions and detect optimizer biases with both local and global analysis.
  • Empirical evaluations demonstrate that lower-order (pairwise or third-order) interactions robustly capture HPO dynamics, enabling improved hyperparameter space reduction and optimizer selection.

HyperSHAP is a game-theoretic framework for quantifying and explaining hyperparameter importance and interaction structure in hyperparameter optimization (HPO). It leverages Shapley values—originally formulated for cooperative game theory—to provide both local (per-configuration) and global (dataset-averaged) decompositions of HPO performance into additive main effects and interactions. The framework supports diverse analyses, including ablation, tunability, optimizer bias, and dynamic adaptation of the configuration space in multi-objective optimization (Wever et al., 3 Feb 2025, Theodorakopoulos et al., 6 Jan 2026).

1. Shapley Values and Interaction Indices in HPO

HyperSHAP models hyperparameters as players in a cooperative game, with the "payoff" defined via an explanation game $\nu: 2^N \rightarrow \mathbb{R}$ over coalitions $S \subseteq N$ (where $N$ indexes the set of hyperparameters). The Shapley value $\phi_i(\nu)$ uniquely allocates additive credit to each hyperparameter $i$ for the overall gain and satisfies the efficiency, symmetry, linearity, and dummy axioms. Alternative expressions for the Shapley value include:

  • Permutation form:

$$\phi_i(\nu) = \sum_{\pi \in \mathrm{Perm}(N)} \frac{1}{n!} \left[\nu(S_\pi(i) \cup \{i\}) - \nu(S_\pi(i))\right]$$

where $S_\pi(i) = \{j \in N : \pi(j) < \pi(i)\}$ denotes the set of players preceding $i$ in the permutation $\pi$.

  • Subset form:

$$\phi_i(\nu) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!} \left[\nu(S \cup \{i\}) - \nu(S)\right]$$

Pairwise Shapley interaction indices $\phi_{i,j}(\nu)$ capture non-additive effects:

$$\phi_{i,j}(\nu) = \sum_{S \subseteq N \setminus \{i,j\}} \frac{|S|!\,(n-|S|-2)!}{2\,(n-1)!} \left(\nu(S \cup \{i,j\}) - \nu(S \cup \{i\}) - \nu(S \cup \{j\}) + \nu(S)\right)$$

Higher-order interactions up to order $k$ can be defined analogously, with $k = n$ recovering the full Möbius transform across all $2^n$ pure interactions.
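
As a concrete illustration of the subset-form expressions above, the following minimal Python sketch computes exact Shapley values and pairwise interaction indices for a value function stored as a dictionary over coalitions; the three-hyperparameter toy game and its values are hypothetical, not taken from the paper:

```python
from itertools import combinations
from math import factorial

def shapley_value(nu, players, i):
    """Exact Shapley value of player i (subset form of Section 1)."""
    n = len(players)
    others = [p for p in players if p != i]
    total = 0.0
    for size in range(n):
        for S in combinations(others, size):
            S = frozenset(S)
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += w * (nu[S | {i}] - nu[S])
    return total

def pairwise_interaction(nu, players, i, j):
    """Exact pairwise Shapley interaction index of players i and j."""
    n = len(players)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(n - 1):
        for S in combinations(others, size):
            S = frozenset(S)
            w = factorial(len(S)) * factorial(n - len(S) - 2) / (2 * factorial(n - 1))
            total += w * (nu[S | {i, j}] - nu[S | {i}] - nu[S | {j}] + nu[S])
    return total

# Hypothetical 3-hyperparameter game: nu maps coalitions to validation accuracy.
players = ["lr", "depth", "dropout"]
nu = {frozenset(S): v for S, v in [
    ((), 0.70), (("lr",), 0.80), (("depth",), 0.74), (("dropout",), 0.71),
    (("lr", "depth"), 0.88), (("lr", "dropout"), 0.82),
    (("depth", "dropout"), 0.76), (("lr", "depth", "dropout"), 0.90),
]}
for p in players:
    print(p, round(shapley_value(nu, players, p), 4))
print("lr x depth:", round(pairwise_interaction(nu, players, "lr", "depth"), 4))
```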

2. HyperSHAP Explanation Games: Ablation, Tunability, and Sensitivity

Several specific explanation games are constructed:

  • Ablation Game ($\nu_A$): Measures the contribution of a coalition $S$ by evaluating the performance of a target configuration $\lambda^*$ relative to a baseline $\lambda^0$, with only the parameters in $S$ set to their values in $\lambda^*$. This quantifies ablation-style, post-hoc attributions:

$$\nu_A(S) := \mathrm{Val}(\lambda^* \oplus_S \lambda^0, D)$$

where $\lambda^* \oplus_S \lambda^0$ denotes the configuration that takes the values of $\lambda^*$ on the coordinates in $S$ and those of $\lambda^0$ elsewhere.

  • Tunability Game ($\nu_T$): Captures the maximal benefit of tuning the parameters in $S$, holding all others at the baseline:

$$\nu_T(S) := \max_{\lambda \in \Lambda} \mathrm{Val}(\lambda \oplus_S \lambda^0, D)$$

This game is monotone in $S$, and its Shapley values decompose the total tunability gain from $\lambda^0$ to the global optimum.

  • Sensitivity Game ($\nu_V$): Based on variance decomposition, often yielding attributions different from tunability, especially when hyperparameter domains differ.

The Shapley decomposition satisfies the efficiency property:

$$\sum_{i \in N} \phi_i + \sum_{i<j} \phi_{i,j} + \cdots = \nu(N) - \nu(\emptyset)$$

where, for the tunability game, $\nu_T(\emptyset) = \mathrm{Val}(\lambda^0, D)$ and $\nu_T(N) = \max_{\lambda \in \Lambda} \mathrm{Val}(\lambda, D)$.
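
To make the game definitions concrete, here is a minimal Python sketch of the ablation and tunability value functions; the performance oracle `val`, the search grid, and the configurations are hypothetical placeholders, and the tunability maximum is approximated by exhaustive search over the small grid:

```python
from itertools import product

def patch(target, baseline, S):
    """lambda* (+)_S lambda^0: target values on S, baseline elsewhere."""
    return {k: (target[k] if k in S else baseline[k]) for k in baseline}

def nu_ablation(val, target, baseline, S):
    """nu_A(S): performance with only the parameters in S set to the target."""
    return val(patch(target, baseline, S))

def nu_tunability(val, grid, baseline, S):
    """nu_T(S): best achievable performance when only S may be tuned."""
    keys = list(S)
    best = float("-inf")
    for values in product(*(grid[k] for k in keys)):
        best = max(best, val(dict(baseline, **dict(zip(keys, values)))))
    return best

# Hypothetical performance oracle with a main effect per parameter
# plus a small lr-depth interaction.
def val(cfg):
    return (0.70 + 0.10 * (cfg["lr"] == 0.01) + 0.05 * (cfg["depth"] == 4)
            + 0.03 * (cfg["lr"] == 0.01 and cfg["depth"] == 4))

grid = {"lr": [0.1, 0.01, 0.001], "depth": [2, 4, 8]}
baseline = {"lr": 0.1, "depth": 2}
target = {"lr": 0.01, "depth": 4}

print(round(nu_ablation(val, target, baseline, {"lr"}), 2))           # 0.8
print(round(nu_tunability(val, grid, baseline, {"lr", "depth"}), 2))  # 0.88
```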

3. Global and Local Explanations: Algorithms and Computational Aspects

HyperSHAP distinguishes between local explanations (per specific configuration or trial) and global explanations (averaged over datasets or configurations):

  • Local: Fixes a specific $\lambda^*$ and applies the ablation game.
  • Global: Aggregates results from ablation or tunability games across datasets or multiple configurations.

For global tunability explanations, the following algorithmic outline is used:

```
Input: default λ^0, search space Λ, dataset D, performance Val, explanation order k, optimizers O₁...O_m
1. For each S ⊆ N with |S| ≤ k: approximate ν_T(S) via restricted optimizer runs
2. Compute φ_i and φ_{i,j} by applying the Shapley formulas to {ν_T(S)}
3. Optionally, average ν_T^D(S) across datasets before computing attributions
```

Sampling-based Shapley estimators and $k$-additive truncation (with $k = 1, 2, 3$) using the Faithful Shapley Interaction Index (FSII) are employed to reduce computational cost for large $n$ (Wever et al., 3 Feb 2025).
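
A minimal Python sketch of step 1 under stated assumptions: the performance oracle `val`, the dictionary-based search space, and the random-search budget are illustrative placeholders rather than the paper's implementation, and the returned table feeds steps 2–3:

```python
import random
from itertools import combinations

def approx_nu_T(val, space, baseline, S, budget=50, rng=random):
    """Step 1: approximate nu_T(S) by random search restricted to S,
    holding all other hyperparameters at the baseline."""
    best = val(dict(baseline))  # S = empty set reduces to the baseline
    for _ in range(budget):
        cand = dict(baseline)
        for name in S:
            cand[name] = rng.choice(space[name])
        best = max(best, val(cand))
    return best

def tunability_table(val, space, baseline, k):
    """Evaluate nu_T(S) for every coalition S with |S| <= k; the result
    feeds the Shapley / FSII formulas of Section 1 (steps 2-3)."""
    names = list(space)
    return {frozenset(S): approx_nu_T(val, space, baseline, S)
            for size in range(k + 1)
            for S in combinations(names, size)}
```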

4. Empirical Evaluation: Benchmarks, Interactions, and Optimizer Insights

HyperSHAP has demonstrated utility across benchmarks such as:

  • lcbench (Auto-PyTorch neural network tuning on OpenML classification datasets)
  • rbv2_ranger (random-forest tuning on 119 tasks)
  • PD1 (transformers and image-classifier tuning)
  • JAHS-Bench-201 (joint architecture+HPO tasks)

Key empirical findings include:

  • Higher-order Möbius interactions exist but are robustly summarized by pairwise or third-order interactions ($R^2 \approx 0.99$ at $k = 3$), with lower-order effects typically dominating.
  • Negative pairwise $\phi_{i,j}$ values often identify redundancy, indicating that two hyperparameters each provide improvements, but not in a fully additive manner.
  • The "optimizer bias" game, comparing surrogate-based tunability to actual optimizer performance, identifies synergistic effects missed by specific optimizers (e.g., independent tuning yields zero $\phi_i$ but nonzero $\phi_{i,j}$; see the toy game after this list).
  • Downstream, restricting HPO runs to the top hyperparameters identified by HyperSHAP outperforms selections by functional-ANOVA or variance-based methods in anytime performance (Wever et al., 3 Feb 2025).
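
To illustrate the last mechanism on a hypothetical pure-interaction game (not an example from the paper), suppose two hyperparameters improve performance only when tuned jointly:

$$\nu(\emptyset) = \nu(\{1\}) = \nu(\{2\}) = 0, \qquad \nu(\{1,2\}) = 1$$

At full order $k = n = 2$ the decomposition coincides with the Möbius transform, so $\phi_1 = \phi_2 = 0$ while $\phi_{1,2} = 1$: the entire gain sits in the pairwise term, which an optimizer tuning each hyperparameter independently can never realize, and efficiency holds since $0 + 0 + 1 = \nu(N) - \nu(\emptyset)$.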

5. Scalability, Approximation, and Computational Complexity

Computing exact Shapley attributions for $n$ hyperparameters requires $2^n$ evaluations per value function, which is intractable for $n \gg 20$. HyperSHAP addresses this with:

  • Surrogate-based HPO benchmarks (e.g., YAHPO-Gym) for efficient performance simulation.
  • Random-search approximations of $\mathrm{argmax}_{\lambda \in \Lambda_S}$ in tunability games.
  • Monte Carlo estimators for the Shapley value, sampling coalitions or permutations instead of full enumeration (sketched after this list).
  • Restricting to $k$-additive interactions (typically $k = 2$ or $3$) for practical computation; for $n \approx 10$–$15$, full enumeration remains feasible.
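
A generic permutation-sampling sketch of such a Monte Carlo estimator; here `nu` is any callable value function (e.g., a cached tunability game), and this is an illustrative estimator rather than the paper's exact implementation:

```python
import random

def shapley_permutation_mc(nu, players, T=1000, seed=0):
    """Monte Carlo Shapley estimate: average each player's marginal
    contribution over T uniformly sampled permutations."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(T):
        order = list(players)
        rng.shuffle(order)
        S, prev = frozenset(), nu(frozenset())
        for p in order:
            S = S | {p}
            cur = nu(S)  # one value-function evaluation per step
            phi[p] += cur - prev
            prev = cur
    return {p: v / T for p, v in phi.items()}
```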

Typical computation times are 5–125 s for ablation games, 350–30,000 s for tunability games, and up to 10,000 s for multi-dataset tunability, assuming single-CPU operation and sampling-based estimators (Wever et al., 3 Feb 2025).

6. Dynamic Importance in Multi-Objective Optimization

HyperSHAP has been adapted for dynamic importance estimation in multi-objective optimization (MOO), as in HPI-ParEGO (Theodorakopoulos et al., 6 Jan 2026). In this context:

  • Each hyperparameter is a player in a cooperative game where the payoff is the improvement in a surrogate-predicted scalarized objective (via ParEGO scalarization).
  • At each ParEGO iteration, HyperSHAP samples subsets $S$, estimates the marginal contribution of each $\lambda^{(j)}$ to the improvement, and aggregates these into an importance vector $(\phi_1, \ldots, \phi_d)$.
  • The configuration space is dynamically reduced by fixing hyperparameters with low $\phi_j$, determined via a "Symmetric-0.8" schedule (aggressive reduction during the middle third of trials; see the sketch after this list).
  • Empirical evaluation on PyMOO (ZDT1–4, ZDT6) and YAHPO-Gym (LCBench and rbv2_ranger) demonstrates faster convergence and improved Pareto front quality relative to standard ParEGO, MO-TPE, NSGA-II, and DE baselines.
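
The following sketch illustrates one plausible reading of the dynamic-reduction step; the schedule predicate, the interpretation of the 0.8 threshold as retained importance mass, and the fixing policy are all assumptions for illustration, not the published HPI-ParEGO code:

```python
def reduce_space(importances, trial, n_trials, keep_mass=0.8):
    """One plausible 'Symmetric-0.8' reading (assumption, not the paper's
    code): during the middle third of the trial budget, keep tunable only
    the most important hyperparameters covering keep_mass of the total
    importance; the rest are fixed to their incumbent values."""
    if not (n_trials / 3 <= trial < 2 * n_trials / 3):
        return set(importances)  # outside the window: tune everything
    total = sum(max(v, 0.0) for v in importances.values()) or 1.0
    kept, mass = set(), 0.0
    for name, phi in sorted(importances.items(), key=lambda kv: -kv[1]):
        kept.add(name)
        mass += max(phi, 0.0) / total
        if mass >= keep_mass:
            break
    return kept
```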

The computational overhead of HyperSHAP in HPI-ParEGO scales as $O(T \cdot d \cdot \log n)$, where $T$ is the Monte Carlo sample count and $d$ the number of hyperparameters. This is substantially lower than that of real function evaluations (Theodorakopoulos et al., 6 Jan 2026).

7. Assumptions, Limitations, and Generalization

HyperSHAP's validity depends on several assumptions:

  • The baseline configuration $\lambda^0$ must be representative; alternatives with marginal or conditional baselines are possible.
  • The surrogate model for performance prediction should approximate the true learner’s performance with sufficient fidelity.
  • The monotonicity of the tunability game is required for nonnegativity and interpretability of $\phi_i$.
  • Approximate attributions rely on adequate sampling; very high-dimensional settings may require further truncation or alternative approaches.

A plausible implication is that the Faithful Shapley Interaction Index (FSII) approach, combined with aggressive configuration space adaptation, enables scalable application of HyperSHAP even in the presence of many objectives or large parameter sets.

