
Shapley Value of Whole Columns

Updated 16 January 2026
  • Shapley value of whole columns is a principled method that attributes feature utility by averaging marginal contributions over all subsets, based on cooperative game theory.
  • The approach leverages both sampling-based and nonparametric regression methods to efficiently approximate contributions in high-dimensional datasets.
  • Practical variants like SHAP and SAGE illustrate how different payoff functions yield distinct feature attributions, highlighting trade-offs in model interpretation and feature selection.

The Shapley value of whole columns is a principled, axiomatic approach to attributing aggregate utility or importance to each feature column in a dataset, based on averaging the marginal contributions that features make to a model’s performance across all possible subsets. Originating in cooperative game theory and widely adopted in machine learning for Explainable AI (XAI), the Shapley value treats each feature as an agent in a game where the payoff is the model’s evaluation function—such as $R^2$, expected log-loss reduction, or information criteria—computed on arbitrary subsets of features. Computation and interpretation of whole-column Shapley values, together with their limitations for feature selection, comprise a central topic in the current literature (Fryer et al., 2021, Miftachov et al., 2022, Li et al., 2024).

1. Formal Definition of the Shapley Value for Feature Columns

Consider a set $N = \{1, 2, \dots, d\}$ indexing the $d$ feature columns of a dataset. Let $v : 2^N \to \mathbb{R}$ be a payoff function assigning to each subset $S \subseteq N$ the value $v(S)$, representing the performance achieved by a model utilizing exactly the features in $S$ (with $v(\emptyset) = 0$ by convention). The Shapley value $\varphi_i$ for feature $i \in N$ is defined as

$$\varphi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(d - |S| - 1)!}{d!} \big[v(S \cup \{i\}) - v(S)\big]$$

Here, $S$ ranges over all subsets not containing $i$, the combinatorial weight reflects the fraction of all feature orderings in which exactly the features in $S$ precede $i$, and the bracketed term is the marginal contribution of $i$ to $S$. Intuitively, $\varphi_i$ quantifies the average extra gain from feature $i$ across all possible contexts in which it may be added (Fryer et al., 2021).
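
The definition above can be computed exactly for small $d$ by enumerating all subsets. The sketch below uses a toy payoff of my own construction (set size plus an interaction bonus between features 0 and 1) purely to illustrate the formula; the function names are not from the cited papers.

```python
from itertools import combinations
from math import factorial

def shapley_values(d, v):
    """Exact Shapley values by enumerating all 2^d feature subsets.

    d : number of feature columns
    v : payoff function mapping a frozenset of feature indices to a float,
        with v(frozenset()) == 0 by convention.
    """
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                # combinatorial weight |S|! (d - |S| - 1)! / d!
                w = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
                phi[i] += w * (v(S | {i}) - v(S))
    return phi

# toy payoff: size of S plus a bonus when features 0 and 1 co-occur
def v(S):
    return len(S) + (1.0 if {0, 1} <= S else 0.0)

print(shapley_values(3, v))  # approximately [1.5, 1.5, 1.0]
```

The interaction bonus is split evenly between features 0 and 1 (0.5 each), while feature 2, whose marginal contribution is always exactly 1, receives exactly that.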

2. Axiomatic Properties and Their Role

The classical Shapley value allocation is uniquely determined by four axioms interpreted in the feature-column setting:

  • Efficiency: $\sum_{i \in N} \varphi_i = v(N)$. The total full-model utility is exactly allocated among the $d$ features.
  • Symmetry: If two features always contribute the same increment to every coalition, their Shapley values are equal.
  • Dummy (Null Player): If a feature never increases performance in any subset, its Shapley value is zero.
  • Additivity: The Shapley value distributes over payoff functions: for payoffs $v, w$, $\varphi_i(v + w) = \varphi_i(v) + \varphi_i(w)$.

These axioms underlie a form of "game-theoretic fairness": each feature’s score depends on marginal utility averaged over every possible subset (Fryer et al., 2021). However, this collective rationality does not align perfectly with typical feature selection objectives (see Section 4).
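
For $d = 2$ the Shapley value has a simple closed form, which makes the efficiency and dummy axioms easy to verify numerically. The two-feature game below is my own illustrative example, not one from the cited papers.

```python
# Closed-form Shapley values for a two-feature game (d = 2):
# each feature's value averages its marginal gain over the two orderings.
def shapley_2(v):
    e = frozenset()
    phi1 = 0.5 * (v[frozenset({1})] - v[e]) \
         + 0.5 * (v[frozenset({1, 2})] - v[frozenset({2})])
    phi2 = 0.5 * (v[frozenset({2})] - v[e]) \
         + 0.5 * (v[frozenset({1, 2})] - v[frozenset({1})])
    return phi1, phi2

# Feature 2 is a dummy: it never changes the payoff in any coalition.
v = {frozenset(): 0.0, frozenset({1}): 0.8,
     frozenset({2}): 0.0, frozenset({1, 2}): 0.8}
p1, p2 = shapley_2(v)
print(p1, p2)   # dummy axiom: p2 is exactly 0
print(p1 + p2 == v[frozenset({1, 2})])  # efficiency: shares sum to v(N)
```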

3. Estimation Strategies for Shapley Values

Direct computation of all $2^d$ marginal contributions is infeasible for large $d$, prompting both analytical and statistical approximations:

3.1. Sampling-Based Approximation (OFA–A Framework)

The OFA–A (One-For-All) framework provides a unified stochastic estimator to efficiently approximate Shapley values for each feature:

  • Let $n = d$ be the number of features.
  • For subset sizes $s = 2, \dots, n-2$, define the sampling probabilities $q_{s-1} \propto 1/\sqrt{s(n-s)}$, normalized so that they sum to one.
  • Draw $T = O\big(n \epsilon^{-2} \log(n/\delta)\big)$ samples of random subsets of size $s$, recording, for each feature $i$, whether it is present in $S$ (updating $\widehat{\varphi}^+_{i,s}$) or absent (updating $\widehat{\varphi}^-_{i,s-1}$).
  • The overall Shapley estimate is computed as

$$\widehat{\varphi}_i = \sum_{s=1}^n m_s \big( \widehat{\varphi}^+_{i,s} - \widehat{\varphi}^-_{i,s-1} \big)$$

with $m_s = \binom{n-1}{s-1}/n$ (Li et al., 2024).
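
A minimal sketch of sampling-based estimation is the classic permutation sampler, a simpler cousin of the stratified OFA–A scheme: each random feature ordering yields one marginal-gain sample per feature. The toy payoff here is my own illustration, not from Li et al. (2024).

```python
import random
from statistics import mean

def shapley_permutation(d, v, n_perm=2000, seed=0):
    """Monte Carlo Shapley estimate via random feature orderings.

    Each permutation is traversed once; the gain from adding feature i
    to the features preceding it is one sample of i's marginal contribution.
    """
    rng = random.Random(seed)
    gains = [[] for _ in range(d)]
    features = list(range(d))
    for _ in range(n_perm):
        rng.shuffle(features)
        S, prev = frozenset(), 0.0
        for i in features:
            S = S | {i}
            cur = v(S)
            gains[i].append(cur - prev)
            prev = cur
    return [mean(g) for g in gains]

def v(S):  # toy payoff: set size plus an interaction bonus
    return len(S) + (1.0 if {0, 1} <= S else 0.0)

est = shapley_permutation(3, v)
print(est)  # close to the exact values [1.5, 1.5, 1.0] for this game
```

Unlike the OFA–A estimator, this sampler gives no simultaneous $(\epsilon, \delta)$ guarantee across features, but it conveys the core idea: replace the exponential sum by an average over sampled coalitions.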

3.2. Nonparametric Regression-Based Estimation

For the regression problem $Y = m(X) + \varepsilon$, the population-level Shapley curve for feature $j$ is

$$\phi_j(x) = \sum_{S \subseteq N \setminus \{j\}} w_{j, S} \left[ m_{S \cup \{j\}}(x_S, x_j) - m_S(x_S) \right]$$

with $w_{j,S} = \frac{1}{d} \binom{d-1}{|S|}^{-1}$. The global (integrated) Shapley value is

$$\Phi_j = E_X[\phi_j(X)]$$

Two estimation approaches are:

  • Component-based: Separate local-linear regressions for all subset regressions $m_S(x_S)$, with plug-in estimates for the differences.
  • Integration-based: A full $d$-variate regression is fitted, with estimates of marginal means constructed via Monte Carlo or kernel methods (Miftachov et al., 2022).

Statistical theory guarantees minimax-optimal convergence rates $O(n^{-4/(4+d)})$ for mean-integrated squared error under appropriate smoothness (Miftachov et al., 2022).

A wild-bootstrap procedure enables valid confidence bands for the estimated $\phi_j(x)$ by residual reweighting and local refitting.
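
The Shapley-curve definition can be made concrete on a toy regression where the conditional means have closed form, so no estimation is needed. The model below (independent uniform features, an additive-plus-interaction regression function) is my own illustration, not an example from Miftachov et al. (2022).

```python
# Shapley curve phi_1(x) for the toy model m(x1, x2) = x1 + x2 + x1*x2
# with X1, X2 independent Uniform(-1, 1). The submodel regressions are
# the conditional means, which are closed-form here:
#   m_{}        = 0           (E[m(X)] = 0)
#   m_{1}(x1)   = x1          (E[X2] = 0 kills the other terms)
#   m_{2}(x2)   = x2
#   m_{12}(x)   = m(x)
def phi_1(x1, x2):
    m_empty = 0.0
    m_1 = x1
    m_2 = x2
    m_12 = x1 + x2 + x1 * x2
    # weights w_{1,S} = (1/d) * C(d-1, |S|)^{-1}; with d = 2 both equal 1/2
    return 0.5 * (m_1 - m_empty) + 0.5 * (m_12 - m_2)

for x1 in (-1.0, 0.0, 1.0):
    print(x1, phi_1(x1, 0.5))  # the curve is linear in x1 at fixed x2
```

Here the interaction term $x_1 x_2$ is split evenly between the two features, so the curve for feature 1 is $\phi_1(x) = x_1 + x_1 x_2 / 2$: the local contribution depends on where in the sample space it is evaluated, which is exactly what the curve view adds over a single global number.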

4. Failures of Shapley Value for Feature Selection

Despite their theoretical fairness, Shapley values may counteract the goal of identifying compact, parsimonious feature subsets. Three illustrative counterexamples confirm this:

  • Taxicab payoff: Irrelevant features can receive strictly positive Shapley values if they improve performance only in suboptimal models, not in the global optimum.
  • Secret holder problem: Essential features with crucial conditional contributions may not receive maximal Shapley credit if their marginal contributions are hidden except in particular coalitions.
  • Monotonic suboptimality: Under non-monotonic payoffs (e.g., AIC/BIC), the efficiency axiom forces credit allocation to features that optimal model selection would exclude.

Simulations show that mean-absolute SHAP (using predicted values as payoffs) routinely selects spurious features in Markov boundary and interaction models, while SAGE (using expected-loss difference) more robustly highlights features central to the optimal predictive submodel—but is not universally perfect (Fryer et al., 2021).
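
The taxicab-style pathology is easy to reproduce in a two-feature game of my own construction (not the specific example from Fryer et al., 2021): feature 1 adds nothing on top of feature 0, so the optimal subset is $\{0\}$ alone, yet feature 1 still earns positive credit from the suboptimal coalition where feature 0 is absent.

```python
# Two-feature game: feature 1 is useless at the optimum
# (v({0,1}) == v({0})), but helps the suboptimal coalition {} -> {1}.
v = {frozenset(): 0.0, frozenset({0}): 5.0,
     frozenset({1}): 1.0, frozenset({0, 1}): 5.0}

phi0 = 0.5 * (v[frozenset({0})] - v[frozenset()]) \
     + 0.5 * (v[frozenset({0, 1})] - v[frozenset({1})])
phi1 = 0.5 * (v[frozenset({1})] - v[frozenset()]) \
     + 0.5 * (v[frozenset({0, 1})] - v[frozenset({0})])
print(phi0, phi1)  # 4.5 0.5 -- the redundant feature gets positive credit
```

A selection rule that keeps every feature with positive Shapley value would retain feature 1 even though the optimal model excludes it.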

5. Variants and Interpretations: SHAP, SAGE, and Shapley Curves

Two major practical applications of column-wise Shapley value are:

  • SHAP (SHapley Additive exPlanations): Computes local Shapley values for individual predictions, using conditional expectation of model output under partially missing features; global SHAP importance averages these absolute values over data points. SHAP’s payoff function is based on prediction, potentially leading to averaging over submodels not optimal for feature selection (Fryer et al., 2021).
  • SAGE (Shapley Additive Global importance): Employs a payoff function based on reduction in expected model loss (e.g., cross-entropy); resulting SAGE values correlate more closely with true submodel relevance (Fryer et al., 2021).
  • Shapley curves: Provide a continuous function $\phi_j(x)$ decomposing a feature’s local contribution across the feature and sample space, with statistical estimation techniques for both pointwise and global quantities and associated uncertainty bands (Miftachov et al., 2022).

The choice of evaluation function $v(S)$ is critical: SHAP and SAGE yield different attribution behaviors for the same dataset and model, especially under structural confounding or non-monotonic scoring (Fryer et al., 2021, Miftachov et al., 2022).

6. Practical Lessons, Recommendations, and Statistical Guarantees

The global Shapley value for a column reflects marginal importance averaged over all submodels, not just those in or near the optimum. Several recommendations have emerged:

  • Match $v(S)$ to the inferential or predictive goal: use SAGE’s loss-based $v$ when overall model performance is the target.
  • Recognize that Shapley-based rankings can be misaligned with both Markov boundary membership and minimal-optimal feature subsets.
  • For non-monotonic payoffs, the Shapley efficiency constraint allocates credit to globally irrelevant features.
  • Estimating Shapley value sampling variance or constructing confidence intervals for attributions is essential; high variance indicates instability of rankings (Fryer et al., 2021, Miftachov et al., 2022).
  • For large $d$, employ scalable techniques: efficient MC sampling (OFA–A), kernel approximations, and tree-structured recursions (Li et al., 2024).
  • Leverage domain knowledge for pre-selection or fixed-inclusion to mitigate known pathologies such as taxicab payoff effects (Fryer et al., 2021).

The wild bootstrap for nonparametric Shapley curves yields consistent confidence bands, while one-shot MC-sampling approaches (OFA–A) enable fast, simultaneous $(\epsilon, \delta)$-approximate estimation of all feature-column Shapley values with rigorously quantifiable error (Miftachov et al., 2022, Li et al., 2024).

7. Summary Table: Algorithms and Implementations

| Method | Underlying $v(S)$ | Computational Complexity |
| --- | --- | --- |
| Exact enumeration | Any | $O(2^d)$ |
| Kernel SHAP | SHAP (prediction averaging) | MC sampling, $O(n \log n)$ |
| TreeSHAP | SHAP for tree ensembles | Polynomial in $d$, $n$ |
| Monte Carlo SAGE | Expected-loss reduction | MC sampling, $O(n \log n)$ |
| OFA–A (one-for-all) | Any (Beta-probabilistic value) | $O(n \log n)$ |
| Shapley curves | $E[Y \mid X = x]$ (population) | Nonparametric, $O(n^{1+\gamma})$ |

All sampling-based methods require specification of the payoff function and estimation of performance scores or losses on arbitrary feature subsets. Optimal implementation depends on the feature dimensionality, the nature of $v(S)$, and the relevance of valid uncertainty quantification.

In conclusion, the Shapley value provides a rigorously grounded but not universally optimal solution for feature column attribution and selection, with estimation, interpretive, and methodological nuances that must be matched carefully to the analytical objective and data structure (Fryer et al., 2021, Miftachov et al., 2022, Li et al., 2024).
