Permutation SHAP: Sampling-Based Attribution

Updated 13 May 2026

Permutation SHAP is a family of sampling-based methods that approximate Shapley value attributions by explicitly sampling permutations, with accompanying theoretical guarantees.
It employs antithetic sampling, design-of-experiments (DOE) schemes, and quasi-Monte Carlo techniques to reduce estimator variance and improve efficiency.
The approach supports sound global feature selection and sequential attribution, extending model-agnostic interpretability to complex settings.

Permutation sampling, known in the SHAP literature as Permutation SHAP or PermutationSHAP, refers to a family of sampling-based approximations and theoretical refinements of Shapley value feature importance estimators. All of them rely on explicit sampling from permutations, either over features or over data entries. The technique underlies both theoretical guarantees about global feature importance and practical gains in estimation efficiency for model-agnostic, black-box attribution frameworks.

1. Mathematical Formulation and Soundness

The basis of Permutation SHAP is the permutation-based definition of the Shapley value, which assigns to feature $i$ the mean marginal contribution of including $i$ across all possible orderings of input features:

$\phi_i = \frac{1}{d!} \sum_{\pi \in S_d} \bigl[ f(x_{S_{\pi,i} \cup \{i\}}) - f(x_{S_{\pi,i}}) \bigr],$

where $S_d$ is the set of all permutations of the $d$ features, $S_{\pi,i}$ is the set of features preceding $i$ in permutation $\pi$, and $f(x_S)$ is the model's value when only the features in $S$ are known (Yang et al., 2023, Mitchell et al., 2021).
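For a handful of features, the permutation definition can be evaluated directly by enumerating every ordering. The following is a minimal sketch under stated assumptions: `exact_shapley` and `toy_value` are illustrative names, and `value(S)` is a stand-in for $f(x_S)$, taking a set of feature indices and returning a number.

```python
from itertools import permutations
from math import factorial


def exact_shapley(value, d):
    """Average each feature's marginal contribution over all d! orderings.

    `value` maps a frozenset of feature indices S to a number, standing in
    for f(x_S) in the permutation formula.
    """
    phi = [0.0] * d
    for order in permutations(range(d)):
        included = set()
        prev = value(frozenset(included))
        for i in order:
            included.add(i)
            cur = value(frozenset(included))
            phi[i] += cur - prev  # marginal contribution of i given its predecessors
            prev = cur
    return [total / factorial(d) for total in phi]


# Toy value function (illustrative): two main effects plus one interaction.
def toy_value(S):
    return 1.0 * (0 in S) + 2.0 * (1 in S) + 4.0 * (0 in S and 2 in S)


phi = exact_shapley(toy_value, 3)
print(phi)  # [3.0, 2.0, 2.0]
```

The interaction term of 4 is split equally between features 0 and 2, and the attributions sum to the value of the full feature set. The cost grows as $d!$, which is why the sampling schemes discussed in this article matter in practice.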

In classical SHAP practice, feature contributions are aggregated over observed data, that is, over $x \sim \mu$, where $\mu$ is the joint data distribution. However, this can produce unsound global feature importance: aggregate SHAP values can be small for features that the function genuinely depends on, because computing SHAP values involves evaluating the function at points lying outside the data manifold (Bhattacharjee et al., 29 Mar 2025).

Permutation SHAP corrects this by using the extended distribution $\mu^* = \mu_1 \times \dots \times \mu_d$, obtained as the product of the feature marginals. By independently permuting each column of the data matrix, one samples from $\mu^*$. Aggregating absolute SHAP values over $\mu^*$ yields the extended-support aggregate SHAP:

$\bar{\phi}_i^{*} = \mathbb{E}_{x \sim \mu^*} \bigl[ |\phi_i(x)| \bigr].$

Theoretical results show that:

$f$ is independent of $x_i$ on $\operatorname{supp}(\mu^*)$ if and only if $\phi_i(x) = 0$ for all $x \in \operatorname{supp}(\mu^*)$ (exact soundness).
If $\bar{\phi}_i^{*} \le \epsilon$, then $f$ can be approximated (in a norm taken with respect to $\mu^*$) by a function $g$ independent of $x_i$, with error controlled by $\epsilon$.

These results ensure that a small aggregate Permutation SHAP value justifies safe feature elimination, a guarantee that classical SHAP aggregated over the observed data does not possess (Bhattacharjee et al., 29 Mar 2025).

2. Permutation Sampling Algorithms

Permutation SHAP estimation consists of two principal sampling operations, depending on the attribution goal:

A. Extended-Support Aggregation (Permutation over Data):

Given a data matrix $X \in \mathbb{R}^{n \times d}$, independently permute each column, forming $\tilde{X}$.
Compute SHAP (e.g., KernelSHAP) on $\tilde{X}$, aggregating absolute values per feature.
The result approximates the extended-support aggregate SHAP and provides sound global feature importance (Bhattacharjee et al., 29 Mar 2025).
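The three steps above can be sketched in plain Python. This is a rough illustration, not the reference implementation: `permute_columns`, `aggregate_abs`, `extended_support_importance` and `linear_explainer` are hypothetical names, and `explain_fn` is a placeholder for whatever per-row SHAP routine is in use (KernelSHAP, for instance). The stand-in explainer relies on the standard result that a linear model with independent features has SHAP values $w_j (x_j - \mathbb{E}[x_j])$.

```python
import random


def permute_columns(X, rng):
    """Shuffle every column of a row-major matrix independently."""
    n, d = len(X), len(X[0])
    columns = []
    for j in range(d):
        col = [row[j] for row in X]
        rng.shuffle(col)
        columns.append(col)
    # Transpose back to a list of rows.
    return [[columns[j][r] for j in range(d)] for r in range(n)]


def aggregate_abs(attributions):
    """Mean absolute attribution per feature over all rows."""
    n, d = len(attributions), len(attributions[0])
    return [sum(abs(row[j]) for row in attributions) / n for j in range(d)]


def extended_support_importance(X, explain_fn, seed=0):
    """Permute columns, explain the scrambled rows, aggregate |SHAP| per feature."""
    X_ext = permute_columns(X, random.Random(seed))
    return aggregate_abs(explain_fn(X_ext))


def linear_explainer(weights):
    """Stand-in explainer (assumption): SHAP values of a linear model."""
    def explain(X):
        n = len(X)
        means = [sum(row[j] for row in X) / n for j in range(len(weights))]
        return [[w * (row[j] - means[j]) for j, w in enumerate(weights)] for row in X]
    return explain


# Two perfectly correlated columns; the toy model only uses the first one.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
importance = extended_support_importance(X, linear_explainer([2.0, 0.0]))
print(importance)  # [2.0, 0.0]
```

Shuffling each column separately keeps every marginal intact while breaking the dependence between columns, which is what sampling from the product of marginals requires.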

B. PermutationSHAP for Exact/Approximate Shapley Values (Permutation over Features):

Draw $m$ random permutations of the feature set.
For each permutation $\pi$ and each feature $i$, compute the marginal contribution $f(x_{S_{\pi,i} \cup \{i\}}) - f(x_{S_{\pi,i}})$.
Average the marginal contributions over permutations to estimate $\phi_i$ (Mayer et al., 18 Aug 2025, Yang et al., 2023, Mitchell et al., 2021).
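A plain Monte Carlo version of this procedure might look as follows. It is a sketch under assumptions: `sampled_shapley`, `toy_value` and `WEIGHTS` are illustrative names, and `value(S)` again stands in for the model evaluated on the feature subset $S$.

```python
import random


def sampled_shapley(value, d, n_perms, seed=0):
    """Monte Carlo estimate from `n_perms` uniformly random feature orderings."""
    rng = random.Random(seed)
    phi = [0.0] * d
    order = list(range(d))
    empty_value = value(frozenset())  # shared by every ordering, so evaluate once
    for _ in range(n_perms):
        rng.shuffle(order)
        included = set()
        prev = empty_value
        for i in order:
            included.add(i)
            cur = value(frozenset(included))
            phi[i] += cur - prev  # marginal contribution of i in this ordering
            prev = cur
    return [total / n_perms for total in phi]


# Toy value function (illustrative): squared sum of the weights of the features in S.
WEIGHTS = [1.0, 2.0, 3.0, 4.0]


def toy_value(S):
    return sum(WEIGHTS[i] for i in S) ** 2


estimate = sampled_shapley(toy_value, d=4, n_perms=200)
print(estimate)       # close to the exact values [10, 20, 30, 40]
print(sum(estimate))  # equals toy_value of the full set, up to rounding
```

Each ordering costs $d$ new evaluations, and the marginal contributions along one ordering telescope, so the estimates always sum to the value of the full set minus the value of the empty set regardless of how many permutations are drawn.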

Advanced variants employ antithetic or paired sampling (using reverse permutations) for variance reduction and exact recovery in certain models (bilinear interactions, additive decompositions) (Mayer et al., 18 Aug 2025).
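The exactness claim for pairwise interactions can be checked on a toy game. With main effects $a_i$ and pairwise terms $b_{ij}$, feature $i$ collects $a_i$ plus the $b_{ij}$ of its predecessors in one ordering and of its successors in the reversed ordering, so the average of the two is $a_i + \tfrac{1}{2}\sum_{j \ne i} b_{ij}$, which is the Shapley value. The sketch below uses illustrative names (`paired_shapley`, `marginals_along`, `MAIN`, `PAIR`) and is not taken from the cited work.

```python
import random


def marginals_along(value, order, d):
    """Marginal contribution of each feature when added in the given order."""
    contrib = [0.0] * d
    included = set()
    prev = value(frozenset())
    for i in order:
        included.add(i)
        cur = value(frozenset(included))
        contrib[i] = cur - prev
        prev = cur
    return contrib


def paired_shapley(value, d, n_pairs, seed=0):
    """Antithetic estimate: each random ordering is paired with its reversal."""
    rng = random.Random(seed)
    phi = [0.0] * d
    order = list(range(d))
    for _ in range(n_pairs):
        rng.shuffle(order)
        forward = marginals_along(value, order, d)
        backward = marginals_along(value, order[::-1], d)
        for i in range(d):
            phi[i] += 0.5 * (forward[i] + backward[i])
    return [total / n_pairs for total in phi]


# Toy game (illustrative): main effects plus pairwise interactions only.
MAIN = [1.0, 2.0, 3.0, 4.0]
PAIR = {(0, 1): 2.0, (1, 3): -4.0, (2, 3): 6.0}


def toy_value(S):
    total = sum(MAIN[i] for i in S)
    total += sum(b for (i, j), b in PAIR.items() if i in S and j in S)
    return total


phi = paired_shapley(toy_value, d=4, n_pairs=1)
print(phi)  # [2.0, 1.0, 6.0, 5.0]
```

A single pair already returns the exact attributions here, whichever ordering is drawn. For value functions with higher-order interactions the pairing no longer gives exact recovery and acts as a variance reduction device.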

Fractional factorial (DOE) methods such as Component Orthogonal Arrays (COA) or Latin Squares achieve structured, balanced coverage over permutation space, resulting in unbiased, lower-variance estimators compared to simple Monte Carlo sampling (Yang et al., 2023).
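One of the simplest balanced designs is a cyclic Latin square: the $d$ cyclic shifts of a base ordering place every feature in every position exactly once. The sketch below is an assumption-laden illustration of that idea, not the COA construction from the cited work; `cyclic_latin_square`, `design_shapley` and `size_game` are hypothetical names.

```python
import random


def cyclic_latin_square(d, seed=0):
    """Return d orderings in which every feature occupies every position once."""
    rng = random.Random(seed)
    base = list(range(d))
    rng.shuffle(base)
    return [base[shift:] + base[:shift] for shift in range(d)]


def design_shapley(value, design):
    """Average marginal contributions over the orderings in `design`."""
    d = len(design[0])
    phi = [0.0] * d
    for order in design:
        included = set()
        prev = value(frozenset(included))
        for i in order:
            included.add(i)
            cur = value(frozenset(included))
            phi[i] += cur - prev
            prev = cur
    return [total / len(design) for total in phi]


# Symmetric toy game (illustrative): the value depends only on coalition size.
def size_game(S):
    return float(len(S) ** 2)


design = cyclic_latin_square(5)
phi = design_shapley(size_game, design)
print(phi)  # [5.0, 5.0, 5.0, 5.0, 5.0]
```

Because a feature's marginal contribution in this game depends only on its position, seeing every position once reproduces the exact Shapley value from just $d$ orderings. For general games the design stays position-balanced but the estimate is only approximate.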

3. Variance Reduction and Quasi-Monte Carlo Techniques

Exploiting structure in the space of permutations leads to markedly faster convergence and lower estimator variance:

Paired sampling: Pair each permutation $\pi$ with its reversal. For bilinear or additive value functions, a single paired sample recovers $\phi_i$ exactly. For general functions, variance is halved compared to unpaired sampling (Mayer et al., 18 Aug 2025).
Order-of-addition designs (DOE): Structured designs (COA, Latin square) ensure each feature occurs equally often in each position, which can reduce variance by factors of 2–5, and sometimes to zero for symmetric or position-only games (Yang et al., 2023).
Kernel quadrature and herding: Functions of permutations can be embedded in a Mallows-RKHS; kernel herding and sequential Bayesian quadrature provide residual error bounds significantly outperforming IID Monte Carlo for small to moderate dimensions (Mitchell et al., 2021).
Sobol and orthogonal spherical codes: By mapping permutations to equally spaced points on a hypersphere and leveraging low-discrepancy sets, quasi-Monte Carlo approximations ensure uniform coverage and rapid error decay for high-dimensional regimes (Mitchell et al., 2021).

These methods yield more precise estimates per model evaluation and are more robust to the variance spikes of plain Monte Carlo over random permutations.

4. Theoretical Properties and Algebraic Structure

Permutation SHAP admits a rigorous operator-theoretic characterization:

The SHAP operator $\Phi_i$, acting on the space of measurable functions over the input domain, satisfies $\Phi_i f = 0$ if and only if $f$ is independent of $x_i$ on the support of $\mu^*$ (Bhattacharjee et al., 29 Mar 2025).
The algebra generated by the value operators, termed the Shapley Lie algebra, is solvable and can be triangularized. This leads to explicit invertibility and approximation arguments underlying the soundness of Permutation SHAP (Bhattacharjee et al., 29 Mar 2025).

Robustness bounds establish that if the aggregate Permutation SHAP value of a feature is small, there exists a surrogate for $f$ that does not depend on that feature, with approximation error controlled by the size of the aggregate value.

5. Extensions to Sequential and Non-i.i.d. Settings

Permutation SHAP has been adapted for sequential or position-sensitive models (e.g., natural language, time series) via algorithms such as OrdShap (Hill et al., 16 Jul 2025):

OrdShap introduces a matrix of attributions, with one entry per feature $i$ and position $k$, that captures both value and position effects by averaging marginal contributions over all subsets and permutations conditioned on feature $i$ occupying position $k$.
The average of a feature's entries over positions recovers traditional value importance, while a linear fit over positions yields position importance.
The OrdShap permutation sampling scheme draws random subsets and random position assignments, using matching and masking to estimate the contribution of feature value and order.

This extension enables attribution methods to distinguish between the informativeness of feature values and their locations, an axis conflated in classical SHAP or permutation averaging approaches.
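The two summaries described above, an average over positions and a linear fit over positions, can be illustrated on a precomputed matrix of per-position attributions. This sketch only covers that final summarisation step, not the sampling scheme. It assumes zero-based position indices and an ordinary least-squares slope, and the names `summarize_position_matrix` and `gamma` are hypothetical; the exact estimators in OrdShap may differ in detail.

```python
def summarize_position_matrix(gamma):
    """Return (value_importance, position_importance) per feature.

    `gamma[i][k]` is assumed to hold the attribution of feature i when it
    occupies position k. Value importance is the row mean; position
    importance is the least-squares slope of the row against k.
    """
    value_imp, position_imp = [], []
    for row in gamma:
        n = len(row)
        positions = range(n)
        mean_pos = sum(positions) / n
        mean_val = sum(row) / n
        sxx = sum((k - mean_pos) ** 2 for k in positions)
        sxy = sum((k - mean_pos) * (v - mean_val) for k, v in zip(positions, row))
        value_imp.append(mean_val)
        position_imp.append(sxy / sxx if sxx else 0.0)
    return value_imp, position_imp


# Illustrative matrix: a position-insensitive feature, one whose attribution
# grows with position, and one whose attribution shrinks with position.
gamma = [
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 2.0],
    [3.0, 2.0, 1.0],
]
value_imp, position_imp = summarize_position_matrix(gamma)
print(value_imp)     # [1.0, 1.0, 2.0]
print(position_imp)  # [0.0, 1.0, -1.0]
```

The first two features have the same value importance but different position importance, which is the distinction a single averaged attribution would hide.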

6. Practical Implementation and Guidelines

Empirical studies and theoretical analyses yield clear practical recommendations:

For global feature selection and interpretability, permute each feature (column-wise) independently across data rows and apply KernelSHAP or a similar method to the scrambled matrix to obtain a robust aggregate SHAP. Retain features whose extended-support aggregate SHAP value is significantly greater than zero (Bhattacharjee et al., 29 Mar 2025).
For sample complexity, the cost and convergence of Permutation SHAP match those of ordinary KernelSHAP (exponential in $d$ in the worst case), but with stronger guarantees on interpretability and invariance.
For high-dimensional problems or when model evaluation is expensive, leverage COA or Latin Square DOE schemes to maximize the information gained per permutation and reduce estimator variance (Yang et al., 2023).
In applications to sequential data, employ OrdShap to attribute both value and positional importance, disentangling effects that standard Permutation SHAP conflates (Hill et al., 16 Jul 2025).

7. Summary Table of Main Permutation SHAP Methods

Method Class	Key Feature	Main Reference
Permuted data matrix	Column-wise permutation (μ*)	(Bhattacharjee et al., 29 Mar 2025)
Monte Carlo (feature perm.)	Random/unpaired, antithetic, paired	(Mayer et al., 18 Aug 2025, Mitchell et al., 2021)
DOE (COA/Latin Square)	Structured permutation sampling	(Yang et al., 2023)
Kernel Herding/SBQ	Mallows RKHS quadrature	(Mitchell et al., 2021)
OrdShap	Value vs position in sequences	(Hill et al., 16 Jul 2025)

Permutation SHAP thus underpins both recent theoretical advances in global attribution soundness and a suite of principled, flexible estimation strategies that address the computational and statistical challenges inherent to black-box feature importance in modern machine learning.