- The paper introduces EnsembleSHAP, a method that computes feature attributions for Random Subspace Method (RSM) ensembles, with formal guarantees against explanation-preserving attacks.
- It leverages Monte Carlo sampling with normalization to produce attribution scores that meet axiomatic properties like local accuracy, symmetry, and order consistency.
- Experimental results on text and image tasks demonstrate superior detection of adversarial manipulations compared to standard SHAP and LIME, with significant improvements in both speed and robustness.
EnsembleSHAP: Certifiably Robust Attribution for the Random Subspace Method
Introduction and Motivation
Attribution methods for ensemble predictors based on the Random Subspace Method (RSM, also known as attribute bagging) are foundational for explainability in high-stakes security and robust alignment applications. RSM exhibits strong empirical and certified robustness to adversarial and backdoor attacks, as well as recent alignment-breaking (e.g., LLM jailbreaking) attacks. However, the mainstay explanation techniques, notably SHAP and LIME, neither scale to ensembles constructed by RSM due to exponential query cost, nor provide theoretical defense guarantees under explanation-preserving attacks—i.e., input manipulations that cause misclassification while maintaining the pre-attack explanation profile.
EnsembleSHAP addresses these deficiencies by designing an attribution protocol, intrinsic to the RSM inference process, that is computationally efficient and is accompanied by formal guarantees of adversarial feature detectability under explanation-preserving attack models. This contribution addresses an under-explored explainability axis for certifiably robust classifiers and aligned LLMs.
Methodology
The core algorithmic principle of EnsembleSHAP is to define feature attribution for RSM ensembles via the marginal contribution of each feature across sampled subspaces, specifically characterizing the probability that feature i is present in a random subspace and that its inclusion leads to the prediction of label y for the input x:
$$\alpha_i(x,h,k) = \mathbb{E}_{z \sim U(x,k)}\big[\mathbb{I}(x_i \in z)\cdot \mathbb{I}(h(z)=y)\big]$$
where U(x,k) denotes the uniform distribution over k-sized subsets of the d input features. Because these subspace predictions are already computed during RSM inference, attribution is essentially a free computational byproduct. When the number of sampled subspaces N is small, features do not appear in equally many subspaces, so the Monte Carlo scores are normalized by each feature's appearance frequency.
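The estimator described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ensemble_shap_mc`, the representation of a subspace as a dict from feature index to value, and the specific normalization (replacing each feature's empirical appearance rate with its exact value k/d) are all choices made here for illustration. The summary does not spell out the normalization formula, so the one below is only one plausible reading.

```python
import random

def ensemble_shap_mc(x, base_model, label, k, n_samples, seed=0):
    """Monte Carlo estimate of alpha_i for every feature index i.

    x          : sequence of d input features
    base_model : callable taking a subspace {index: feature} and returning a label
                 (hypothetical interface, assumed for this sketch)
    label      : the label y being explained
    Returns (naive, normalized) lists of length d.
    """
    rng = random.Random(seed)
    d = len(x)
    appear = [0] * d  # number of sampled subspaces that contain feature i
    hits = [0] * d    # ... of which the base model predicted `label`
    for _ in range(n_samples):
        idx = rng.sample(range(d), k)        # uniform k-subset of feature indices
        z = {i: x[i] for i in idx}
        match = base_model(z) == label       # reuses the RSM base-model query
        for i in idx:
            appear[i] += 1
            hits[i] += match                 # bool counts as 0 or 1
    # Naive estimate: empirical mean of I(x_i in z) * I(h(z) = y).
    naive = [h / n_samples for h in hits]
    # Assumed normalization: exact inclusion probability k/d times the
    # conditional hit rate, so uneven appearance counts do not bias the ranking.
    normalized = [
        (k / d) * (hits[i] / appear[i]) if appear[i] else 0.0
        for i in range(d)
    ]
    return naive, normalized

# Toy usage: the base model predicts label 1 whenever feature 0 is visible.
tokens = ["terrible", "plot", "but", "nice", "music", "overall"]
naive, normalized = ensemble_shap_mc(
    tokens, lambda z: int(0 in z), label=1, k=2, n_samples=500
)
```

In the toy usage, feature 0 fully determines the base-model output, so it should receive the largest normalized score; the other features only score when they happen to share a subspace with it.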
Key design goals are:
- Computational Efficiency: The attribution scores are functions of the already available base model outputs on subsampled groups.
- Axiomatic Faithfulness: The method is proven to satisfy local accuracy (sum of attributions equals ensemble prediction probability for y), symmetry, and order-consistency (with Shapley value).
- Certified Robustness: Formal bounds (Theorem 1) guarantee that, for any attack modifying at most T features to flip the ensemble prediction, at least y0 of the adversarial features are present in the top-y1 features ranked by their attribution scores, where y2 is a function of the margin, subspace parameters, and feature cardinality.
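As a quick sanity check on the local accuracy property, the attribution defined in the Methodology section can be summed over all d features. This is a short sketch rather than the paper's proof; the symbol p_y, denoting the ensemble probability that a random subspace is classified as y, is notation introduced here.

```latex
\sum_{i=1}^{d} \alpha_i(x,h,k)
  = \mathbb{E}_{z \sim U(x,k)}\Big[\mathbb{I}(h(z)=y)\sum_{i=1}^{d}\mathbb{I}(x_i \in z)\Big]
  = \mathbb{E}_{z \sim U(x,k)}\big[k \cdot \mathbb{I}(h(z)=y)\big]
  = k \, p_y,
\qquad p_y := \Pr_{z \sim U(x,k)}\big[h(z)=y\big].
```

Every subspace contains exactly k features, so the inner sum is always k. Under the definition as written in this summary, the attributions therefore sum to k times p_y; for the sum to equal p_y exactly, the scores would need a 1/k scaling, which the formula as shown here does not display. Whether the original paper includes that factor is not confirmed by this summary.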
Theoretical Contributions
The authors derive, for the first time, provable lower bounds on the intersection size between adversarially manipulated features and the most important features flagged by the explanation, for arbitrary attack budgets. The proofs crucially exploit the combinatorial properties of subspace sampling and the effect of sparsity and overlap in feature groups on prediction flips. Notably, the guarantees hold for both text and image domains, and for both ℓ0-norm sparse attacks (e.g., word or patch substitutions) and alignment-breaking scenarios. The method relaxes the Shapley value dummy property in favor of direct order consistency, better reflecting the operational needs of feature ranking rather than absolute score calibration.
Experimental Validation
Empirical analysis is conducted on language and image classification tasks, including SST-2, IMDB, AGNews, CIFAR-10, ImageNette, and ImageNet-100, as well as LLM jailbreaking datasets. Targeted attack benchmarks comprise BadNets (backdoor), TextFooler (textual adversary), and several advanced LLM jailbreakers (DAN, AutoDAN, GCG, JAM, AIR).
EnsembleSHAP consistently outperforms standard and LLM-based attribution baselines in faithfulness and keyword/patch recovery, especially in the presence of attacks.
- In adversarial scenarios (IMDB/TextFooler), removing the top-10% words as determined by EnsembleSHAP flips the ensemble prediction with probability y3, whereas Shapley and LIME applied to the base model are suboptimal (flip rate y4).
- In backdoor settings, EnsembleSHAP achieves recall up to y5 on trigger word detection—substantially above all baselines.
- In image patch attacks, detection rates remain above y6 for y7 on ImageNette.
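The deletion-style faithfulness test in the first bullet (remove the top-ranked fraction of words, then check whether the prediction flips) can be sketched as follows. The helper names `top_fraction`, `prediction_flips`, and `flip_rate`, and the `predict` callable standing in for the ensemble, are hypothetical and introduced only for this illustration; the paper's exact evaluation protocol may differ.

```python
import math

def top_fraction(scores, frac):
    """Indices of the highest-scoring features, at least one, about frac of all."""
    n = max(1, math.ceil(frac * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

def prediction_flips(tokens, scores, predict, frac=0.10):
    """True if deleting the top-`frac` attributed tokens changes the prediction."""
    original = predict(tokens)
    drop = set(top_fraction(scores, frac))
    reduced = [t for i, t in enumerate(tokens) if i not in drop]
    return predict(reduced) != original

def flip_rate(examples, predict, frac=0.10):
    """Fraction of (tokens, scores) pairs whose prediction flips after deletion."""
    flips = [prediction_flips(t, s, predict, frac) for t, s in examples]
    return sum(flips) / len(flips)

# Toy usage with a stand-in classifier that keys on a single word.
toy_predict = lambda toks: int("awful" in toks)
tokens = ["an", "awful", "film", "with", "a", "few", "good", "scenes", "at", "most"]
good_scores = [0.0, 0.9, 0.1, 0.0, 0.0, 0.0, 0.2, 0.1, 0.0, 0.0]  # ranks "awful" first
bad_scores = [0.9, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # ranks "awful" last
rate = flip_rate([(tokens, good_scores), (tokens, bad_scores)], toy_predict)
```

A higher flip rate indicates that the attribution concentrates on the features the prediction actually depends on, which is the sense of faithfulness used in the comparison above.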
Computation is orders of magnitude faster than base-model SHAP or LIME, requiring only y8s per attribution instance (text or image), and scalability holds even for large feature cardinalities.
Certified Detection and Robustness Analysis
The paper's central claim is that EnsembleSHAP is the first feature attribution method for RSM with provable guarantees against explanation-preserving attacks. Under formalized adversarial models, the attribution is guaranteed to rank a lower-bounded number of manipulated features among its top features, with the bound parameterized by the attack budget y9, ensemble confidence, and subsample size x0. In this sense the method is certifiably robust: the explanation cannot be subverted to hide the adversarial features responsible for prediction changes, a property not shared by standard black-box or even white-box attribution methods.
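The quantity that the certified bound controls, namely how many manipulated features land among the top-ranked attributions, is simple to measure empirically. The sketch below only counts that overlap; it does not compute the theoretical lower bound, whose closed form is not reproduced in this summary. The function name `top_n_overlap` and its arguments are hypothetical.

```python
def top_n_overlap(scores, modified, top_n):
    """Count how many modified feature indices appear among the top_n scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:top_n])
    return len(flagged & set(modified))

# Toy usage: features 1 and 3 were altered by a (hypothetical) attack.
scores = [0.10, 0.90, 0.30, 0.80, 0.05]
detected = top_n_overlap(scores, modified={1, 3}, top_n=2)
```

Comparing this empirical count against a certified lower bound for the same attack budget would show how tight the guarantee is in practice.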
Practical and Theoretical Implications
The method directly advances the transparency and audit capability of robust ML systems, especially where explanations affect security postures, e.g., forensic analysis of jailbroken LLM responses or misclassification diagnostics in high-assurance workflows. Theoretically, this framework establishes how explanation guarantees can be made compatible with certified robustness protocols, and opens new research directions concerning explainability for privacy-preserving and unlearning settings, and the development of robust attribution methods for more general (non-RSM) learners.
Conclusion
EnsembleSHAP constitutes an attribution paradigm tailored to ensemble classifiers constructed via the Random Subspace Method. It combines computational efficiency, axiomatic explainability, and—most critically—certified robustness to adversarial explanation-preserving attacks. The work shifts the explainability landscape for robust and secure ML, demonstrating that faithful and certifiably robust feature attribution is achievable for ensemble-based defenses in both NLP and vision. Potential future research includes extension to privacy-respecting models and generalization to other ensemble architectures.