EnsembleSHAP: Faithful and Certifiably Robust Attribution for Random Subspace Method

Published 31 Mar 2026 in cs.CR | (2603.30034v1)

Abstract: The random subspace method has wide security applications, such as providing certified defenses against adversarial and backdoor attacks and building robustly aligned LLMs against jailbreaking attacks. However, the explanation of the random subspace method lacks sufficient exploration. Existing state-of-the-art feature attribution methods, such as Shapley value and LIME, are computationally impractical and lack security guarantees when applied to the random subspace method. In this work, we propose EnsembleSHAP, an intrinsically faithful and secure feature attribution for the random subspace method that reuses its computational byproducts. Specifically, our feature attribution method 1) is computationally efficient, 2) maintains essential properties of effective feature attribution (such as local accuracy), and 3) offers guaranteed protection against explanation-preserving attacks on feature attribution methods. To the best of our knowledge, this is the first work to establish provable robustness against explanation-preserving attacks. We also perform comprehensive evaluations of our explanation's effectiveness when faced with different empirical attacks, including backdoor attacks, adversarial attacks, and jailbreak attacks. The code is at https://github.com/Wang-Yanting/EnsembleSHAP. WARNING: This document may include content that could be considered harmful.

Authors (2)

Summary

The paper introduces EnsembleSHAP, a method computing feature attributions for RSM with formal guarantees against explanation-preserving attacks.
It leverages Monte Carlo sampling with normalization to produce attribution scores that meet axiomatic properties like local accuracy, symmetry, and order consistency.
Experimental results on text and image tasks demonstrate superior detection of adversarial manipulations compared to standard SHAP and LIME, with significant improvements in both speed and robustness.

EnsembleSHAP: Certifiably Robust Attribution for the Random Subspace Method

Introduction and Motivation

Attribution methods for ensemble predictors based on the Random Subspace Method (RSM, also known as attribute bagging) are foundational for explainability in high-stakes security and robust alignment applications. RSM exhibits strong empirical and certified robustness to adversarial and backdoor attacks, as well as recent alignment-breaking (e.g., LLM jailbreaking) attacks. However, the mainstay explanation techniques, notably SHAP and LIME, neither scale to ensembles constructed by RSM due to exponential query cost, nor provide theoretical defense guarantees under explanation-preserving attacks—i.e., input manipulations that cause misclassification while maintaining the pre-attack explanation profile.

EnsembleSHAP addresses these deficiencies by designing an attribution protocol, intrinsic to the RSM inference process, that is computationally efficient and is accompanied by formal guarantees of adversarial feature detectability under explanation-preserving attack models. This contribution addresses an under-explored explainability axis for certifiably robust classifiers and aligned LLMs.

Methodology

The core algorithmic principle of EnsembleSHAP is to define feature attribution for RSM ensembles via the marginal contribution of each feature across sampled subspaces, specifically characterizing the probability that feature $i$ is present in a random subspace and that its inclusion leads to the prediction of label $y$ for the input $x$:

$\alpha_i(x, h, k) = \mathbb{E}_{z \sim U(x,k)}[\mathbb{I}(x_i \in z) \cdot \mathbb{I}(h(z) = y)]$

where $U(x,k)$ denotes the uniform distribution over $k$-sized subsets of the $d$ input features. The method leverages the fact that these subspace predictions are already computed during RSM inference, so attribution is essentially a free computational byproduct. With a small number $N$ of Monte Carlo samples, features do not all appear in the same number of sampled subspaces; to correct for this, each score is normalized by the feature's appearance frequency.
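The sampling-and-normalization idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ensemble_attributions`, the toy classifier `toy_base_predict`, and the exact form of the frequency normalization (rescaling the per-feature hit ratio by the known inclusion probability $k/d$) are all assumptions made for this example.

```python
import random
from collections import Counter


def ensemble_attributions(x, base_predict, k, n_samples, seed=0):
    """Return (ensemble label, vote share, per-feature scores) from one sampling pass."""
    rng = random.Random(seed)
    d = len(x)
    appear = [0] * d   # number of sampled subspaces containing feature i
    hits = {}          # label -> per-feature count of "i in z and h(z) == label"
    votes = Counter()  # label -> number of subspaces voting for it

    for _ in range(n_samples):
        idx = sorted(rng.sample(range(d), k))       # one k-sized subspace z
        label = base_predict([x[i] for i in idx])   # h(z), needed for voting anyway
        votes[label] += 1
        per_label = hits.setdefault(label, [0] * d)
        for i in idx:
            appear[i] += 1
            per_label[i] += 1

    y = votes.most_common(1)[0][0]  # ensemble prediction by majority vote
    # Assumed normalization: P(x_i in z) = k/d exactly, so estimate
    # P(h(z) = y | x_i in z) by hits/appear and rescale by k/d.
    alpha = [
        (k / d) * hits[y][i] / appear[i] if appear[i] else 0.0
        for i in range(d)
    ]
    return y, votes[y] / n_samples, alpha


def toy_base_predict(tokens):
    """Hypothetical base classifier: label 1 iff the word 'bad' is visible."""
    return int("bad" in tokens)


x = ["the", "movie", "was", "bad", "overall", "today"]
y, p_y, alpha = ensemble_attributions(x, toy_base_predict, k=4, n_samples=2000)
top_feature = x[max(range(len(x)), key=lambda i: alpha[i])]
```

In this toy setting the base classifier is called only once per sampled subspace, and the same calls serve both the majority vote and the attribution scores; the word `bad` ends up as the top-ranked feature.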

Key design goals are:

Computational Efficiency: The attribution scores are functions of the already available base model outputs on subsampled groups.
Axiomatic Faithfulness: The method is proven to satisfy local accuracy (the attributions sum to the ensemble prediction probability for $y$), symmetry, and order consistency with the Shapley value.
Certified Robustness: Formal bounds (Theorem 1) guarantee that, for any attack modifying at most $T$ features to flip the ensemble prediction, at least a certified number of the adversarial features appear among the top-ranked features (for a given cutoff) by attribution score, where this certified number is a function of the margin, subspace parameters, and feature cardinality.
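As a quick consistency check on the local-accuracy property, one can sum the attribution formula from the Methodology section over all $d$ features. Every $z \sim U(x,k)$ contains exactly $k$ features, so the inner sum of indicators is always $k$. The derivation below is a worked sketch using only the formula as written in this summary; whether the paper's own definition already includes a $1/k$ factor is not stated here, so the rescaling in the last line is an assumption.

```latex
\begin{aligned}
\sum_{i=1}^{d} \alpha_i(x, h, k)
  &= \mathbb{E}_{z \sim U(x,k)}\Big[\mathbb{I}(h(z) = y) \sum_{i=1}^{d} \mathbb{I}(x_i \in z)\Big] \\
  &= k \cdot \Pr_{z \sim U(x,k)}\big[h(z) = y\big], \\
\text{hence}\quad
\frac{1}{k}\sum_{i=1}^{d} \alpha_i(x, h, k)
  &= \Pr_{z \sim U(x,k)}\big[h(z) = y\big].
\end{aligned}
```

Under this reading, the attributions sum to the ensemble prediction probability for $y$ once they are divided by the subspace size $k$.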

Theoretical Contributions

The authors derive, for the first time, provable lower bounds on the intersection size between adversarially manipulated features and the most important features flagged by the explanation, for arbitrary attack budgets. The proofs crucially exploit the combinatorial properties of subspace sampling and the effect of sparsity and overlap in feature groups on prediction flips. Notably, the guarantees hold for both text and image domains, and for both $\ell_0$-norm sparse attacks (e.g., word or patch substitutions) and alignment-breaking scenarios. The method relaxes the Shapley value dummy property in favor of direct order consistency, better reflecting the operational needs of feature ranking rather than absolute score calibration.
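The quantity being bounded, the overlap between the manipulated features and the top-ranked features, is simple to compute empirically. The helper below is a hypothetical sketch: the name `top_e_overlap`, the cutoff parameter `e`, and the representation of modified features as a set of indices are assumptions for illustration, and the certified lower bound itself is not reproduced here.

```python
def top_e_overlap(alpha, modified, e):
    """Count how many modified feature indices fall in the top-e by attribution score."""
    # Rank feature indices from highest to lowest attribution score.
    ranked = sorted(range(len(alpha)), key=lambda i: alpha[i], reverse=True)
    return len(set(ranked[:e]) & set(modified))


# Hypothetical scores for four features; features 1 and 3 were modified by the attacker.
scores = [0.1, 0.7, 0.3, 0.6]
overlap_all = top_e_overlap(scores, {1, 3}, e=2)      # both modified features are in the top 2
overlap_partial = top_e_overlap(scores, {0, 1}, e=2)  # only feature 1 is in the top 2
```

A certified guarantee of the kind described above would state a lower bound on this count that holds for every admissible attack, rather than for one observed example.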

Experimental Validation

Empirical analysis is conducted on language and image classification tasks, including SST-2, IMDB, AGNews, CIFAR-10, ImageNette, and ImageNet-100, as well as LLM jailbreaking datasets. Targeted attack benchmarks comprise BadNets (backdoor), TextFooler (textual adversary), and several advanced LLM jailbreakers (DAN, AutoDAN, GCG, JAM, AIR).

EnsembleSHAP consistently outperforms standard and LLM-based attribution baselines in faithfulness and in keyword and patch recovery, especially in the presence of attacks.

In adversarial scenarios (IMDB/TextFooler), removing the top-10% of words as ranked by EnsembleSHAP flips the ensemble prediction more often than removing the words selected by Shapley or LIME applied to the base model.
In backdoor settings, EnsembleSHAP achieves trigger-word recall substantially above all baselines.
In image patch attacks, detection rates remain high on ImageNette.
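The removal-based faithfulness test mentioned for the adversarial text setting can be sketched as follows. This is an illustrative assumption about the protocol, not the paper's evaluation code: the function name `flips_after_removal`, the choice to delete words outright (rather than mask them), and the toy ensemble predictor are all hypothetical.

```python
import math


def flips_after_removal(tokens, alpha, ensemble_predict, fraction=0.10):
    """Return True if deleting the top-`fraction` scored tokens changes the prediction."""
    n_remove = max(1, math.ceil(fraction * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: alpha[i], reverse=True)
    drop = set(ranked[:n_remove])
    kept = [t for i, t in enumerate(tokens) if i not in drop]
    return ensemble_predict(kept) != ensemble_predict(tokens)


def toy_ensemble_predict(tokens):
    """Hypothetical stand-in for the ensemble: label 1 iff 'bad' is present."""
    return int("bad" in tokens)


words = ["the", "plot", "of", "this", "film", "was", "bad", "in", "my", "view"]
good_scores = [0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 0.9, 0.1, 0.1, 0.2]  # ranks 'bad' first
poor_scores = [0.9, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.2]  # ranks 'the' first
flipped_good = flips_after_removal(words, good_scores, toy_ensemble_predict)
flipped_poor = flips_after_removal(words, poor_scores, toy_ensemble_predict)
```

Averaging this boolean over a test set gives a flip rate, so an attribution method that ranks the decisive words first yields a higher rate.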

Computation is orders of magnitude faster than base-model SHAP or LIME per attribution instance (text or image), and scalability holds even for large feature cardinalities.

Certified Detection and Robustness Analysis

The paper claims that EnsembleSHAP is the first feature attribution method for RSM with provable guarantees against explanation-preserving attacks. Under formalized adversarial models, the attribution highlights manipulated features with lower-bound detectability, parameterized by the attack budget $T$, ensemble confidence, and subsample size $k$. This makes the method certifiably robust: the explanation cannot be subverted to hide the adversarial features responsible for prediction changes, a property not shared by standard black-box or even white-box attribution methods.

Practical and Theoretical Implications

The method directly advances the transparency and audit capability of robust ML systems, especially where explanations affect security postures, e.g., forensic analysis of jailbroken LLM responses or misclassification diagnostics in high-assurance workflows. Theoretically, this framework establishes how explanation guarantees can be made compatible with certified robustness protocols, and opens new research directions concerning explainability for privacy-preserving and unlearning settings, and the development of robust attribution methods for more general (non-RSM) learners.

Conclusion

EnsembleSHAP constitutes an attribution paradigm tailored to ensemble classifiers constructed via the Random Subspace Method. It combines computational efficiency, axiomatic explainability, and—most critically—certified robustness to adversarial explanation-preserving attacks. The work shifts the explainability landscape for robust and secure ML, demonstrating that faithful and certifiably robust feature attribution is achievable for ensemble-based defenses in both NLP and vision. Potential future research includes extension to privacy-respecting models and generalization to other ensemble architectures.
