Manipulation Explanation (ME) in XAI

Updated 11 October 2025
  • Manipulation Explanation (ME) is an adversarial tactic affecting post hoc XAI methods by using minimal input perturbations to force explanation outputs to align with attacker-defined targets.
  • It exploits vulnerabilities in methods like SHAP, LIME, and Integrated Gradients, altering feature rankings without changing the model's prediction.
  • ME poses serious risks in domains such as cybersecurity and finance, necessitating robust defenses like adversarial retraining and ensemble strategies.

Manipulation Explanation (ME) refers to a class of adversarial attacks and vulnerabilities in explainable machine learning where, through carefully constructed input or model perturbations, the output of an explanation method can be forced to align with an attacker’s arbitrary or predetermined target—without changing the model’s output. This phenomenon calls into question the reliability, transparency, and trustworthiness of post hoc explanation techniques, particularly in high-stakes domains such as cybersecurity, finance, and healthcare, where human stakeholders rely on explanation outputs to understand, validate, or contest model behavior.

1. Definition and Formal Mechanism

Manipulation Explanation (ME) is defined as an adversarial tactic affecting post hoc XAI methods such as SHAP, LIME, and Integrated Gradients (IG). The objective is to induce the explanation module to produce a manipulated, attacker-selected output, regardless of the actual reasoning used by the underlying model.

This is typically accomplished by applying subtle, often imperceptible, perturbations to the input, $\mathbf{x} \to \mathbf{x}'$, such that:

  • The model prediction is preserved: $f(\mathbf{x}) \approx f(\mathbf{x}')$
  • The resulting explanation $g(\mathbf{x}', f)$ deviates significantly from the original explanation $g(\mathbf{x}, f)$ and, crucially, is close to some attacker-defined target $g_\text{adv}$

This is formalized as:

$$\text{Dist}(g', g) > \delta \quad \text{and} \quad \text{Dist}(g', g_\text{adv}) < \epsilon$$

where $g = g(\mathbf{x}, f)$, $g' = g(\mathbf{x}', f)$, $\delta$ ensures that the genuine explanation is displaced, and $\epsilon$ constrains the manipulated explanation to the desired, attacker-chosen target (Mia et al., 4 Oct 2025). The typical attack flow involves two steps (a minimal sketch follows the list):

  • Generating a minimal adversarial perturbation to maximize divergence from the genuine explanation
  • Ensuring prediction invariance such that the model’s output, and hence its operational reliability, remains intact
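
This objective can be instantiated as a simple gradient-based search. The sketch below assumes a differentiable PyTorch model `model`, an explanation function `explain` that is itself differentiable with respect to its input (e.g., plain input gradients computed with `create_graph=True`), and an attacker target `g_adv`; all names, loss weights, and the perturbation budget are illustrative assumptions, not details from the cited work.

```python
import torch

def manipulation_attack(model, explain, x, g_adv, steps=200, lr=1e-2,
                        eps=0.05, lam=10.0):
    """Sketch of a targeted explanation-manipulation (ME) attack.

    Pushes the explanation of the perturbed input toward an attacker
    target g_adv while penalizing any change in the model's prediction.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    y_orig = model(x).detach()                     # prediction to preserve
    opt = torch.optim.Adam([x_adv], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        g = explain(model, x_adv)                  # must be differentiable w.r.t. x_adv
        loss_expl = torch.norm(g - g_adv)          # drive Dist(g', g_adv) below epsilon
        loss_pred = torch.norm(model(x_adv) - y_orig)  # keep f(x') close to f(x)
        (loss_expl + lam * loss_pred).backward()
        opt.step()
        with torch.no_grad():                      # keep the perturbation imperceptible
            x_adv.data = torch.max(torch.min(x_adv, x + eps), x - eps)
    return x_adv.detach()
```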

This is distinct from “fairwashing” (where the aim is to minimize the presence or attribution of sensitive features) or “backdoor” attacks (which trigger explanation manipulation only in the presence of special input patterns).

2. Vulnerability of Post-hoc Explanation Methods

ME attacks leverage the underlying assumptions of local surrogate explainers and gradient-based attribution methods. For LIME/SHAP:

  • These methods construct surrogate models or compute attributions by perturbing input features or generating neighborhood samples
  • An adversary can exploit this by ensuring the model (or a “scaffolding” model) is fair or benign only on perturbed (possibly out-of-distribution) samples, but remains biased elsewhere (Slack et al., 2021)
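
The scaffolding idea can be sketched as a routing wrapper around two models. The snippet below is a minimal illustration in the spirit of Slack et al.; the `is_ood` detector, the model interfaces, and the class name are assumptions, not the authors' released code.

```python
import numpy as np

class ScaffoldedModel:
    """Sketch of a 'scaffolding' model that fools surrogate explainers.

    LIME and SHAP query the model on perturbed/neighborhood samples that are
    often off the data manifold. The scaffold routes those out-of-distribution
    queries to an innocuous model, while real inputs are handled by the biased
    model, so the surrogate explanation looks benign.
    """

    def __init__(self, biased_model, innocuous_model, is_ood):
        self.biased_model = biased_model        # what actually runs in production
        self.innocuous_model = innocuous_model  # e.g., ignores the sensitive feature
        self.is_ood = is_ood                    # detector for explainer-generated samples

    def predict(self, X):
        X = np.asarray(X)
        ood_mask = np.asarray(self.is_ood(X), dtype=bool)
        out = np.empty(len(X))
        if ood_mask.any():
            out[ood_mask] = self.innocuous_model.predict(X[ood_mask])
        if (~ood_mask).any():
            out[~ood_mask] = self.biased_model.predict(X[~ood_mask])
        return out
```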

For IG and similar gradient-based methods:

  • The sensitivity of the explanation to small perturbations is governed by the local geometry (the Hessian or Lipschitz constant of the gradient map)
  • Even vanishingly small, targeted changes can cause dramatic shifts in attribution maps while preserving output (Dombrowski et al., 2019)
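
This geometric argument can be made concrete with a first-order expansion; the estimate below is a standard Taylor-type bound consistent with the analysis in Dombrowski et al. (2019), written in the notation of Section 1 rather than quoted from the paper:

$$f(\mathbf{x}+\boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^{\top}\boldsymbol{\delta} \approx f(\mathbf{x}), \qquad \|\nabla f(\mathbf{x}+\boldsymbol{\delta}) - \nabla f(\mathbf{x})\| \lesssim \|H(\mathbf{x})\|\,\|\boldsymbol{\delta}\|$$

Choosing $\boldsymbol{\delta}$ approximately orthogonal to $\nabla f(\mathbf{x})$ keeps the prediction essentially fixed, while a large local Hessian norm $\|H(\mathbf{x})\|$ permits a large shift in the gradient-based attribution map.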

In summary, many commonly used post hoc XAI tools (SHAP, LIME, IG) can be forced, via either model adjustments or input perturbation, to yield explanations that conform to arbitrary, attacker-defined outputs without modifying real model performance (Mia et al., 4 Oct 2025, Viering et al., 2019).

3. Implications for Trust and Transparency

The central risk posed by Manipulation Explanation is the erosion of trust and transparency. In scenarios such as:

  • Malware or phishing detection: a manipulated explanation may highlight benign instead of malicious features, misleading forensic analysts (Mia et al., 4 Oct 2025)
  • Intrusion/fraud detection: manipulated attributions obscure the operational signal, making it more difficult for analysts to diagnose or validate alerts
  • Auditing for discrimination: ME attacks can hide evidence of bias by ensuring that sensitive features are downplayed or omitted in the produced explanations (Slack et al., 2021)

Empirically, even relatively minor changes in feature ranking (Spearman's $\rho \approx 0.92$ between original and manipulated rankings) can critically impair interpretability, especially where stakeholders place unconditional trust in the explanation (Mia et al., 4 Oct 2025). In environments where explanations are required for regulatory compliance or risk assessment, this represents a critical operational vulnerability.
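
For concreteness, the rank-correlation comparison reported in such studies can be reproduced on attribution vectors as follows; the numbers and the use of `scipy.stats.spearmanr` are illustrative, not the exact tooling or data of the cited work.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical attribution scores for the same five features,
# before and after an explanation-manipulation attack.
original_attr    = np.array([0.41, 0.33, 0.12, 0.08, 0.06])
manipulated_attr = np.array([0.35, 0.38, 0.07, 0.13, 0.07])

rho, _ = spearmanr(original_attr, manipulated_attr)
print(f"Spearman's rho = {rho:.3f}")
# A high rho can still hide a decisive swap of the top two features,
# which is what an analyst reading only the leading attribution acts on.
```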

4. Experimental Findings

Studies on real-world cybersecurity datasets (phishing, malware, intrusion detection, e-commerce fraud) demonstrate:

  • In data poisoning attacks, ME can alter SHAP/LIME/IG explanations such that top-ranked feature attributions are shifted, even if only among adjacent features. For example, in malware detection, manipulated SHAP maps showed rank correlations of $0.9227$ with the originals, indicating that rankings were largely preserved yet measurably reordered (Mia et al., 4 Oct 2025).
  • In black-box attack scenarios, the measured Attack Success Rate (ASR) remains low in some settings (e.g., 8%), with rank correlations as high as $0.975$; the manipulation is subtle, but still meaningful wherever analysts act on exact feature rankings.
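
How such an Attack Success Rate might be scored follows directly from the formalization in Section 1: a sample counts as a success only if the manipulated explanation lands within $\epsilon$ of the attacker target while the prediction is preserved. The helper below is a sketch under that definition; the thresholds, the `model.predict` interface, and the `explain` function are assumptions.

```python
import numpy as np

def attack_success_rate(model, explain, X, X_adv, targets, eps=0.1, pred_tol=1e-3):
    """Fraction of samples satisfying both ME conditions:
    Dist(g', g_adv) < eps  and  f(x) ~ f(x')."""
    successes = 0
    for x, x_adv, g_adv in zip(X, X_adv, targets):
        pred_same = abs(model.predict([x])[0] - model.predict([x_adv])[0]) < pred_tol
        g_prime = np.asarray(explain(model, x_adv))
        on_target = np.linalg.norm(g_prime - np.asarray(g_adv)) < eps
        successes += int(pred_same and on_target)
    return successes / len(X)
```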

Compared to other explanation manipulation tactics (such as fairwashing or tailored backdoors), ME attacks in these studies achieved more modest numeric changes, but any deviation can be problematic if relied upon operationally.

5. Cybersecurity and Domain-Specific Consequences

Manipulation Explanation attacks present notable risks in applied settings:

  • Forensic validation in malware/phishing/intrusion use cases can be rendered unreliable if manipulated explanations downplay relevant malicious patterns or highlight irrelevant signals (Mia et al., 4 Oct 2025).
  • Because XAI is increasingly used as a human-aligned interface to complex detection models, an attacker capable of ME can direct analysts' attention away from genuine threats or toward “cover stories,” thereby undermining both system security and operational trust.

Such attacks are difficult to notice precisely because they are subtle: the manipulated explanations appear plausible and the core prediction remains unaltered, so anomalous explanations are hard to detect without secondary validation mechanisms.

6. Defensive Strategies and Future Directions

Research recommends several partial mitigations:

  • Adversarial retraining using perturbed data can harden models and explanation systems to such attacks (Mia et al., 4 Oct 2025).
  • Data filtering to exclude out-of-distribution samples reduces the success chance of surrogate manipulation (especially for LIME/SHAP, which are sensitive to sample distribution) (Slack et al., 2021).
  • Multi-explainer and ensemble strategies: deploying multiple, orthogonal explanation methods may expose inconsistencies if any single approach is manipulated. Cross-validating explanations can signal when an attack is underway (Mia et al., 4 Oct 2025); a minimal consistency check is sketched after this list.
  • Federated or decentralized learning frameworks may help defend against global model manipulation (where ME is enacted via model re-tuning rather than input), by making it harder for an attacker to centrally change the explanation landscape for all users.
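
A minimal version of the cross-explainer consistency check mentioned above could compare feature-importance rankings from two independent explainers and flag strong disagreement; the threshold and function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def explanation_consistency_alarm(shap_values, lime_weights, threshold=0.6):
    """Flag a possible explanation-manipulation attack when two orthogonal
    explainers disagree strongly on the feature-importance ranking."""
    rho, _ = spearmanr(np.abs(shap_values), np.abs(lime_weights))
    return rho < threshold   # True -> inconsistent explanations, route to manual review
```

In practice the threshold would be calibrated on clean data, since LIME and SHAP disagree to some degree even in the absence of an attack.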

A plausible implication is that as interpretability methods become central to auditing, regulatory compliance, and operational forensics, robustness against manipulation explanation attacks will be essential. Ongoing research is needed both to detect attacker-induced explanation divergence and to formalize properties (such as explanation stability under small perturbations) as a criterion for trustworthiness in XAI (Mia et al., 4 Oct 2025, Tang et al., 2021, Dombrowski et al., 2019).

7. Summary Table: ME Attack Impact Across Domains

| Domain | Attack Mechanism | XAI Method | Observed Effect |
|---|---|---|---|
| Phishing detection | Data poisoning, sparse input perturbation | SHAP, LIME | Explanations highlight benign rather than malicious features |
| Malware detection | Adversarial input, model manipulation | SHAP, LIME, IG | Shift in feature ranking (Spearman's $\rho \approx 0.92$) |
| Intrusion/fraud | Input-level ME attacks | SHAP, LIME | Concealment of key behavioral/transactional features |
| Model auditing | Surrogate behavior splitting | LIME, SHAP | Sensitive attributes omitted from top attributions, masking bias |

These results underline the pervasive and cross-domain challenges posed by ME attacks.


Manipulation Explanation (ME) represents a class of adversarial tactics against XAI, exposing vulnerabilities in core interpretability workflows. By remapping explanations through minimal input or model-level changes, ME can undermine the intended transparency and auditability of machine learning systems, endangering critical applications and eroding stakeholder trust. The literature illustrates both the technical mechanisms and the societal threats engendered by ME, and calls for a concerted focus on explanation robustness in future XAI research and deployment.
