Mechanistic Anomaly Detection (MAD)

Updated 2 July 2026

Mechanistic Anomaly Detection (MAD) is a framework that distinguishes normal from anomalous internal processing in neural networks using trusted reference sets.
It employs functional attribution methods, including Bayesian Influence Functions, to quantify loss trace correlations and detect backdoors and adversarial manipulations.
Empirical evidence shows high detection accuracy across domains, with metrics such as DER and AUROC demonstrating its effectiveness in vision, language, and industrial applications.

Mechanistic Anomaly Detection (MAD) concerns the detection of atypical processing—or "mechanisms"—within data-driven systems, particularly deep neural networks. Unlike traditional anomaly detection, which flags deviations based on the statistical properties or correctness of output, MAD is fundamentally concerned with distinguishing whether an observed output arises from the model’s usual, trusted mechanisms or from an anomalous, potentially malicious or confounded process. This framework recasts anomaly detection as a question of internal mechanism attribution, with particular relevance in security, diagnostics, and foundation model reliability contexts (Keenan et al., 21 Apr 2026, Johnston et al., 9 Apr 2025).

1. Formal Definition and Theoretical Foundations

MAD is defined as follows: Given a trained neural network $f_{w^*}:\mathcal{X} \to \mathcal{Y}$ with parameter vector $w^*$, a defender is provided a small "trusted reference set" $D_T = \{(x_i, y_i)\}$ known to elicit normal internal behavior. For any test input $x_{\text{test}}$ (possibly with unknown label), the goal is to determine whether its processing engages the same internal mechanisms as $D_T$, or instead uses a distinct (e.g., backdoored or adversarial) mechanism (Keenan et al., 21 Apr 2026).

Causal extensions of MAD, as in root cause analysis, introduce a structural distinction between mechanistic anomalies—arising from shifts in the generating process—and measurement anomalies due to observational corruption. In such models, mechanistic anomalies are formalized as hard interventions $Z_{i,j}$ on latent variables, and measurement anomalies as interventions $W_{i,j}$ on observed variables. Identifiability of these sources is established up to conditional independence equivalence of the induced (mutilated) DAGs, under the faithfulness assumption (Suhr et al., 30 Jan 2026).

2. Functional Attribution via Influence Functions

The principal mechanism for MAD in neural networks is functional attribution: measuring the extent to which a sample from the trusted set explains (influences) the output or internal processing of a test sample.

Classical Influence Function: Measures influence as $\mathrm{IF}(z_i, \phi) = -\nabla_w \phi(w^*)^\top H^{-1} \nabla_w \ell(z_i; w^*)$, with $H$ the Hessian of the loss. This is typically intractable for deep nets (Keenan et al., 21 Apr 2026).
Bayesian Influence Function (BIF): Replaces point estimates with posterior averages: $\mathrm{BIF}(z_i, \phi) = -\operatorname{Cov}_{w \sim p_\beta(w|D)}[\ell(z_i; w), \phi(w)]$. No Hessian inverse is required; posterior sampling, often via SGLD, is used (Keenan et al., 21 Apr 2026).
Localized BIF: Focuses on the trained $w^*$ by introducing a Gaussian "probe" prior centered at $w^*$.
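
As a minimal sketch of the BIF estimate, assume posterior draws $w_1, \dots, w_T$ are already available and that the per-draw loss $\ell(z_i; w_t)$ and observable $\phi(w_t)$ have been recorded. The estimate is then the negative sample covariance of the two sequences. The function name `bif_estimate` and the toy traces are illustrative and do not come from the cited work.

```python
def bif_estimate(loss_trace, obs_trace):
    """Negative sample covariance between l(z_i; w_t) and phi(w_t) over draws t."""
    if len(loss_trace) != len(obs_trace) or len(loss_trace) < 2:
        raise ValueError("need two equal-length traces with at least 2 draws")
    n = len(loss_trace)
    mean_loss = sum(loss_trace) / n
    mean_obs = sum(obs_trace) / n
    cov = sum((l - mean_loss) * (p - mean_obs)
              for l, p in zip(loss_trace, obs_trace)) / (n - 1)
    return -cov

# Hypothetical values recorded over four posterior draws.
loss_trace = [0.9, 1.1, 1.0, 1.2]
obs_trace = [0.4, 0.6, 0.5, 0.7]
bif = bif_estimate(loss_trace, obs_trace)
```

A negative value here means the training loss and the observable rise and fall together across draws; only sampled losses are needed, with no Hessian inverse.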

The functional coupling between test and trusted samples is operationalized as the Pearson or related correlation of their "loss traces"—the sequence of loss values across posterior parameter samples.
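
The coupling measure can be illustrated with a short sketch, assuming loss traces have already been collected as equal-length lists of floats (one value per posterior draw). The trace values below are invented for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(dev_a, dev_b))
    den = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    return num / den if den > 0 else 0.0

# Hypothetical loss traces over five posterior draws.
trusted_trace = [0.50, 0.62, 0.47, 0.71, 0.55]
coupled_trace = [0.30, 0.41, 0.28, 0.52, 0.35]    # moves with the trusted trace
decoupled_trace = [0.40, 0.31, 0.45, 0.29, 0.38]  # moves against it

r_coupled = pearson(trusted_trace, coupled_trace)
r_decoupled = pearson(trusted_trace, decoupled_trace)
```

A high positive correlation suggests that the same parameter directions drive both losses, which is the sense in which two samples share a mechanism.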

3. MAD Algorithmic Workflow and Scoring

The generic MAD workflow comprises:

Reference Set Construction: Split clean data into a sampling set (used to drive SGLD) and a trusted set $D_T$ (which defines the normal mechanisms).
Observable Definition: For the test input $x_{\text{test}}$, define the observable $\phi(w)$ as the loss on $x_{\text{test}}$ evaluated at parameters $w$.
Sampling: Run SGLD chains from $w^*$ on the sampling set, and collect loss traces for $x_{\text{test}}$ and for all $z_i \in D_T$.
Correlation: Compute the correlation (default: Pearson) between the loss trace of $x_{\text{test}}$ and that of each $z_i \in D_T$.
Aggregation: Aggregate correlations over the trusted set (mean or class-clustered strategies).
Thresholding: Assign normal or anomalous status based on thresholded detection score (Keenan et al., 21 Apr 2026).
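
The steps above can be sketched end to end on a toy problem. Everything in this example is an assumption made for illustration: a two-weight logistic model with hand-set "trained" weights `w_star`, a trigger feature (the second input coordinate) that clean data never uses, the model's own prediction standing in for the unknown test label, and arbitrary SGLD settings and threshold. It is not the cited authors' implementation.

```python
import math
import random

def softplus(t):
    # Numerically stable log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def logit(w, x):
    return w[0] * x[0] + w[1] * x[1]

def loss(w, x, y):
    # Logistic loss for a label y in {0, 1}.
    return softplus(-logit(w, x)) if y == 1 else softplus(logit(w, x))

def grad(w, x, y):
    # Gradient of the logistic loss with respect to w.
    p = 1.0 / (1.0 + math.exp(-logit(w, x)))
    return [(p - y) * x[0], (p - y) * x[1]]

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den > 0 else 0.0

def sgld_loss_traces(w_star, sampling_set, points, steps=2000, burn_in=200,
                     eps=0.01, gamma=10.0, batch_size=2, seed=0):
    # Localized SGLD: rescaled minibatch gradient plus a Gaussian pull
    # toward w_star, with injected noise of variance eps per step.
    rng = random.Random(seed)
    w = list(w_star)
    scale = len(sampling_set) / batch_size
    traces = [[] for _ in points]
    for step in range(steps):
        g = [0.0, 0.0]
        for x, y in rng.sample(sampling_set, batch_size):
            gx = grad(w, x, y)
            g[0] += scale * gx[0]
            g[1] += scale * gx[1]
        for k in range(2):
            drift = g[k] + gamma * (w[k] - w_star[k])
            w[k] += -0.5 * eps * drift + math.sqrt(eps) * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            for trace, (x, y) in zip(traces, points):
                trace.append(loss(w, x, y))
    return traces

w_star = [2.0, 3.0]  # hand-set stand-in for trained weights (assumption)
sampling_set = [((s * m, 0.0), 1 if s > 0 else 0)
                for s in (-1, 1) for m in (0.5, 1.0, 1.5)]
trusted_set = [((-1.2, 0.0), 0), ((0.8, 0.0), 1), ((1.4, 0.0), 1)]
test_inputs = {"clean": (1.0, 0.0), "triggered": (-1.0, 2.0)}

# Unknown test labels are replaced by the prediction at w_star (assumption).
test_points = [(x, 1 if logit(w_star, x) > 0 else 0) for x in test_inputs.values()]
traces = sgld_loss_traces(w_star, sampling_set, trusted_set + test_points)
trusted_traces = traces[:len(trusted_set)]

threshold = 0.5  # illustrative cut-off on the mean correlation
scores = {}
for name, trace in zip(test_inputs, traces[len(trusted_set):]):
    scores[name] = sum(pearson(trace, t) for t in trusted_traces) / len(trusted_traces)
flags = {name: score < threshold for name, score in scores.items()}
```

In this toy, the clean input's loss depends on the same weight as the trusted samples, so its trace correlates strongly with theirs, while the triggered input's loss is driven mostly by the otherwise unused trigger weight and falls below the threshold.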

The method is modality-agnostic, requiring only loss/gradient tools, and does not rely on particular architecture or perturbation schemes.

4. Evaluation Metrics and Empirical Results

The central metric in MAD for classifier backdoors is the Defense Effectiveness Rating (DER):

$$\mathrm{DER} = \frac{\max(0, \Delta\mathrm{ASR}) - \max(0, \Delta\mathrm{ACC}) + 1}{2}$$

where $\Delta\mathrm{ASR}$ is the attack success rate reduction, and $\Delta\mathrm{ACC}$ is the clean accuracy drop after detection (Keenan et al., 21 Apr 2026).
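
As a worked example, the sketch below computes DER under the definition commonly used with BackdoorBench-style evaluations, $\mathrm{DER} = (\max(0, \Delta\mathrm{ASR}) - \max(0, \Delta\mathrm{ACC}) + 1)/2$, which is assumed here to be the form intended. The function name and the before/after rates are hypothetical.

```python
def defense_effectiveness_rating(asr_before, asr_after, acc_before, acc_after):
    """DER in [0, 1]: rewards attack-success reduction, penalizes clean-accuracy loss."""
    delta_asr = asr_before - asr_after   # attack success rate reduction
    delta_acc = acc_before - acc_after   # clean accuracy drop
    return (max(0.0, delta_asr) - max(0.0, delta_acc) + 1.0) / 2.0

# Hypothetical rates: ASR falls from 95% to 5%, clean accuracy from 93% to 91%.
der = defense_effectiveness_rating(0.95, 0.05, 0.93, 0.91)
```

With these invented numbers the rating is (0.90 - 0.02 + 1) / 2 = 0.94; a defense that changes nothing scores 0.5.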

Empirical results:

Vision Backdoors: On BackdoorBench (CIFAR-10, CIFAR-100, GTSRB, Tiny-ImageNet), MAD achieves an average DER of 0.93 across attacks; offline UMAP-based methods push DER to ≈0.97.
Language Backdoors: For Gemma 2-2B and Llama 3.1 8B, AUROC is 0.99–1.00 for simple triggers, ≈0.98 for complex triggers, and 1.00 even with obfuscated backdoors—where latent-space detectors fail.
Other Pathologies: For adversarial/OOD, pairwise loss-trace correlations cluster clean, backdoor, adversarial, and OOD samples distinctly; AUROC ≈ 86% (near-OOD), ≈94% (far-OOD) in benchmarks.
Mechanistic Multiplicity: Dual-backdoor LLMs yield three discrete clusters corresponding to benign and both attack modes (Keenan et al., 21 Apr 2026).
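
For readers unfamiliar with the AUROC figures quoted above, the following sketch computes AUROC directly from detection scores using the pairwise (Mann-Whitney) formulation. It assumes the anomaly score is the negated mean loss-trace correlation, so that anomalous inputs should score higher; the correlation values are invented for illustration.

```python
def auroc(anomalous_scores, normal_scores):
    """Probability that a random anomalous input outscores a random normal one.

    Ties count as one half.
    """
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous_scores) * len(normal_scores))

# Hypothetical mean correlations with the trusted set.
normal_corr = [0.92, 0.88, 0.95, 0.40]
anomalous_corr = [0.10, -0.20, 0.55]

# Lower correlation means more anomalous, so negate to obtain anomaly scores.
value = auroc([-c for c in anomalous_corr], [-c for c in normal_corr])
```

Here 11 of the 12 anomalous/normal pairs are ordered correctly, giving an AUROC of about 0.92.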

Alternative MAD frameworks for time series (MAD-Transformer) and causal process data (structural MLE) exhibit high F1, recall, and localization accuracy, often surpassing multi-branch attention and kernel density estimation baselines (Sun et al., 2024, Suhr et al., 30 Jan 2026).

5. Mechanistic Anomaly Detection in LLMs

"Mechanistic Anomaly Detection for 'Quirky' LLMs" investigates MAD in the context of LLMs, using a range of internal features:

Internal Features: Residual activations at each layer, attribution features (via mean/PC/gradient ablations), probe accuracy shifts, SAE latent activations, and normalizing-flow-derived Mahalanobis/Laplace scores.
Scoring Rules: Both online (Mahalanobis, LOF, SAE-based) and offline (quantum entropy, likelihood ratio, GMM) regimes are studied.
Evaluation: AUROC is the primary measure. Arithmetic tasks see near-perfect detection performance; non-arithmetic tasks (sentiment, SciQ, NLI) yield substantially lower detection rates.
Limitations: Detector consistency varies across tasks and models; reliance on activation separability, label-imbalance susceptibility, and attribution computation cost are major factors.
Future Directions: Development of end-to-end anomaly networks, meta-learned thresholds, and high-fidelity overseer benchmarks are identified as next steps (Johnston et al., 9 Apr 2025).
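
As an illustration of the online Mahalanobis scoring rule mentioned above, the sketch below fits a mean and covariance to trusted activation vectors and scores a new activation by its squared Mahalanobis distance. The two-dimensional "activations", the ridge term, and all function names are assumptions chosen to keep the example self-contained; real residual-stream activations are far higher-dimensional.

```python
def mean_and_cov(rows, ridge=1e-6):
    """Sample mean and covariance (with a small ridge on the diagonal)."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            + (ridge if i == j else 0.0) for j in range(d)] for i in range(d)]
    return mu, cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, d):
            f = m[r][c] / m[c][c]
            for k in range(c, d + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * d
    for r in range(d - 1, -1, -1):
        x[r] = (m[r][d] - sum(m[r][k] * x[k] for k in range(r + 1, d))) / m[r][r]
    return x

def mahalanobis_sq(v, mu, cov):
    """Squared Mahalanobis distance of v from the trusted distribution."""
    diff = [vi - mi for vi, mi in zip(v, mu)]
    return sum(di * si for di, si in zip(diff, solve(cov, diff)))

# Hypothetical trusted activations and two test activations.
trusted = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1), (0.8, 1.8)]
mu, cov = mean_and_cov(trusted)
typical_score = mahalanobis_sq((1.0, 2.0), mu, cov)
atypical_score = mahalanobis_sq((3.0, 0.0), mu, cov)
```

An activation far from the trusted cloud receives a large score; such a detector relies on anomalous and normal activations being separable, which is the dependency noted under Limitations.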

6. Causal and Industrial Extensions

Causal MAD distinguishes mechanistic (process-shifted) from measurement (observationally shifted) anomalies by modeling both as explicit interventions on distinct nodes within the underlying causal DAG. This distinction enables root cause localization and type classification using maximum likelihood estimation over assignments of anomaly indicators, achieving top-$k$ recall up to 0.98 and classification accuracy near 90% in synthetic and real-world (Sachs, Causal Chambers, NYC Taxi) settings (Suhr et al., 30 Jan 2026).
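
The difference between the two anomaly types can be made concrete with a toy maximum-likelihood sketch. It assumes a three-node linear Gaussian chain with known coefficient `a` and noise scale `sigma`, exactly one anomaly per observation, and a broad Gaussian (scale `tau`) for the anomalous value. A mechanistic anomaly replaces the latent value, so downstream nodes follow it; a measurement anomaly corrupts only the reading, so downstream nodes follow the unobserved true value. This is a deliberately simplified stand-in, not the estimator of the cited work.

```python
import math

def log_normal(x, mean, sd):
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def log_likelihood(x, j, kind, a=1.0, sigma=0.1, tau=5.0):
    """Log-likelihood of one chain observation x given a single anomaly
    of the given kind ('mechanistic' or 'measurement') at node j."""
    total = 0.0
    for i in range(len(x)):
        parent = x[i - 1] if i > 0 else 0.0
        if i == j:
            # The anomalous value itself is drawn from a broad distribution.
            total += log_normal(x[i], 0.0, tau)
        elif kind == "measurement" and i == j + 1:
            # The true parent is unobserved: marginalize it out via the grandparent.
            grandparent = x[i - 2] if i > 1 else 0.0
            total += log_normal(x[i], a * a * grandparent,
                                sigma * math.sqrt(1.0 + a * a))
        else:
            total += log_normal(x[i], a * parent, sigma)
    return total

def localize(x):
    """Return the (node, kind) hypothesis with the highest likelihood."""
    hypotheses = [(j, kind) for j in range(len(x))
                  for kind in ("mechanistic", "measurement")]
    return max(hypotheses, key=lambda h: log_likelihood(x, h[0], h[1]))

# Node 1 jumps and node 2 follows it: the process itself shifted.
mechanistic_case = localize([0.05, 3.0, 3.1])
# Node 1 jumps but node 2 stays near the root: only the reading was corrupted.
measurement_case = localize([0.05, 3.0, 0.1])
```

In this toy, the two hypotheses yield identical likelihoods when the anomaly sits at the last node of the chain, since no downstream variable is available to tell them apart.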

For industrial cyber-physical systems, MAD-Transformer leverages temporal and spatial state matrices and structured three-branch attention, combining alignment and reconstruction objectives for interpretable, fine-grained anomaly detection and localization, with superior F1 and localization performance across service monitoring, earth exploration, and water treatment datasets (Sun et al., 2024).

7. Strengths, Limitations, and Directions for Advancement

Strengths

Modality-agnostic: No reliance on specific network architectures or input domain properties.
Obfuscation-robust: Mechanism-based signatures cannot be trivially masked by latent space manipulation.
Theoretically principled: Coupling and anomaly detection grounded in Hessian eigenspace analysis and structural identifiability.
High empirical precision for backdoor and adversarial detection, with state-of-the-art metrics across benchmarks.

Limitations

Computational overhead: The cost of SGLD runs and correlation evaluations grows with the number of posterior draws and the size of the trusted set, though in practice convergence after a limited number of draws mitigates this.
Trusted reference dependency: Clean, mechanism-verified samples are required, but performance degrades gracefully with minor contamination.
Hyperparameter tuning: SGLD parameters require calibration, but are stable across a broad range.

Ongoing and Future Work

Efficient posterior sampling: Higher-order integrators and low-rank approximations.
Ensemble detection: Fusion with latent space and alternative anomaly detection methods.
Scaling to larger models: Empirical investigation on larger tasks, reference distribution construction.
Generalized observables: Use of more sensitive outputs (e.g., KL divergences) to improve detection of adversarial or out-of-distribution cases.
End-to-end training: For LLMs, rich anomaly networks and meta-learned thresholds.

Mechanistic Anomaly Detection thus represents a rigorous, flexible framework for safeguarding neural models and complex data-driven systems against illicit or confounding internal mechanism shifts, with demonstrated tractability and precision across modalities and problem domains (Keenan et al., 21 Apr 2026, Suhr et al., 30 Jan 2026, Johnston et al., 9 Apr 2025, Sun et al., 2024).