Activation Perturbation for Exploration (APEX)
- The paper introduces APEX as a probing technique that injects Gaussian noise into hidden activations to interpolate between input-sensitive and model-driven behaviors.
- APEX employs Monte Carlo sampling over multiple noise scales to compute escape noise, providing precise diagnostics for sample regularity, semantic alignment, and backdoor detection.
- By modulating noise levels, APEX enables researchers to gain both local insights and global bias analysis without needing model retraining or ensemble methods.
Activation Perturbation for EXploration (APEX) is an inference-time probing paradigm for neural networks that systematically injects Gaussian noise into hidden activations, while holding both the model input and parameters fixed. APEX is designed to address limitations inherent in input-space and parameter perturbation approaches, providing a direct lens into the structure and regularities encoded in intermediate network representations. By varying the noise scale, APEX enables a controlled transition from sample-dependent, input-sensitive responses to model-driven, input-agnostic behaviors, offering both local and global perspectives on network decision processes (Ren et al., 3 Feb 2026).
1. Formalism and Algorithmic Specification
Consider an $L$-layer feed-forward network with pre-activations and post-activations given by
$$z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \qquad a^{(\ell)} = \phi\bigl(z^{(\ell)}\bigr), \qquad \ell = 1, \dots, L,$$
where $a^{(0)} = x$. APEX introduces additive Gaussian noise to each post-activation at inference:
$$\tilde a^{(\ell)} = \phi\bigl(\tilde z^{(\ell)}\bigr) + \sigma\, \epsilon^{(\ell)}, \qquad \epsilon^{(\ell)} \sim \mathcal N(0, I),$$
for each $\ell = 1, \dots, L-1$ and noise scale $\sigma \ge 0$. The final logits are
$$\tilde z^{(L)} = W^{(L)} \tilde a^{(L-1)} + b^{(L)},$$
and the predicted label is
$$\hat y(x; \sigma, \epsilon) = \arg\max_c \tilde z^{(L)}_c.$$
The empirical output distribution is estimated via $K$ Monte Carlo forward passes:
$$\hat p_c(x; \sigma) = \frac{1}{K} \sum_{k=1}^{K} \mathbf 1\bigl[\hat y(x; \sigma, \epsilon_k) = c\bigr].$$
APEX thus interpolates between sample-dependent ($\sigma \to 0$) and model-dependent ($\sigma \to \infty$) response regimes.
Algorithmically, for each input $x$, each chosen noise scale $\sigma$, and each of $K$ forward passes:
- Independently sample $\epsilon^{(\ell)} \sim \mathcal N(0, I)$ for all $\ell = 1, \dots, L-1$
- Inject $\sigma\, \epsilon^{(\ell)}$ after each layer’s activation in the network
- Record the resulting top-1 class $\hat y_k$
- Aggregate the class counts to yield $\hat p(x; \sigma)$
The escape noise $\sigma^{*}(x)$ for input $x$ is defined as the minimal $\sigma$ at which the original prediction’s probability $\hat p_{\hat y(x;0)}(x; \sigma)$ drops below a fixed threshold $\tau$.
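The loop above can be sketched in a few lines of NumPy. The two-hidden-layer ReLU network, its random weights, and names such as `apex_probe` and `escape_noise` are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-hidden-layer ReLU MLP with random weights (illustrative stand-in
# for a trained classifier with fixed parameters).
D_IN, D_H, N_CLASSES = 8, 32, 5
W1 = rng.normal(0, 1 / np.sqrt(D_IN), (D_H, D_IN))
W2 = rng.normal(0, 1 / np.sqrt(D_H), (D_H, D_H))
W3 = rng.normal(0, 1 / np.sqrt(D_H), (N_CLASSES, D_H))

def forward(x, sigma=0.0):
    """One APEX forward pass: Gaussian noise of scale sigma is added to
    every post-activation; the input x and the weights stay fixed."""
    a1 = np.maximum(W1 @ x, 0) + sigma * rng.normal(size=D_H)
    a2 = np.maximum(W2 @ a1, 0) + sigma * rng.normal(size=D_H)
    return W3 @ a2  # logits

def apex_probe(x, sigma, K=200):
    """Monte Carlo estimate of the label distribution p_hat(x; sigma)."""
    counts = np.zeros(N_CLASSES)
    for _ in range(K):
        counts[np.argmax(forward(x, sigma))] += 1
    return counts / K

def escape_noise(x, sigmas, tau=0.5, K=200):
    """Smallest sigma on the grid where the clean prediction's probability
    falls below the threshold tau (np.inf if it never does)."""
    y0 = int(np.argmax(forward(x, 0.0)))
    for s in sigmas:
        if apex_probe(x, s, K)[y0] < tau:
            return s
    return np.inf

x = rng.normal(size=D_IN)
p0 = apex_probe(x, 0.0)  # sigma = 0: the prediction is deterministic
sigma_star = escape_noise(x, sigmas=np.linspace(0.1, 20.0, 40))
```

Since each pass is independent, the `K` samples parallelize trivially; the sequential loop here is only for clarity.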
2. Theoretical Framework and Decomposition
At the core of APEX is a decomposition theorem that holds at every layer $\ell \in \{2, \dots, L\}$, for any input $x$ and all $\sigma > 0$:
$$\tilde z^{(\ell)} = \sigma\, g^{(\ell)}(\theta, \epsilon^{(1)}, \dots, \epsilon^{(\ell-1)}) + r^{(\ell)}(x, \sigma, \epsilon),$$
where $g^{(\ell)}$ is a function solely of the network parameters $\theta$ and the sampled noise up to layer $\ell$, while the residual $r^{(\ell)}$ is uniformly bounded in norm. At the output layer,
$$\tilde z^{(L)} = \sigma\, g^{(L)}(\theta, \epsilon) + r^{(L)}(x, \sigma, \epsilon).$$
The prediction simplifies in the large-noise limit:
$$\lim_{\sigma \to \infty} \hat y(x; \sigma, \epsilon) = \arg\max_c\, g^{(L)}_c(\theta, \epsilon).$$
Thus, as $\sigma \to \infty$, predictions become independent of $x$ and depend only on the random features $g^{(L)}(\theta, \epsilon)$. This demonstrates that APEX suppresses input-specific signals and amplifies the structural, representation-level aspects embedded in the model.
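The large-noise limit admits a quick numerical check under the same toy-MLP assumption (random weights, illustrative names): if two different inputs share the same noise draws, their predictions should coincide almost always once $\sigma$ is large, because the shared $\sigma\, g^{(L)}(\theta, \epsilon)$ term dominates the bounded, input-dependent residual:

```python
import numpy as np

rng = np.random.default_rng(1)
D_IN, D_H, C = 8, 32, 5
W1 = rng.normal(0, 1 / np.sqrt(D_IN), (D_H, D_IN))
W2 = rng.normal(0, 1 / np.sqrt(D_H), (D_H, D_H))
W3 = rng.normal(0, 1 / np.sqrt(D_H), (C, D_H))

def noisy_logits(x, sigma, eps1, eps2):
    # Explicit noise arguments so two inputs can share identical draws.
    a1 = np.maximum(W1 @ x, 0) + sigma * eps1
    a2 = np.maximum(W2 @ a1, 0) + sigma * eps2
    return W3 @ a2

def agreement(x1, x2, sigma, K=100):
    """Fraction of shared-noise draws on which two different inputs
    receive the same predicted label."""
    same = 0
    for _ in range(K):
        e1, e2 = rng.normal(size=D_H), rng.normal(size=D_H)
        same += np.argmax(noisy_logits(x1, sigma, e1, e2)) == \
                np.argmax(noisy_logits(x2, sigma, e1, e2))
    return same / K

x1, x2 = rng.normal(size=D_IN), rng.normal(size=D_IN)
agree_large = agreement(x1, x2, sigma=1e4)   # sigma*g term dominates
agree_small = agreement(x1, x2, sigma=1e-3)  # residual r dominates
```

At `sigma=1e4` disagreement requires the top two components of the noise-driven logits to be within $O(1/\sigma)$ of each other, which is vanishingly rare; at small sigma the two inputs simply keep their own clean predictions.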
Input perturbation, $x \mapsto x + \sigma\, \epsilon^{(0)}$, is shown to be a constrained form of activation perturbation: to first order, its induced change at layer $\ell$ is
$$\Delta a^{(\ell)} \approx \sigma\, J^{(\ell)}(x)\, \epsilon^{(0)},$$
where $J^{(\ell)}(x) = \partial a^{(\ell)} / \partial x$ is the Jacobian; input noise thus spans a low-dimensional subspace of the activation space (of dimension at most the input dimension), in contrast to the full-variance, unconstrained perturbations of APEX.
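The rank argument can be verified directly on a toy first layer (sizes and names are illustrative): shifts induced by input noise are confined to the column space of the first weight matrix, while APEX's direct activation noise is full-rank in the hidden space. The noise scale is omitted since scaling does not change rank:

```python
import numpy as np

rng = np.random.default_rng(2)
D_IN, D_H = 4, 32
W1 = rng.normal(size=(D_H, D_IN))
N = 64  # number of independent noise draws

# Layer-1 shifts induced by input-space noise: Delta z1 = W1 @ eps0.
# Every such shift lies in the column space of W1 (dimension <= D_IN).
input_shifts = np.stack([W1 @ rng.normal(size=D_IN) for _ in range(N)])

# Shifts injected directly into the hidden layer (APEX): isotropic in R^{D_H}.
apex_shifts = np.stack([rng.normal(size=D_H) for _ in range(N)])

rank_input = np.linalg.matrix_rank(input_shifts)  # at most D_IN = 4
rank_apex = np.linalg.matrix_rank(apex_shifts)    # generically D_H = 32
```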
3. Probing Regimes and Interpretive Phenomena
There exists a qualitative dichotomy between small- and large-noise regimes:
- Small-Noise Regime ($\sigma \ll 1$): The residual $r^{(\ell)}$ dominates. Predictions remain input-sensitive. In this regime, escape noise $\sigma^{*}$ correlates strongly with sample-regularity metrics such as memorization score and consistency/C-score (Spearman correlations of magnitude up to $0.9$ on ImageNet and CIFAR-100). APEX detects smooth semantic transitions in networks trained on controlled splits, exhibiting monotonic probability transfer that aligns with learned representations.
- Large-Noise Regime ($\sigma \gg 1$): The term $\sigma\, g^{(L)}(\theta, \epsilon)$ becomes dominant, rendering predictions input-agnostic. The network's output converges to a stationary, model-characteristic distribution. This regime exposes global biases: benign models exhibit high-entropy output distributions, whereas backdoored models demonstrate collapse of output probability onto the target class (near-zero entropy).
The framework enables computation of the normalized entropy,
$$\bar H(x; \sigma) = -\frac{1}{\log C} \sum_{c=1}^{C} \hat p_c(x; \sigma)\, \log \hat p_c(x; \sigma),$$
where $C$ is the number of classes, and quantification of target-class concentration $\hat p_t(x; \sigma)$, serving as diagnostics for backdoor detection and capacity-induced bias amplification.
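Both diagnostics need only a sampled label distribution $\hat p$. A minimal sketch follows; the two example distributions are made up for illustration and are not measured values:

```python
import numpy as np

def normalized_entropy(p):
    """Entropy of a label distribution p, normalized to [0, 1] by log C,
    with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    terms = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return float(-terms.sum() / np.log(len(p)))

def target_concentration(p, target):
    """Probability mass the large-noise distribution places on one class."""
    return float(np.asarray(p)[target])

# Hypothetical large-noise label distributions over C = 10 classes:
p_benign = np.full(10, 0.1)   # spread out -> normalized entropy near 1
p_backdoor = np.eye(10)[3]    # collapsed onto target class 3 -> entropy 0
```

A low `normalized_entropy` together with a high `target_concentration` at large sigma is the backdoor signature described above.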
4. Empirical Evaluations and Case Studies
APEX has been systematically validated through distinct probes:
| Probe Type | Quantitative Outcome | Interpretation |
|---|---|---|
| Sample Regularity | Spearman correlation of magnitude up to $0.9$ with memorization score | Effective, lightweight alternative to ensembles |
| Random-Label Models | Average escape noise decreases with more random labeling | Reveals fragmented, non-semantic decision regions |
| Semantic Alignment | Monotonic class transfer under activation noise only | Confirms structure in representation space |
| Backdoor Detection | Target-class mass near $1.0$ (backdoored) vs. below $0.2$ (benign); entropy collapse | Captures global, training-induced bias |
| Model Architecture Sensitivity | Deeper ResNets: stronger probability collapse; ViTs: partial, attenuated collapse | Architecture-dependent bias revelation |
In all cases, input- or parameter-level perturbations fail to exhibit the monotonicity, transition alignment, or diagnostic sharpness furnished by APEX.
5. Computational Complexity and Practical Implementation
Each estimate of $\hat p(x; \sigma)$ requires $K$ full forward passes per input and noise scale; the reported choices of $K$ differ between CIFAR- and ImageNet-scale models. The noise injection itself is a layerwise vector addition, which introduces minimal computational overhead. No model retraining or ensemble construction is necessary; all analysis occurs at inference with fixed weights and is trivially parallelizable over examples, noise scales, and Monte Carlo samples. The choice of the noise-scale grid and of the threshold $\tau$ for metrics such as escape noise allows sensitivity–cost trade-offs.
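One way to realize the parallelism over Monte Carlo samples is to batch the $K$ noisy copies through a single forward pass. A NumPy sketch under the same illustrative toy-network assumptions as before (random weights, hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(3)
D_IN, D_H, C, K = 8, 32, 5, 512
W1 = rng.normal(0, 1 / np.sqrt(D_IN), (D_H, D_IN))
W2 = rng.normal(0, 1 / np.sqrt(D_H), (D_H, D_H))
W3 = rng.normal(0, 1 / np.sqrt(D_H), (C, D_H))

def apex_probe_batched(x, sigma, K=K):
    """All K Monte Carlo samples in one batched forward pass: the input is
    tiled K times and each copy receives an independent noise draw."""
    X = np.tile(x, (K, 1))                                   # (K, D_IN)
    A1 = np.maximum(X @ W1.T, 0) + sigma * rng.normal(size=(K, D_H))
    A2 = np.maximum(A1 @ W2.T, 0) + sigma * rng.normal(size=(K, D_H))
    labels = np.argmax(A2 @ W3.T, axis=1)                    # (K,)
    return np.bincount(labels, minlength=C) / K

x = rng.normal(size=D_IN)
p_hat = apex_probe_batched(x, sigma=1.0)
```

On GPU frameworks the same trick amortizes the per-pass overhead, so the cost of APEX approaches that of a single batched inference per noise scale.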
APEX complements input and parameter perturbation probes—by acting directly on hidden representations, it accesses structural information that cannot be inferred from reachable input space alone.
6. Methodological Distinction and Conceptual Scope
APEX unifies local sample analysis and global model-bias probing within a single, theoretically grounded framework. It admits input perturbation as a constrained, degenerate instance, subsuming prior approaches in expressive power. The ability to interpolate between regimes by modulating $\sigma$ enables fine-grained scrutiny of memorization, regularity, semantic partitioning, and bias-induced collapse, revealing properties inaccessible to traditional probing techniques. The method's lightweight computational profile, combined with its ability to interrogate internal network structure, positions it as an effective probe for model interpretability, robustness diagnostics, and backdoor detection (Ren et al., 3 Feb 2026).