
LSA-Probe: Latent Stability in Neural Models

Updated 10 February 2026
  • LSA-Probe is a diagnostic method that measures the fragility of neural networks by applying minimal adversarial perturbations in latent spaces.
  • It quantifies model behavior by computing changes in negative log-likelihood, providing clear metrics of robustness and vulnerability.
  • Empirical evaluations demonstrate its effectiveness for safety auditing of large language models and membership inference in diffusion-based generative models.

The Latent Stability Adversarial Probe (LSA-Probe) is a diagnostic and adversarial methodology for quantifying the local stability—or fragility—of neural network models to perturbations within their latent or hidden state spaces. LSA-Probe has been developed and applied in settings ranging from LLM safety alignment to membership inference attacks in diffusion-based generative modeling. Its core mechanism is to characterize model behavior under carefully crafted, often minimal, latent perturbations, providing a metric of robustness, a vector for adversarial attack, and a blueprint for defensive retraining.

1. Formal Specification and Theoretical Foundations

The LSA-Probe framework is designed to operate on autoregressive models (such as LLMs) and diffusion models, targeting their hidden activations or intermediate latent representations. The crucial insight is that well-aligned or well-trained models can still exhibit significant sensitivity to minor deviations in these internal states, which may result in unsafe outputs, membership exposure, or other undesirable behaviors (Gu et al., 19 Jun 2025, Liu et al., 2 Feb 2026).

In LLM alignment, LSA-Probe is formalized as the negative log-likelihood (NLL) of a model’s original “safe” output y given a prompt x and model parameters θ:

\mathrm{NLL}(x, y) = -\sum_{t=1}^{|y|}\log \pi_\theta(y_t \mid x, y_{<t})

Here, π_θ is the model’s token probability and |y| is the response length. After applying an adversarial perturbation δ to the hidden activation h_t^(ℓ) at layer ℓ and step t, the increase in NLL

\Delta \mathrm{NLL} = \mathrm{NLL}_{\text{perturbed}} - \mathrm{NLL}_{\text{clean}}

quantifies the loss of stability or confidence in the safe response. A similar framework is adapted for diffusion models by measuring the minimal perturbation (budget) needed to induce a perceptual degradation in the reconstruction of an input sample (Liu et al., 2 Feb 2026).
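The NLL bookkeeping above can be sketched in a few lines. This is a minimal illustration with hypothetical per-token probabilities standing in for π_θ; it is not the paper's implementation.

```python
import math

def nll(token_probs):
    """Negative log-likelihood of a response, given per-token
    probabilities pi_theta(y_t | x, y_{<t})."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for the same safe reply,
# before and after a latent perturbation.
clean_probs = [0.9, 0.8, 0.95]
perturbed_probs = [0.5, 0.4, 0.6]

# A positive gap means the perturbation eroded the model's
# confidence in its original safe response.
delta_nll = nll(perturbed_probs) - nll(clean_probs)
```

A perturbation that leaves the safe reply equally likely yields ΔNLL ≈ 0; the larger the gap, the more fragile the model is at that point in latent space.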

2. Latent Perturbation Procedure and Algorithmic Recipe

In practice, LSA-Probe operationalizes this concept via intentional, statistically controlled perturbations in chosen latent subspaces or layers. For LLMs, the injection proceeds as follows:

  • For each target layer ℓ and prompt-response pair (x, y), the clean forward pass yields h_t^(ℓ) and the base NLL.
  • For each of M sampled directions, a perturbation δ ~ N(0, I_d) (matching the layer’s hidden size d) is normalized to the mean and standard deviation of h_t^(ℓ).
  • The adjusted perturbation δ′ is injected additively into every h_t^(ℓ) across steps, and the NLL is recomputed.

Pseudocode summary (using the notation of (Gu et al., 19 Jun 2025)):

for each layer l in L:
    record activations h_t^{(l)} and compute clean NLL_0
    for m in 1...M:
        delta ~ N(0, I_d)
        delta' = mu(h_t^{(l)}) + (delta - mu(delta))/sigma(delta) * sigma(h_t^{(l)})
        for t in 1...|y|:
            h_t^{(l)} += delta'
        NLL_m = recompute NLL under perturbed logits
        Delta_NLL_m = NLL_m - NLL_0

The maximal ΔNLL across layers and directions serves as a global “fragility” score.
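The statistical normalization step in the pseudocode (matching δ to the activation's first two moments) is easy to get wrong, so here is a runnable sketch of just that step, assuming a NumPy vector stands in for the layer activation; the hidden size and distribution are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_to_activation(delta, h):
    """Rescale a raw Gaussian direction so its mean and standard
    deviation match those of the clean activation h (the delta'
    construction in the pseudocode above)."""
    return h.mean() + (delta - delta.mean()) / delta.std() * h.std()

h = rng.normal(loc=2.0, scale=0.5, size=4096)   # stand-in hidden state
delta = rng.standard_normal(4096)
delta_p = normalize_to_activation(delta, h)
# delta_p now has the same mean and std as h, so injecting it
# perturbs the activation without shifting its overall statistics.
```

Matching moments keeps the perturbation in-distribution for the layer, which is what makes the probe a test of local stability rather than of out-of-range inputs.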

In diffusion models, the probe asks for the minimal adversarial budget η (in appropriately normalized latent space) required to exceed a fixed perceptual degradation τ between the original and perturbed reconstructions. The solution employs projected gradient descent (PGD) with norm constraints and an outer bisection to identify the point where D(x̂_0, x̂_0^{δ̃}) ≥ τ, with D a perceptual distance (e.g., CDPAM, MR-STFT) (Liu et al., 2 Feb 2026).
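The outer bisection can be sketched independently of the inner PGD attack. In this toy version, `degradation_at` is a hypothetical stand-in for "run PGD at budget η and measure the perceptual distance D"; only the bisection logic is faithful to the description above.

```python
def minimal_budget(degradation_at, tau, lo=0.0, hi=1.0, iters=40):
    """Outer bisection for the smallest budget eta such that
    degradation_at(eta) >= tau, assuming degradation grows
    monotonically with the budget. The inner PGD attack is
    abstracted into degradation_at."""
    assert degradation_at(hi) >= tau, "hi must already exceed tau"
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if degradation_at(mid) >= tau:
            hi = mid          # threshold crossed: shrink from above
        else:
            lo = mid          # not yet degraded enough: raise floor
    return hi

# Toy stand-in: degradation proportional to budget, so the
# threshold tau = 1.0 is crossed at eta = 0.5.
eta_star = minimal_budget(lambda eta: 2.0 * eta, tau=1.0)
```

The bisection halves the search interval each iteration, so 40 iterations pin η to roughly 12 decimal digits of the interval width regardless of how expensive each PGD call is.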

3. Adversarial Exploitation: Activation Steering Attack (ASA)

LSA-Probe not only quantifies but also actively exploits latent fragility through the Activation Steering Attack (ASA), which identifies “jailbreak” trajectories in model activation space (Gu et al., 19 Jun 2025). There are two principal variants:

  • ASA_random: A sampled, statistically normalized perturbation δ′ is injected at the most fragile layer repeatedly across all generation steps, progressively driving model outputs toward unsafe continuations via autoregressive drift.
  • ASA_grad: Identifies a targeted adversarial direction by backpropagating from a target harmful suffix y* (e.g., “how to build a bomb…”) to obtain the gradient g with respect to the internal activations. The adversarial perturbation is constructed as δ′ = α·sign(g) and applied to the original prompt’s activations, strongly biasing generation toward the target unsafe output.

Both approaches are explicitly formulated to maximize the NLL of the original safe reply, providing a principled and generative construction of attack vectors:

\max_{\delta:\;\|\delta\|_{\mathrm{stat}}=0} \left[ -\sum_{t}\log \pi_\theta(y_t \mid x, y_{<t}; \delta)\right]
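The sign-gradient construction used by ASA_grad is a single elementwise operation once the activation gradient is in hand. A minimal sketch, assuming a NumPy vector stands in for the gradient g obtained by backpropagation (the gradient values and step size α here are illustrative):

```python
import numpy as np

def asa_grad_step(grad, alpha):
    """Construct the adversarial latent perturbation
    delta' = alpha * sign(g), where g is the gradient of the
    target-suffix loss with respect to the hidden activations."""
    return alpha * np.sign(grad)

g = np.array([0.3, -1.2, 0.0, 4.5])   # hypothetical activation gradient
delta_p = asa_grad_step(g, alpha=0.1)
# each coordinate moves by +/- alpha in the direction that
# increases the target-suffix likelihood; zero-gradient
# coordinates are left untouched
```

Using only the sign of the gradient makes the step magnitude uniform across coordinates, so a single scalar α controls the perturbation budget.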

4. Defensive Mechanisms: Layer-wise Adversarial Patch Training (LAPT)

To mitigate the vulnerabilities exposed by LSA-Probe and ASA, Layer-wise Adversarial Patch Training (LAPT) is introduced as a targeted fine-tuning protocol. The defense operates as follows (Gu et al., 19 Jun 2025):

  • For each training sample (x,y)(x, y) and each “peak” fragile layer (as pre-identified by LSA-Probe), inject fresh random normalized perturbations during each forward pass.
  • Compute the cross-entropy loss over the perturbed logits:

\mathcal{L}_{\mathrm{LAPT}}(\theta) = -\sum_{t=1}^{|y|}\log \pi_\theta(y_t \mid x, y_{<t}; \delta)

  • Perform gradient-descent updates on all parameters θ, encouraging the model to produce the safe output even under latent disturbances.
  • Optionally, interpolate the fine-tuned and original weights to recover any small downstream performance loss.

This regime reduces attack success rates significantly without degrading standard evaluation benchmarks such as GSM8K and CommonsenseQA.
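The optional final step, interpolating fine-tuned and original weights, is a simple per-parameter blend. A sketch under stated assumptions: the paper only says interpolation is optional, so the coefficient λ and the dict-of-arrays parameter layout here are illustrative, not the authors' recipe.

```python
import numpy as np

def interpolate_weights(theta_orig, theta_lapt, lam):
    """Linear interpolation between original and LAPT-fine-tuned
    parameters: lam = 1.0 keeps the full patch, lam = 0.0 reverts
    it entirely.  (lam is an assumed knob; the source does not
    specify a value.)"""
    return {k: lam * theta_lapt[k] + (1.0 - lam) * theta_orig[k]
            for k in theta_orig}

theta_0 = {"w": np.zeros(3)}    # original weights (toy)
theta_ft = {"w": np.ones(3)}    # LAPT-fine-tuned weights (toy)
theta_mix = interpolate_weights(theta_0, theta_ft, lam=0.7)
```

Sweeping λ trades residual attack robustness against recovery of any small downstream performance loss, which is the trade-off the interpolation step is meant to tune.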

5. Empirical Evaluation and Quantitative Findings

Extensive experiments validate the efficacy and insights of LSA-Probe methodology across multiple model classes.

For LLM safety alignment (Gu et al., 19 Jun 2025):

  • Tested on 12 open LLMs (1.5B–70B parameters) using ASABench (4,862 adversarial cases).
  • Attack metrics:
    • MASR (Max-layer Attack Success Rate): 0.9–1.0 for ASA_random; ≈1.0 for ASA_grad.
    • PASR (Peak-layer Attack Success Rate): 0.4–0.7 (ASA_random), boosted by 0.2–0.4 with ASA_grad.
  • Defensive retraining with LAPT on the top-3 fragile layers reduces PASR by 0.14–0.35 on average, with <0.05 drop in general task accuracy.

For diffusion model membership inference (Liu et al., 2 Feb 2026):

  • Evaluated on MusicLDM and DiffWave using MAESTRO and FMA datasets.
  • LSA-Probe outperformed loss-based and trajectory-based MIA baselines in low false-positive rate regimes (TPR@1% and AUC improvements up to +0.08 and +0.04, respectively).
  • Geometric interpretation: training members reside in “flatter” regions (lower local Jacobian norm), necessitating a larger perturbation budget to cross a fixed perceptual degradation threshold—a finding borne out by the empirical gap in measured C_adv.

6. Geometric and Dynamical Insights

LSA-Probe exposes the local topography of model response surfaces in latent space. In diffusion models, the relationship between the required adversarial budget and the local Jacobian norm of the reverse mapping is made explicit:

C_{\mathrm{adv}}(x_0; t, \tau) \approx \frac{\tau}{\sigma_t\,\|J_{R_t}(x_t)\|}

where “members” have lower ‖J_{R_t}‖, making them more stable and thus harder to perturb beyond the threshold. For LLMs, findings show that fragile directions often correspond to under-trained or insufficiently “aligned” activation subspaces; small perturbations along these axes can unlock latent unsafe behaviors, often in a layer- and direction-specific manner.
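The member/non-member gap follows directly from the budget formula: at fixed τ and σ_t, a smaller Jacobian norm means a larger required budget. A quick numeric check with illustrative values (τ, σ_t, and the norms are made up for the example):

```python
def adv_budget(tau, sigma_t, jacobian_norm):
    """Approximate adversarial budget
    C_adv(x_0; t, tau) ~= tau / (sigma_t * ||J_{R_t}(x_t)||)."""
    return tau / (sigma_t * jacobian_norm)

tau, sigma_t = 0.5, 0.1
# Members sit in flatter regions (lower local Jacobian norm) ...
member_budget = adv_budget(tau, sigma_t, jacobian_norm=2.0)
non_member_budget = adv_budget(tau, sigma_t, jacobian_norm=8.0)
# ... so they need a larger budget to cross the same threshold.
```

Thresholding the measured budget therefore separates members from non-members, which is exactly the membership-inference signal exploited above.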

Latent-state geometry analyses (Chia et al., 12 Mar 2025) further reveal that safe and jailbroken activations concentrate in distinct clusters (“attractors”) in low-dimensional projections, and that linear perturbations can drive transitions between these basins.

7. Broader Applications and Implications

LSA-Probe serves as a model-agnostic diagnostic, attack, and defense toolkit:

  • Red-teaming: Probing and surfacing latent vulnerabilities undetectable by input-level perturbations.
  • Membership inference and forensics: Discriminating between members and non-members in generative models via local stability metrics.
  • Safety auditing and training-time defenses: Proactive identification and patching of latent subspaces vulnerable to adversarial manipulation.
  • Theoretical advancement: Bridges empirical findings with geometric and dynamical systems perspectives, connecting attractor basin transitions to adversarial state manipulation.

A plausible implication is that current surface-level behavioral alignment and privacy protocols may be insufficient unless complemented by representation-level robustness, as LSA-Probe exposes and quantifies vulnerabilities that input-level checks cannot anticipate (Gu et al., 19 Jun 2025, Chia et al., 12 Mar 2025, Liu et al., 2 Feb 2026).
