Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Published 30 Apr 2026 in cs.CR and cs.AI | (2604.28129v1)

Abstract: Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation -- but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows each of synthetic, LMSYS-Chat-1M, and SafeDialBench captures distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels (benign/pivoting/adversarial) unique to our synthetic dataset are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.


Authors (1)

Prashant Kulkarni

Summary

The paper presents a novel latent adversarial detection approach that leverages activation trajectory features to reveal multi-turn attack dynamics.
It combines a contrastive MLP for style-invariant embedding with XGBoost classification, boosting detection from 76.2% to 93.8% at 3.5% false positives.
Rigorous experiments on synthetic and real-world datasets demonstrate robustness and the necessity for model-specific probe training.

Latent Adversarial Detection: Activation Trajectory Probing for Multi-Turn Attack Detection

Introduction and Motivation

Latent Adversarial Detection (LAD) detects multi-turn prompt injection and jailbreaks in LLMs from the model's activations rather than from the conversation text. The core insight is that a multi-stage attack path (trust-building, pivoting, escalation) shows up not only in the tokens but also in how the LLM's internal state evolves across conversation turns: attack conversations trace irregular trajectories with elevated cumulative path length in the model's residual stream, a signature the paper calls adversarial restlessness.

Figure 1: The LAD two-stage pipeline. Stage 1: a contrastive MLP projects raw activations ($d=5120$) into a 128-dim style-invariant embedding. Stage 2: XGBoost classifies the embedding concatenated with 5 trajectory scalars (133 features).

Text-based detection (pattern matching, perplexity, input screening) generalizes poorly and is brittle under surface-level paraphrase. LAD instead probes activations at turn boundaries to detect adversarial drift, a signature that is invariant to how an attack is worded but sensitive to the underlying steering of the model.

Dataset Construction and Labeling Paradigm

A central technical contribution is the construction of a synthetic multi-turn dataset containing turn-level three-phase annotations: benign, pivoting, and adversarial. Unlike existing benchmarks (LMSYS-Chat-1M, SafeDialBench, MHJ), this labeling allows probes to target the covert pivoting (steering) phase, supporting detection before explicit harmful requests.

Figure 2: Turn-level label comparison. Left: Synthetic adversarial conversations exhibit benign→pivoting→adversarial escalation. Right: First adversarial turn occurs late (~81%) in synthetic, but much earlier (~26%) in LMSYS, providing complementary generalization.

The synthetic dataset, generated with Qwen3-235B, is augmented with real-world conversations (LMSYS-Chat-1M) and diverse attacks (SafeDialBench). The combined sources support both training and source-ablation evaluation, which is needed to establish the minimal data conditions for practical deployment.

Activation Feature Engineering and Probe Architecture

Activations are extracted as last-token hidden states at a target intermediate-to-late layer ($d=5120\text{--}8192$) for each user turn. Five scalar features are computed to summarize the trajectory dynamics:

Drift magnitude
Cosine similarity between sequential activations
Cumulative drift (total path length)
Drift acceleration
Mean drift
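
The five scalars can be sketched in a few lines of plain Python. This is an illustrative reading, not the paper's code: the summary names the features but does not give formulas, so the definitions below (drift as the L2 norm of the turn-to-turn difference, acceleration as the change in drift, mean drift as cumulative drift divided by the number of steps) are assumptions, and `trajectory_scalars` is a hypothetical helper name.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b):
    denom = l2_norm(a) * l2_norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return sum(x * y for x, y in zip(a, b)) / denom

def trajectory_scalars(activations):
    """Return one dict of five scalars per turn, starting at the second turn.

    activations: list of per-turn activation vectors (lists of floats).
    All formulas are assumed definitions, not taken from the paper.
    """
    features = []
    cumulative = 0.0
    prev_drift = None
    for t in range(1, len(activations)):
        prev, curr = activations[t - 1], activations[t]
        delta = [c - p for c, p in zip(curr, prev)]
        drift = l2_norm(delta)                      # drift magnitude
        cumulative += drift                         # total path length so far
        accel = 0.0 if prev_drift is None else drift - prev_drift
        features.append({
            "drift": drift,
            "cosine": cosine_similarity(prev, curr),
            "cumulative_drift": cumulative,
            "acceleration": accel,
            "mean_drift": cumulative / t,           # average step size so far
        })
        prev_drift = drift
    return features

# Toy 2-d "activations" for four turns; real vectors have thousands of dims.
feats = trajectory_scalars([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
```

In this toy trajectory the first step has zero drift, and the later steps increase the cumulative drift, which is the quantity the summary identifies with total path length.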

Layer ablation confirms the approach is not sensitive to layer choice when using trajectory features. These scalars are concatenated either with the raw activations or with a learned contrastive embedding (via a 2-layer MLP, producing a style-invariant 128d representation), then classified with XGBoost.
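
To make the feature layout concrete, the sketch below builds a 133-dimensional probe input (128-dim embedding plus 5 scalars, as in Figure 1). It is a shape illustration only: the weights are random stand-ins for a trained contrastive MLP, the hidden width and ReLU nonlinearity are assumptions, the input dimension is shrunk from 5120 to keep the demo fast, and the classifier itself (XGBoost in the paper) is omitted.

```python
import random

def make_layer(n_in, n_out, rng):
    # Random weights stand in for trained parameters (assumption).
    scale = 1.0 / n_in ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]

def linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def embed(activation, w1, w2):
    """Two-layer MLP: linear -> ReLU -> linear (architecture details assumed)."""
    return linear(w2, relu(linear(w1, activation)))

def probe_features(activation, scalars, w1, w2):
    """Concatenate the frozen embedding with the five trajectory scalars."""
    return embed(activation, w1, w2) + list(scalars)

rng = random.Random(0)
D_IN, D_HIDDEN, D_EMBED = 64, 32, 128   # D_IN shrunk for the demo
w1 = make_layer(D_IN, D_HIDDEN, rng)
w2 = make_layer(D_HIDDEN, D_EMBED, rng)

activation = [rng.gauss(0.0, 1.0) for _ in range(D_IN)]
scalars = [1.4, 0.2, 3.4, 0.6, 1.1]     # made-up values for illustration
features = probe_features(activation, scalars, w1, w2)
print(len(features))
```

Because the embedding is frozen at probe-training time, only the downstream classifier has to be refit when new labels arrive, which is consistent with the CPU-only retraining noted in Figure 3.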

Figure 3: Training architecture. Contrastive MLP is trained on (synthetic + real-world) pairs; XGBoost classifies the frozen embedding + scalars. No GPU needed at probe training/retraining time.

Experimental Results and Analysis

Multi-Turn Detection and Early Warning

On the synthetic held-out set (797 conversations, Gemma 3 27B), adding trajectory scalars boosts conversation-level detection from 76.2% (activations only) to 93.8% at 3.5% FP. Scalars alone yield 89.6% detection but cannot achieve a low FP rate; the activations provide the precision.
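
For readers who want to see how such conversation-level numbers are typically computed, here is a minimal sketch. It assumes a conversation is flagged when any turn's predicted attack probability reaches the threshold; the summary does not state the aggregation rule, so that choice and the toy data are assumptions. The default threshold of 0.5 matches the fixed value mentioned later in the page.

```python
def is_flagged(turn_probs, threshold=0.5):
    """Flag a conversation if any turn reaches the threshold (assumed rule)."""
    return any(p >= threshold for p in turn_probs)

def detection_and_fp_rate(conversations, threshold=0.5):
    """conversations: list of (turn_probs, is_attack) pairs."""
    attacks = [probs for probs, is_attack in conversations if is_attack]
    benign = [probs for probs, is_attack in conversations if not is_attack]
    detected = sum(is_flagged(p, threshold) for p in attacks)
    false_alarms = sum(is_flagged(p, threshold) for p in benign)
    detection_rate = detected / len(attacks) if attacks else 0.0
    fp_rate = false_alarms / len(benign) if benign else 0.0
    return detection_rate, fp_rate

# Made-up per-turn probabilities: two attack and four benign conversations.
toy = [
    ([0.1, 0.2, 0.9], True),
    ([0.1, 0.3, 0.4], True),
    ([0.0, 0.1, 0.1], False),
    ([0.2, 0.6, 0.1], False),
    ([0.0, 0.0, 0.2], False),
    ([0.1, 0.1, 0.1], False),
]
rates = detection_and_fp_rate(toy)
```

On this toy set one of two attacks is caught and one of four benign conversations is flagged, so detection is 0.5 and the false positive rate is 0.25.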

Figure 4: Trajectory traces. The probability of adversarial intent rises sharply near the adversarial phase; benign trajectories remain near zero.

Early attack detection is highly sensitive to the length and clarity of the pivoting (steering) phase and is impossible to achieve with binary (safe/unsafe) turn or conversation labels. With extended pivoting (mean 6.7 turns), early detection rises to 66–83% (mean lead time +1.2–1.6 turns), compared to only 22–26% (lead time +0.1–0.3) for unextended data.
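
Lead time can be made concrete with a small sketch. It assumes lead time is the number of turns between the probe's first flag and the first turn labeled adversarial, with positive values meaning the flag came first; the paper's exact definition is not given in this summary, and `lead_time` is a hypothetical helper.

```python
def first_index(items, predicate):
    for i, item in enumerate(items):
        if predicate(item):
            return i
    return None

def lead_time(turn_probs, turn_labels, threshold=0.5):
    """Turns of warning before the first adversarial turn (assumed definition).

    Returns None if the probe never fires or no turn is adversarial.
    """
    first_flag = first_index(turn_probs, lambda p: p >= threshold)
    first_adv = first_index(turn_labels, lambda lab: lab == "adversarial")
    if first_flag is None or first_adv is None:
        return None
    return first_adv - first_flag

labels = ["benign", "benign", "pivoting", "pivoting", "adversarial"]
early = lead_time([0.02, 0.05, 0.30, 0.70, 0.95], labels)   # fires while pivoting
on_time = lead_time([0.00, 0.10, 0.20, 0.30, 0.90], labels) # fires at the attack
```

Under this definition a flag during the pivoting phase gives a positive lead time, which is only measurable when turns carry three-phase labels.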

Figure 5: Early detection rate rises with pivoting-phase length. Models offer strong early detection only when trained with sufficient granularity.

Generalization, Data Requirements, and Robustness

Cross-model replication on four 24B–70B LLMs demonstrates high within-architecture reliability (89–96% detection, 0.5–2.0% FP), while cross-model scalar-only transfer fails (off-diagonal F1 ≈ 50%), necessitating model-specific data and probe fitting.

Figure 6: Detection generalizes across all four major LLM families when probes are trained per-model; no effective transfer without this calibration.

Leave-one-source-out ablations exhibit catastrophic breakdown on the held-out distribution, establishing that synthetic, LMSYS, and SafeDialBench data each contribute essential attack/benign patterns for deployment-quality generalization. Importantly, three-phase turn-level labels are critical—binary conversation-level labels yield degenerate probes with 50–59% FP.
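
A leave-one-source-out split is straightforward to sketch. The record layout (a dict with a `source` key) and the source names used in the demo are assumptions made for illustration; only the evaluation protocol itself comes from the text above.

```python
def leave_one_source_out(records):
    """Yield (held_out_source, train, test), holding out each source in turn."""
    sources = sorted({r["source"] for r in records})
    for held_out in sources:
        train = [r for r in records if r["source"] != held_out]
        test = [r for r in records if r["source"] == held_out]
        yield held_out, train, test

# Hypothetical records; real ones would carry activations and labels.
records = [
    {"id": 0, "source": "synthetic"},
    {"id": 1, "source": "synthetic"},
    {"id": 2, "source": "lmsys"},
    {"id": 3, "source": "safedialbench"},
    {"id": 4, "source": "lmsys"},
]
splits = list(leave_one_source_out(records))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Each probe is trained without ever seeing the held-out source, so the test score measures transfer to an unseen attack distribution rather than in-distribution accuracy.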

Figure 7: Feature ablation confirms distributed contribution—no single scalar feature dominates, and activations are critical for low FP.

Probe performance surpasses off-the-shelf text-based tools (PromptGuard, LLM Guard, Lakera), realizing up to $14.9\times$ higher selectivity for the pivoting (steering) phase, critical for early intervention.

Figure 8: LAD achieves superior true positive and dramatically lowered FP rates relative to strong baselines.

Adversarial robustness analysis demonstrates that fully evading the probe would require an attacker to suppress 80–90% of the activation drift, which is operationally equivalent to not steering the model off the benign manifold at all.
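
The intuition can be illustrated with a toy model. The sketch assumes that "suppressing a fraction s of the drift" means shrinking every turn-to-turn step by a factor of (1 - s); the paper's actual suppression procedure may differ, and the trajectory is made up. Under that assumption the total path length shrinks by exactly the same factor.

```python
import math

def path_length(activations):
    """Sum of Euclidean distances between consecutive turn activations."""
    return sum(math.dist(a, b) for a, b in zip(activations, activations[1:]))

def suppress_drift(activations, suppression):
    """Rebuild the trajectory with every step scaled by (1 - suppression)."""
    keep = 1.0 - suppression
    out = [list(activations[0])]
    for prev, curr in zip(activations, activations[1:]):
        step = [keep * (c - p) for c, p in zip(curr, prev)]
        out.append([o + s for o, s in zip(out[-1], step)])
    return out

# Toy 2-d trajectory with steps of length 5, 4 and 3.
traj = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0], [0.0, 0.0]]
original = path_length(traj)
suppressed = path_length(suppress_drift(traj, 0.85))
print(original, suppressed)
```

With 85% suppression the attacker's trajectory covers only 15% of the original distance, which illustrates why the summary equates full evasion with barely steering the model at all.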

Figure 9: Even with targeted drift suppression, achieving undetectable attacks requires nearly nullifying any steering of the LLM, undermining the adversarial objective itself.

Mechanistic Explanations and Theoretical Implications

LAD establishes that adversarial intent is encoded in the process of steering the LLM’s activations through high-dimensional space. The discriminative signal rests on trajectory dynamics—cumulative path length, drift acceleration—rather than on any localizable rank-1 content direction or interpretable SAE feature.

SAE ablations further indicate that no individual content feature is load-bearing; classification is robust to large-scale content-ablating interventions, supporting the orthogonality of trajectory- and meaning-based signals.

Figure 10: Ablating top-K SAE features has negligible effect; detection exploits distributed trajectory dynamics rather than content features.

Deployment, Adaptation, and Continuous Learning

A practical deployment architecture is outlined where the probe is trained per-model, supports rapid CPU re-fitting, and adapts continually as production labels accumulate and distributions shift. Critical operational requirements include white-box model access, activation caching, class rebalancing for reducing FP, and automated or human-in-the-loop labeling for drift tracking.
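
One common way to implement class rebalancing is inverse-frequency weighting, sketched below. The summary does not say which rebalancing scheme the paper uses, so this is an assumed stand-in; the formula is the widely used "balanced" heuristic, n_samples / (n_classes * class_count), and the label counts are made up.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights so that each class contributes equally in total."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Made-up turn labels: benign turns dominate, as they would in production logs.
labels = ["benign"] * 8 + ["pivoting"] + ["adversarial"]
weights = inverse_frequency_weights(labels)
sample_weights = [weights[label] for label in labels]
```

The resulting per-sample weights can be passed to any classifier that accepts them during fitting, and recomputed cheaply each time the probe is refit on newly accumulated labels.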

This sets the groundwork for a unified detection and conditional steering architecture based on low-rank subspace interventions, closing the loop between mechanistic detection and real-time model correction.

Limitations and Future Directions

The requirement for white-box access precludes application to closed or hosted LLMs. Cold-start performance depends on early availability of labeled in-distribution data, and probes do not transfer across model architectures. Mechanistic interpretability remains at the aggregate process level; future work should resolve the specific circuits implicated and extend the approach to rank-$r$ subspace steering.

Conclusion

Latent Adversarial Detection demonstrates that activation-space trajectory features—e.g., cumulative drift—offer a robust, theoretically well-motivated mechanism for detecting multi-turn prompting attacks in LLMs, outperforming text-based systems, and offering natural support for early intervention. The approach establishes strong baselines for dataset curation and label granularity, and clarifies critical requirements for deployment in dynamic adversarial environments. This work also exposes a promising research frontier at the intersection of mechanistic interpretability, adversarial robustness, and adaptive, process-level LLM defense.

Figure 11: Example gradual escalation attack: trust-building (green), subtle steering (yellow), and overt adversarial prompts (red) identified by phase labels and flagged by the probe, often before harmful content appears.

Explain it Like I'm 14

What is this paper about?

This paper is about spotting sneaky, multi-step attacks on chatbots (large language models, or LLMs) by watching how the model’s “inside signals” change during a conversation. Instead of judging only the words a person types, the authors read the model’s internal “thought patterns” and look for a special kind of movement they call adversarial restlessness. This restlessness shows up when an attacker slowly steers the model from normal talk to harmful actions over several turns.

What questions did the researchers ask?

They focused on simple, practical questions:

Can we detect a multi-turn attack early, even if each message looks harmless on its own?
Is there a reliable “signature” inside the model that appears during an attack?
Can a small set of easy-to-measure numbers from the model’s internal state catch these attacks?
Will this work across different model sizes and families?
What kind of training data and labels are needed to make this reliable in the real world?

How did they study it? (Methods in everyday language)

Think of a conversation with a chatbot as a path on a map. After each turn (each user message), the model has an internal “snapshot” of what it understands so far. These snapshots are called activations. Stringing them together over time makes a path—an activation trajectory.

In normal chats, this path is short and smooth. In multi-step attacks, the path wiggles and stretches as the attacker guides the model through phases: building trust, pivoting (steering), then escalating to something harmful. The authors measure how “far” and how “restless” that path becomes.

From each turn, they read the model’s internal snapshot and compute five simple numbers (like a fitness tracker for the model’s brain):

How big was the change since the last turn? (drift)
How similar is the new state to the previous one? (similarity)
How much total change has built up so far? (total distance traveled)
Is the amount of change speeding up or slowing down? (acceleration)
On average, how big have the changes been up to now? (mean drift)

They then train a lightweight detector (think of it like a smart alarm) to flag a conversation if these numbers—and the current internal snapshot—look like an attack path rather than a normal chat. They test this on:

A carefully built synthetic dataset that labels each turn as benign, pivoting, or adversarial (this is rare and very helpful).
Real conversations from the wild (where labels are messier).
A mix of sources to see what kind of training best matches real use.

They also check multiple model families (e.g., different brands/sizes) to see what carries over.

What did they find, and why does it matter?

Here are the most important takeaways:

Adversarial restlessness is real: Attack conversations make the model’s internal path much longer and more jittery than normal conversations. Measuring the path’s shape (those five numbers) works well.
Big detection boost with small effort: Adding just those five numbers to the internal snapshot raises detection on synthetic tests from about 76% to about 94%, while keeping false alarms low.
Early warning is possible: When attacks include a longer “pivoting” phase (the subtle steering-before-harm), the detector often flags the problem a turn or more before the harmful request. Early alerts give time to intervene.
Works across model families, but not plug-and-play: The same idea works on different big models (24B–70B parameters), but each model still needs its own trained detector. A detector trained for one architecture doesn’t simply transfer to another.
Real-world data matters: Training only on synthetic data performs poorly on real chats (too many false alarms). Mixing three sources (synthetic, a real-world chat dataset, and a curated safety benchmark) brought detection to about 89% with about 2.4% false positives on a combined held-out test set.
Good labels are key: Turn-by-turn labels with three phases (benign, pivoting, adversarial) are crucial. If you only label whole conversations as good or bad, false positives shoot up (50–59%). Fine-grained labels help the detector learn what early steering looks like.
Better than common text-only filters: Off-the-shelf text defenses either miss multi-turn attacks or flag too much. This activation-based method strikes a better balance—especially at catching the subtle pivoting phase without crying wolf.
Harder to evade: To slip past the detector, an attacker would have to hide most of the internal drift (around 80–90%), which is challenging since it’s tied to the model’s evolving understanding, not just the words used.

Why it matters: Today’s attackers don’t always ask for something bad right away—they warm up the model first. Spotting that warm-up by tracking the model’s inner changes gives defenders an early, more reliable warning system.

What’s the bigger picture? (Implications and impact)

A new layer of defense: Instead of filtering only the text, watch the model’s internal state. This helps catch attacks that “sound” harmless but feel wrong to the model.
Early intervention: If you can detect the pivoting phase, you can step in sooner—warn a human, block the next step, or even gently steer the model back to safe behavior.
Practical deployment path: The detector is lightweight once activations are recorded, and it can be retrained quickly as new examples arrive. But it needs:
- White-box access (you must be able to read the model’s internal signals).
- Per-model training (detectors don’t automatically transfer between model architectures).
- Good, phase-level labels (especially to learn what early steering looks like).
- Some real-world data from your own environment to keep false positives low.
Complementary to text tools: Activation-based detection and text-based filters can work together. Text tools flag suspicious words; activation tools flag suspicious steering. Together, they are stronger.

In short, the paper shows that multi-turn jailbreaks leave a clear, measurable “restlessness” inside a model. By tracking how the model’s internal state moves across turns, a small set of simple numbers can reliably spot attacks early, across different models, and with few false alarms—provided you train with the right kind of data and labels.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of unresolved issues that are missing, uncertain, or left unexplored in the paper. Each item is phrased to be concrete and actionable for follow‑up research.

White‑box requirement: How to adapt LAD to black‑box/API models where residual stream access is unavailable; can proxy signals (e.g., logit lens, external embeddings, response‑side activations) approximate the activation trajectory features?
Model‑specificity and transfer: Why do scalar dynamics and probes fail to transfer across architectures, and can alignment techniques (e.g., canonical correlation alignment, Procrustes, or subspace matching across layers) enable cross‑model or cross‑version transfer?
Layer and token-position dependence: The paper uses a single middle‑to‑late layer and the last token; how do trajectory signals vary across layers, multiple token positions, and pooled representations, and is there an optimal aggregation strategy that improves robustness?
Chat template and system‑prompt confounds: To what extent do different chat templates, system prompts, and role tokens inflate $C_t$ or $\lVert \Delta_t \rVert$; can template‑invariant preprocessing or normalization mitigate this?
Length and topic drift normalization: Beyond removing absolute turn index, what length‑ or topic‑aware normalizers (e.g., per‑topic baselines, turn‑wise z‑scoring) reduce false positives from benign but exploratory or multi‑topic dialogues?
Intent‑isolated shift not implemented: The proposed “intent‑isolated shift” $\Delta_t^{\text{intent}}=\Delta_t^{\text{attack}}-\Delta_t^{\text{control}}$ requires topic‑matched controls; how to construct scalable, high‑coverage benign controls and does this materially reduce content confounds and FPs?
Generalization beyond LMSYS/SafeDialBench: How does LAD perform on additional real‑world, multi‑turn datasets (e.g., MHJ, enterprise logs, multilingual forums), especially outside English and security topics?
Multilingual and code settings: Does adversarial restlessness hold in non‑English languages, mixed‑language chats, and code‑heavy sessions; are language/tokenization differences a confound for trajectory features?
Tool use and multi‑modal contexts: How does LAD operate when attacks route through tools, function calls, or non‑text modalities (images, audio); where should activations be sampled in these pipelines?
Single‑turn and very‑short pivoting attacks: The method is optimized for multi‑turn attacks; what is the detection power for single‑turn jailbreaks or attacks with one‑turn pivoting, and can additional features address this gap?
Probe‑aware adversaries: Robustness is only evaluated via synthetic drift suppression; how does LAD fare against attackers explicitly optimizing to minimize trajectory features (e.g., gradient‑based activation steering, RLHF‑style training to evade the probe)?
Evasion vs. utility trade‑off: What is the minimal drift attenuation required for an attacker to evade detection while still achieving jailbreak success, measured on real models with probe‑aware generation?
Error analysis on real data: Which real‑world attack types are missed or cause false alarms; provide qualitative analyses of failure cases to identify systematic blind spots (e.g., heated but benign debates, rapid topic shifts).
Calibration and thresholding: Detection uses a fixed $\theta{=}0.5$; how do ROC/PR trade‑offs vary across deployments, and can cost‑sensitive or calibrated thresholds improve early detection at acceptable FP rates?
Sequential modeling vs. scalar augmentation: The paper avoids RNNs/GRUs on activations; does a lightweight temporal model over per‑turn activations (or over the 5 scalars) capture longer‑range dependencies and reduce misses without large latency costs?
Sample‑efficiency and cold‑start: What are the learning curves for per‑model deployment—how many labeled conversations are required to reach target FP/detection rates, and can active learning or weak supervision reduce this burden?
Label quality for LMSYS: Per‑turn labels derive from OpenAI moderation flags (no pivoting); quantify label noise and its effect on training, and explore weak‑labeling heuristics or annotator workflows to obtain pivoting labels in the wild.
Stability to model updates: How sensitive are probes to model fine‑tuning, quantization, or provider updates; can unsupervised drift monitoring on activation distributions trigger timely retraining and maintain performance?
Mechanistic localization: Which attention heads/MLP blocks causally drive adversarial restlessness; can circuit tracing and SAE decompositions attribute $C_t$ increases to interpretable features and validate causal interventions?
Detection-to-intervention pipeline: The paper proposes steering as future work; does coupling LAD with conditional activation steering reduce harmful outputs while preserving benign utility, and what are the side‑effects (over‑refusal, latency)?
Rank‑$r$ subspace defense: The proposed low‑rank subspace approach is untested; what $r$ and training objectives best capture multi‑directional attack structure, and can this subspace support both detection and corrective steering reliably?
Deployment latency and cost: What is the runtime overhead of activation extraction and classification in production (per‑token/turn latency, memory footprint), and how does this scale with context length and model size ($\geq$ 70B)?
Privacy and governance of activation logs: What policies and technical safeguards (e.g., on‑device caching, differential privacy) are needed to store/use activations for retraining without exposing sensitive user content?
Benchmark breadth and reproducibility: Current held‑out sets are relatively small and code is gated; larger, diverse, and openly replicable benchmarks—with standardized metrics for lead time and phase selectivity—are needed to validate claims.
Template/domain portability: Does a probe trained on one domain/template (e.g., customer support) transfer to others (education, healthcare), or is per‑domain fine‑tuning necessary to avoid FPs from domain‑specific discourse patterns?
Tokenization and position‑encoding effects: Do differences in tokenization schemes and positional embeddings across models influence trajectory geometry in ways that confound the features; can feature engineering adjust for this?
Mixed benign hard cases: Which benign scenarios (e.g., security education, red‑team simulations, exploratory planning) remain high‑FP after mixed training, and can curated hard‑negative mining systematically reduce these errors?
Long‑context behavior: The study truncates to 4,096 tokens; do very long contexts ($\gg$ 8k) cause accumulation of benign drift that erodes specificity, and are sliding‑window or decay‑weighted features necessary?
Safety/ethics of scalar-only alarms: Scalars alone show high detection but high FP; can we design deployment‑grade triage that uses scalars for early warning and gated activation features for confirmation to balance cost and precision?
Cross‑family synthetic generation bias: Synthetic data is generated with a large Qwen model; does using the same or related families bias activation trajectories and inflate performance on Qwen‑like architectures?
Multisource mixing strategies: Leave‑one‑source‑out shows each source matters; what mixing ratios, curriculum schedules, or domain‑adaptive training minimize FPs while preserving generalization to new sources?
Tool‑interaction boundaries: Activations are extracted at user turns; for agent/tool frameworks, which boundaries (pre‑tool call, post‑tool result, planner steps) yield the most discriminative trajectory features?
Safety‑critical calibration: How to set conservative thresholds for high‑stakes applications (e.g., tool control) that guarantee a minimum lead time while bounding task degradation on benign conversations?
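
The intent-isolated shift named in the list above, $\Delta_t^{\text{intent}}=\Delta_t^{\text{attack}}-\Delta_t^{\text{control}}$, can be sketched as follows. The list notes that the paper proposes but does not implement it, so this is purely illustrative: it assumes $\Delta_t$ is the difference between consecutive turn activations and that the attack and its topic-matched control are aligned turn by turn, and the toy vectors are made up.

```python
def turn_shifts(activations):
    """Per-turn shift: difference between consecutive activations (assumed)."""
    return [[c - p for c, p in zip(curr, prev)]
            for prev, curr in zip(activations, activations[1:])]

def intent_isolated_shifts(attack_acts, control_acts):
    """Subtract the control's shift from the attack's shift at each turn.

    Both conversations are truncated to the shorter length; aligning turns
    by index is an assumption of this sketch.
    """
    n = min(len(attack_acts), len(control_acts))
    attack = turn_shifts(attack_acts[:n])
    control = turn_shifts(control_acts[:n])
    return [[a - c for a, c in zip(da, dc)]
            for da, dc in zip(attack, control)]

# Toy 2-d activations: the control moves along one axis only.
attack_acts = [[0.0, 0.0], [1.0, 1.0], [3.0, 1.0]]
control_acts = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
intent = intent_isolated_shifts(attack_acts, control_acts)
```

Whatever movement the attack shares with its benign control cancels out, leaving only the residual that the control does not explain.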


Practical Applications

Below is a focused synthesis of practical, real-world applications that follow from the paper’s findings, methods, and innovations (activation-level, multi-turn detection via “adversarial restlessness”). Each entry names target sectors, sketches potential tools/workflows, and notes key assumptions or dependencies.

Immediate Applications

Activation-level guardrail for self-hosted LLMs and agents
- Sectors: software, cybersecurity, enterprise IT, DevOps
- What: Integrate LAD-style probes (XGBoost + five trajectory scalars with optional contrastive encoder) into the model server to continuously score multi-turn conversations and block or sandbox risky turns (especially during pivoting phases).
- Tools/workflows: “LAD Probe SDK” as a server-side hook on a mid-to-late decoder layer; a thresholded policy that disables tool calls (e.g., code execution, shell, database) when P(adv) > θ; audit logging for flagged turns.
- Assumptions/dependencies: White-box access to the target model; per-model training on representative deployment data (synthetic + in-distribution logs); safe storage for cached activations; minimal latency overhead from hooks.
Early gating of tool/API calls in agent frameworks
- Sectors: autonomous agents, MLOps, RPA
- What: Use activation trajectory scores to gate high-impact actions (tool use, file I/O, network, finance ops) before the overt malicious step.
- Tools/workflows: “Tool Gating Middleware” that checks trajectory score before action planners execute; deny/allow lists and cool-off timers when pivoting is detected.
- Assumptions/dependencies: Accurate turn boundaries (chat template consistency); per-agent-model probes; policy design for false positives (e.g., human-in-the-loop override).
RAG and browsing protection against prompt injection
- Sectors: software, search, product support, documentation portals
- What: Detect restlessness increases after ingesting external content, indicating context poisoning or indirect injection; quarantine suspect chunks or switch to safer retrieval.
- Tools/workflows: “Retriever Guard” that computes delta features pre-/post-chunk insertion; on trigger, drop or re-rank sources, strip instructions from retrieved text, or fall back to summarization-only mode.
- Assumptions/dependencies: White-box model control within the RAG stack; per-domain calibration (e.g., technical vs consumer content); mixed training data to control false positives.
Customer support and enterprise chatbots: selective escalation
- Sectors: customer service, HR IT helpdesks, internal knowledge assistants
- What: Trigger human takeover when the conversation’s trajectory indicates pivoting/escalation, reducing both jailbreak risk and agent misuse.
- Tools/workflows: “Trajectory Monitor Dashboard” in the agent console; automatic escalation tickets with activation snapshots.
- Assumptions/dependencies: Labeled benign examples from real deployments (to keep FP ≈ 2–4%); privacy controls for activation telemetry.
SOC/SIEM integration for LLM security monitoring
- Sectors: cybersecurity, finance, healthcare (regulated environments)
- What: Treat activation-drift spikes as security signals; forward to SIEM for correlation with other events (e.g., anomalous tool usage, DLP triggers).
- Tools/workflows: “Activation Telemetry Connector” exporting per-turn scalar features and alert states; SOC playbooks for containment (disabling high-risk tools/accounts).
- Assumptions/dependencies: Governance for storing internal model states; mapping alerts to incident severity; staff training on restlessness semantics.
Red-teaming and model evaluation at deployment
- Sectors: model vendors, platform providers, enterprise AI teams
- What: Use the provided multi-turn datasets (including pivoting labels) to stress-test models and measure phase-selective detection under realistic attack distributions.
- Tools/workflows: “Probe Training Harness” using cached activations for fast CPU-only retraining; leave-one-source-out evaluations to avoid overfitting to synthetic patterns.
- Assumptions/dependencies: Access to synthetic + internal logs for coverage; strict separation of evaluation from training data; gated access to released datasets.
Complementary filter to text-level defenses
- Sectors: any LLM-integrated product
- What: Combine activation-level probes with existing text scanners to cut down false positives (text filters catch topical risks; LAD catches intent drift).
- Tools/workflows: Ensemble routing—only trigger blocks when activation-level risk stays elevated over k turns; separate thresholds for “pivoting” vs “adversarial” phases.
- Assumptions/dependencies: Threshold calibration per domain; tracking conversation state across retries and re-prompts.
On-device or air-gapped deployments with open models
- Sectors: defense, critical infrastructure, privacy-sensitive enterprises
- What: Run probes entirely within secure boundaries (no cloud calls) to enforce policy in constrained/regulated environments.
- Tools/workflows: “Edge LAD Module” compiled with the model server; secure ring buffer for activation scalars; periodic offline retraining on cached data.
- Assumptions/dependencies: Sufficient memory/latency budget; compliance review for activation logging even on-device.
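
Two of the policies above, disabling tool calls when P(adv) > θ and acting only when risk stays elevated over k turns, can be combined in a small gate. This is a hypothetical sketch: `ToolGate` is not an API from the paper or from any library, and the default threshold and window are illustrative values.

```python
from collections import deque

class ToolGate:
    """Disable tool calls once the last k turns all exceed the threshold."""

    def __init__(self, threshold=0.5, k=2):
        self.threshold = threshold
        self.recent = deque(maxlen=k)   # rolling window of per-turn flags

    def update(self, p_adv):
        """Record the latest per-turn score; return True if tools stay enabled."""
        self.recent.append(p_adv > self.threshold)
        window_full = len(self.recent) == self.recent.maxlen
        blocked = window_full and all(self.recent)
        return not blocked

gate = ToolGate(threshold=0.5, k=2)
scores = [0.1, 0.7, 0.4, 0.8, 0.9]      # made-up per-turn probe outputs
allowed = [gate.update(p) for p in scores]
print(allowed)
```

Requiring k consecutive elevated turns trades a little detection latency for fewer one-off false alarms; a deployment could pair it with a stricter immediate block for very high scores.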

Long-Term Applications

Automatic intervention via activation steering
- Sectors: software, agents, cybersecurity
- What: When restlessness spikes, apply low-rank steering (e.g., rank-r subspace projection) to push activations back toward a safe manifold, not just detect.
- Tools/products: “Activation Guard + Steering” that couples detection to conditional steering policies; closed-loop guardrails.
- Assumptions/dependencies: Robust identification of safe subspaces; careful evaluation to prevent over-refusal and preserve utility.
Provider-level “Activation Guard” APIs and standards
- Sectors: model platforms, cloud providers
- What: Standardized, audited APIs exposing safe, privacy-aware activation telemetry or derived scalars for defense tools.
- Tools/products: Cloud-native “Activation Telemetry” with opt-in controls; standardized schema for cumulative drift, acceleration, and risk scores.
- Assumptions/dependencies: Industry consensus on privacy and IP boundaries; performance and cost management; regulator guidance.
Cross-model transfer and universal encoders
- Sectors: academia, model vendors
- What: Research to build contrastive encoders or subspace alignments that transfer restlessness detection across architectures without per-model retraining.
- Tools/products: “Universal Trajectory Encoder” services; foundation probes that fine-tune with minimal data.
- Assumptions/dependencies: Larger multi-source, multi-model datasets; alignment across varying residual geometries; overcoming current poor off-diagonal transfer.
Black-box-aligned variants for closed APIs
- Sectors: SaaS users of proprietary LLMs
- What: Combine external embeddings and stateful detectors (e.g., DeepContext-like) with calibrated proxies for activation dynamics to approximate restlessness signals.
- Tools/products: “Stateful Semantic Drift Guard” packaged as a middleware for API-based apps; hybrid ensembles with rate-limited, model-provided metadata if available.
- Assumptions/dependencies: Cooperation from providers or reliable proxies; potentially reduced precision vs white-box LAD.
Domain-specialized detectors (healthcare, finance, ICS)
- Sectors: healthcare, finance, energy/ICS, legal
- What: Train per-sector probes on in-distribution dialogs to distinguish benign sensitive use (e.g., medical triage) from adversarial steering (e.g., policy evasion, fraud).
- Tools/products: “Sector LAD Packs” with curated datasets and thresholds; integration with compliance workflows (e.g., HIPAA, SOX).
- Assumptions/dependencies: High-quality, privacy-compliant labeled logs; careful FP control to avoid blocking legitimate expert queries.
Robust agent ecosystems with restlessness-aware planners
- Sectors: robotics, RPA, autonomous operations
- What: Integrate trajectory risk as a first-class signal in planners and tool routers (e.g., block physical actuation or financial transfers if risk spikes mid-plan).
- Tools/products: “Risk-Aware Planning SDK” that conditions search/act loops on activation drift; safe fallback plans.
- Assumptions/dependencies: Real-time constraints in robotics/ICS; rigorous safety validation; human oversight pathways.
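
A planner could consume the risk score in many ways. The sketch below shows one simple possibility: stop before any high-stakes step whose current risk crosses a threshold. The action names, the `HIGH_STAKES` set, and the 0.5 default threshold are hypothetical placeholders, not part of the paper.

```python
# Hypothetical set of actions that should never run while risk is elevated.
HIGH_STAKES = {"actuate", "transfer_funds"}


def run_plan(steps, risk_at_step, threshold=0.5):
    """Execute steps in order; halt before a high-stakes step if risk is high.

    risk_at_step[i] is the detector's risk score observed just before steps[i].
    """
    executed = []
    for step, risk in zip(steps, risk_at_step):
        if step in HIGH_STAKES and risk >= threshold:
            # Hand off to a safe fallback or a human reviewer instead of acting.
            return executed, "halted_before:" + step
        executed.append(step)
    return executed, "completed"


done, status = run_plan(
    ["lookup_account", "draft_summary", "transfer_funds"],
    [0.10, 0.35, 0.90],
)
```

Low-stakes steps still run under elevated risk in this sketch; only the irreversible action is gated, which keeps the planner useful while routing the risky decision to oversight.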
Regulation and assurance frameworks for AI deployments
- Sectors: policy, risk/compliance, regulated industries
- What: Include trajectory-based, multi-turn detection capabilities in assurance standards; require monitoring and incident reporting for activation-drift anomalies.
- Tools/products: “Restlessness Audit Reports” as part of model assurance; certification checklists for multi-turn defense.
- Assumptions/dependencies: Standard-setting bodies’ buy-in; documented governance for activation data retention and privacy.
Privacy-preserving activation telemetry
- Sectors: policy, platform providers, enterprise IT
- What: Differential privacy or secure enclave processing for activation features to enable monitoring without exposing sensitive internal states.
- Tools/products: “Private Probe Runtime” that emits only aggregated scalars with privacy budgets; enclave-isolated retraining.
- Assumptions/dependencies: Formal privacy guarantees; performance overhead acceptance.
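
One standard way to release an aggregated scalar under a privacy budget is the Laplace mechanism. The sketch below applies it to the mean of a clipped per-conversation scalar. The choice of mechanism, the clipping bound, and the function name `private_mean` are assumptions made for illustration; the summary only mentions differential privacy in general terms, and this sketch does not track budget composition across repeated releases.

```python
import random


def private_mean(values, clip, epsilon, rng):
    """Release the mean of values clipped to [0, clip] with Laplace noise.

    Replacing one value changes the clipped mean by at most clip / n, so
    Laplace noise with scale clip / (n * epsilon) gives epsilon-DP for a
    single release.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if clip <= 0 or epsilon <= 0:
        raise ValueError("clip and epsilon must be positive")
    clipped = [min(max(v, 0.0), clip) for v in values]
    true_mean = sum(clipped) / len(clipped)
    scale = clip / (len(clipped) * epsilon)
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_mean + noise


rng = random.Random(0)
drifts = [4.2, 3.9, 12.5, 5.1]  # toy cumulative-drift values
released = private_mean(drifts, clip=10.0, epsilon=1.0, rng=rng)
```

Clipping is what makes the sensitivity bounded; without it a single extreme conversation could dominate the released value and no finite noise scale would suffice.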
Training-time alignment using restlessness as a loss signal
- Sectors: model vendors, research labs
- What: Penalize excessive activation path length during fine-tuning/RLHF for benign scenarios to reduce spurious restlessness and improve downstream detector precision.
- Tools/products: “Trajectory-Regularized Fine-Tuning” recipes; benchmarks tying restlessness to safety outcomes.
- Assumptions/dependencies: Avoid collapse into over-refusal; careful multi-objective tuning (helpfulness vs safety).
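
One way to write this idea down, as a sketch rather than an objective taken from the paper: assume $\Delta_t^{(c)} = \mathbf{v}_t^{(c)} - \mathbf{v}_{t-1}^{(c)}$ is the change in the probed activation between consecutive turns of conversation $c$, $\mathcal{L}_{\text{task}}$ is the usual fine-tuning loss, $\mathcal{B}$ is the set of benign conversations in a batch, $T_c$ is the number of turns in conversation $c$, and $\lambda \ge 0$ is a hypothetical regularization weight.

```latex
\mathcal{L}
  \;=\;
  \mathcal{L}_{\text{task}}
  \;+\;
  \lambda \cdot \frac{1}{|\mathcal{B}|}
  \sum_{c \in \mathcal{B}} \sum_{t=2}^{T_c}
  \bigl\| \Delta_t^{(c)} \bigr\|
```

The inner sum is the path length of one benign conversation, so the penalty is the mean benign path length in the batch. Restricting the sum to $\mathcal{B}$ reflects the "for benign scenarios" qualifier: adversarial conversations are left unpenalized so the detection signal is not trained away.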
Hardware/runtime optimizations for streaming trajectory features
- Sectors: hardware vendors, edge AI, hyperscalers
- What: Add lightweight kernels for extracting and aggregating last-token residuals and scalars with minimal latency.
- Tools/products: “Trajectory Feature Cores” or fused ops in inference engines; observability hooks at tensor runtime.
- Assumptions/dependencies: Ecosystem-wide support in inference stacks; measurable ROI on latency/security trade-offs.

Notes on feasibility and deployment dependencies common across applications:

- White-box access is currently required for LAD as presented; probes are model-specific and must be trained on representative, in-distribution conversations. Three-phase turn-level labels (benign/pivoting/adversarial) substantially reduce false positives; in practice, this often means bootstrapping with synthetic pivoting data plus real benign/adversarial logs.
- Activation logging and caching enable fast iteration but introduce privacy and governance considerations; organizations need policies for secure storage, retention, and audit.
- Thresholds and policies should be tuned to the domain’s risk tolerance; human-in-the-loop escalation is recommended for high-stakes actions.
- Generalization across attack distributions is source-dependent; leave-one-source-out failures emphasize the need for continual adaptation and monitoring for distribution shift.
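
Tuning a threshold to a domain's risk tolerance can be as simple as picking the cutoff that keeps the false positive rate on held-out benign conversations under a target. The sketch below does this for a list of detector scores; the function name, the strict `score > threshold` flagging rule, and the toy scores are assumptions for illustration.

```python
def threshold_for_fpr(benign_scores, target_fpr):
    """Return a threshold such that at most target_fpr of the benign
    scores are strictly greater than it (i.e. would be flagged)."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign conversations we are willing to flag.
    allowed = int(target_fpr * len(ranked))
    # Everything strictly above the (allowed+1)-th highest score is flagged,
    # which is at most `allowed` items (fewer if there are ties).
    return ranked[allowed]


benign = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
threshold = threshold_for_fpr(benign, target_fpr=0.2)
observed_fpr = sum(s > threshold for s in benign) / len(benign)
```

Because the threshold is computed only from benign scores, it should be re-derived whenever the benign traffic distribution shifts, which ties into the monitoring point in the notes above.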

Glossary

Activation drift: A change in a model’s internal representation across turns reflecting shifts in its understanding. "LAD detects activation drift in what the model understood, enabling detection of attacks where surface text appears benign but internal representations shift."
Activation-level signature: A consistent pattern in internal activations indicative of a phenomenon (e.g., adversarial intent). "We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding benign conversations."
Activation steering: Intentionally modifying activations at inference to alter model behavior. "Early pivoting-phase detection enables activation steering~\cite{zou2023representation,li2023iti} on the model's next response—shifting representations away from the adversarial manifold before the attack lands."
Activation trajectory: The sequence of activation vectors over conversation turns viewed as a path in activation space. "From the activation trajectory, we derive five scalar features:"
Adversarial restlessness: Elevated cumulative activation movement caused by multi-phase attacks (trust-building, pivoting, escalation). "These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment."
BF16: Brain floating point 16-bit format used for efficient activation extraction. "All activations are extracted in BF16 and cast to FP32 for numerical stability (BF16 $\to$ FP16 casting causes overflow to inf in hook outputs)."
BERT embeddings: Text representations from the BERT model used for downstream classification. "a GRU over fine-tuned BERT embeddings that tracks 'intent drift' across turns"
bootstrap 95% CI: A nonparametric confidence interval estimated via bootstrapping. "Combined held-out evaluation ($n = 1{,}797$, bootstrap 95\% CI)."
CAST framework: A system for conditional activation steering to modulate model refusals. "The CAST framework \cite{lee2025cast} enables conditional steering for selective refusal."
Chat template: A formatting scheme for multi-turn prompts when feeding text to chat-tuned models. "extract at each user turn boundary after feeding the cumulative context via the model's chat template:"
Circuit tracing: Methods to identify functional pathways in a model’s computation. "Mechanistic interpretability is now practical: SAEs \cite{templeton2024scaling}, circuit tracing \cite{ameisen2025circuit}, and feature steering are deployable—but not yet applied to cybersecurity threats."
Cold-start period: The initial deployment window requiring new labeled data before reliable performance. "deployment requires a cold-start period of labeled data per target model."
Conditional steering: Applying activation steering conditioned on context or intent. "The CAST framework \cite{lee2025cast} enables conditional steering for selective refusal."
Contrastive encoder: A learned projection that maps activations into an intent-focused space by pulling together same-intent examples and pushing apart different-intent ones. "where $f_\phi$ is the frozen contrastive encoder."
Contrastive MLP: A multilayer perceptron trained with contrastive objectives to produce style-invariant embeddings. "Stage 1: a contrastive MLP projects raw activations ($d = 5{,}120$) into a 128-dim style-invariant embedding"
Cumulative drift: The total path length traversed by activation vectors up to a turn. "Trajectory scalars (cumulative drift, drift magnitude) are the top features in every model"
Cumulative path length: The aggregated distance traveled by the activation trajectory, elevated in attacks. "each phase shift moves the activation, producing elevated cumulative path length."
Decoder-only transformer: An architecture that generates outputs autoregressively without an encoder component. "Given a decoder-only transformer with $L$ layers, we hook the output of decoder layer $\ell$"
Drift acceleration: The change in drift magnitude between successive turns. "Drift acceleration: $a_t = \|\Delta_t\| - \|\Delta_{t-1}\|$"
Forward hook: A runtime attachment to capture activations during forward passes. "Activation-level defenses require only a forward hook on any self-hosted model—no special tooling or model cooperation needed."
GRU: Gated Recurrent Unit, a recurrent neural network component for sequential modeling. "a GRU over fine-tuned BERT embeddings that tracks 'intent drift' across turns"
HACCA: Highly Autonomous Cyber-Capable Agents, a threat model for multi-stage cyber campaigns. "HACCAs \cite{iaps2025hacca} use models to conduct multi-stage cyber campaigns autonomously"
hard-negative methodology: Training with challenging non-adversarial examples that closely resemble positives to improve robustness. "to combine non-linear probes with novel-category evaluation, hard-negative methodology, multi-turn analysis, and cross-model comparison."
Inference-Time Intervention: Steering a model’s internal activations during inference to influence outputs. "Inference-Time Intervention \cite{li2023iti} steers activations for truthfulness."
Instruction-tuned: Models fine-tuned on instruction-following data for chat or assistant behavior. "All models use instruction-tuned variants for chat template support."
intent-isolated shift: The drift attributable to adversarial intent after subtracting content effects using a benign control. "the \emph{intent-isolated shift} can be computed as"
last-token extraction: Reading the hidden state at the final token position as a summary of current context. "last-token extraction (recency)"
leave-one-source-out: A validation method holding out one dataset source to test generalization. "leave-one-source-out evaluation shows each of synthetic, LMSYS-Chat-1M, and SafeDialBench captures distinct attack distributions"
McNemar test: A statistical test for paired nominal data; used here to compare false positives. "LAD (89.5\%, 95\% CI: 87.5--91.5\%) achieves $14.9\times$ higher pivoting selectivity than Lakera Guard (McNemar FP: $p < 10^{-100}$)."
Mechanistic interpretability: Techniques to understand how internal components implement behavior. "Mechanistic interpretability is now practical: SAEs \cite{templeton2024scaling}, circuit tracing \cite{ameisen2025circuit}, and feature steering are deployable"
off-diagonal F1: Cross-model transfer performance when training and testing on different architectures. "Off-diagonal F1 averages 50.4\% (near random)"
perplexity filters: Defenses that flag inputs with improbable token sequences under LLMs. "pattern matching, perplexity filters, classifier-based input screening, and multi-stage detection pipelines."
pivoting label: A turn-level annotation marking subtle steering before overtly harmful requests. "The pivoting label---absent from all existing multi-turn safety benchmarks---enables \emph{early detection} during the steering phase before overt attack"
rank-1 signal: Treating an activation vector as a one-dimensional direction for detection or control. "The current probe treats each $\mathbf{v}_t \in \mathbb{R}^d$ as a rank-1 signal."
rank-$r$ subspace: A low-dimensional subspace capturing multiple salient directions for detection/steering. "From rank-1 to rank-$r$ subspace defense."
Representation engineering: Manipulating and reading internal representations to elicit desired properties. "Representation engineering \cite{zou2023representation} established that internal representations can be read and steered for safety-relevant properties."
Residual length confound: A spurious correlation between sequence length and residual stream statistics that can mislead detection. "it introduces a residual length confound even with absolute indexing."
Residual stream: The sequence of residual vectors in a transformer that aggregate layer contributions. "leaves an activation-level signature in the model's residual stream"
SAEs (Sparse Autoencoders): Autoencoders with sparsity constraints used to extract interpretable features from model activations. "SAEs have scaled to frontier models \cite{templeton2024scaling,gao2024scaling}, though downstream task utility remains mixed."
Stiefel manifold: The space of orthonormal matrices used for optimizing low-rank subspaces. "via supervised optimization on the Stiefel manifold"
style-invariant embedding: A representation designed to cluster by intent regardless of conversation style. "into a 128-dim style-invariant embedding where same-intent turns cluster regardless of conversation style."
SVM-RBF: Support Vector Machine with a radial basis function kernel. "trains SVM-RBF classifiers on tensor decompositions"
tensor decompositions: Factorizations of high-dimensional tensors to extract structured components. "trains SVM-RBF classifiers on tensor decompositions"
Trajectory scalars: Summary statistics (e.g., drift, acceleration, path length) derived from activation changes across turns. "concatenated with 5 trajectory scalars (133 features)."
white-box access: Direct access to a model’s internal states and parameters for probing. "The defender has white-box access to the target model's internal activations"
XGBoost: An ensemble gradient boosting algorithm using decision trees. "classifies via XGBoost (300 trees, depth 6, StandardScaler, $\theta = 0.5$, no threshold tuning)"
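
Several glossary entries (drift magnitude, cumulative drift, drift acceleration) describe quantities that are simple to compute once per-turn activation vectors are available. The sketch below computes those three under the assumption that $\Delta_t$ is the difference between consecutive turn vectors and that the norm is Euclidean. The paper uses five trajectory scalars; only the three defined in this glossary are shown, and the function name and output layout are illustrative.

```python
import math


def trajectory_scalars(vectors):
    """Compute per-turn drift features from a list of activation vectors.

    Assumes delta_t = v_t - v_{t-1} and a Euclidean norm. The first turn has
    no predecessor, so output rows start at the second turn (index 1).
    """
    rows = []
    cumulative = 0.0
    prev_mag = 0.0
    for t in range(1, len(vectors)):
        delta = [a - b for a, b in zip(vectors[t], vectors[t - 1])]
        mag = math.sqrt(sum(d * d for d in delta))
        cumulative += mag
        # a_t = ||delta_t|| - ||delta_{t-1}||; defined as 0 for the first delta.
        accel = mag - prev_mag if t > 1 else 0.0
        rows.append({
            "turn": t,
            "drift_magnitude": mag,
            "cumulative_drift": cumulative,
            "drift_acceleration": accel,
        })
        prev_mag = mag
    return rows


# Toy 2-dim trajectory: move, stay put, move again.
rows = trajectory_scalars([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [6.0, 8.0]])
```

In this toy trajectory the cumulative drift ends at 10.0 even though the net displacement from start to end is also 10.0; a conversation that moved away and came back would show a large cumulative drift with a small net displacement, which is the kind of movement the "restlessness" framing is meant to capture.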


