Anomaly Detection in Agentic Trajectories
- Anomaly detection in agentic trajectories is the process of identifying deviations in sequences of actions, observations, or control signals in autonomous systems.
- The methodology integrates sequence modeling, density estimation, and feature engineering to capture semantic, structural, and contextual anomalies.
- Practical applications demonstrate high precision, real-time deployment, and enhanced safety across domains like robotics, transportation, and multi-agent systems.
Anomaly detection in agentic trajectories focuses on identifying deviations from expected action or observation sequences generated by autonomous, semi-autonomous, or AI-augmented agents. Such trajectories can be instantiated as step-wise LLM-driven plans, control sequences for physical agents, or multi-modal spatio-temporal paths in structured or unstructured environments. The domain is characterized by multimodal failure modes—including semantic misalignment with task goals, structural incoherence, context-dependent breakdowns, and physically implausible behaviors—necessitating a convergence of sequence modeling, density estimation, feature engineering, and domain-specific priors.
1. Formalization and Taxonomy
In the agentic context, a trajectory is often represented as a sequence $\tau = (s_1, \dots, s_T)$, where each $s_t$ encodes a discrete action, observation, or control signal. Recent definitions extend this to include ordered trees or call graphs in Multi-Agentic AI, parameterizing each step by actor identity, timestamp, and execution context (Pathak et al., 6 Nov 2025). Anomalies are observations or subsequences exhibiting statistically or semantically significant deviations from a reference distribution of normal behavior. The taxonomy includes:
- Semantic (task-level) anomaly: The sequence is not contextually appropriate for the specified task or goal (Advani, 2 Jan 2026).
- Structural anomaly: The trajectory is malformed, e.g., violates ordering or coherence constraints (Advani, 2 Jan 2026, Pathak et al., 6 Nov 2025).
- Point anomaly: A single time step or action deviates sharply from normal behavior (Zorriassatine et al., 21 Apr 2026).
- Contextual anomaly: An aberration conditional on specific environmental or agent state (Zorriassatine et al., 21 Apr 2026, Hu et al., 2024).
- Collective anomaly: Long-term or subtle drift emerging over extended sequences (Zorriassatine et al., 21 Apr 2026, Rahman et al., 26 Mar 2026).
- Physically infeasible trajectory: Violates explicit physical or system constraints (Sharma et al., 8 Jun 2025).
Labeling regimes leverage rule-based, human-annotated, or fully unsupervised operationalizations, depending on the domain and availability of reference trajectories (Pathak et al., 6 Nov 2025, Ghoreishi et al., 2023).
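The formalization above can be made concrete with a small data structure. The following is a minimal sketch, not taken from any of the cited works: the names `Step`, `Trajectory`, `AnomalyType`, and `is_time_ordered` are hypothetical, and the ordering check is only one simple example of a structural constraint.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class AnomalyType(Enum):
    """Hypothetical labels mirroring the taxonomy listed above."""
    SEMANTIC = "semantic"
    STRUCTURAL = "structural"
    POINT = "point"
    CONTEXTUAL = "contextual"
    COLLECTIVE = "collective"
    PHYSICALLY_INFEASIBLE = "physically_infeasible"


@dataclass
class Step:
    """One trajectory element: an action, observation, or control signal."""
    actor: str                 # identity of the agent that produced the step
    timestamp: float           # seconds, on any consistent clock
    payload: str               # the action/observation itself, as text
    context: dict[str, Any] = field(default_factory=dict)  # execution context


@dataclass
class Trajectory:
    task: str                  # task or goal description the steps should serve
    steps: list[Step]

    def is_time_ordered(self) -> bool:
        """True if timestamps never decrease (one simple structural check)."""
        return all(a.timestamp <= b.timestamp
                   for a, b in zip(self.steps, self.steps[1:]))
```

A trajectory that fails `is_time_ordered()` would be a candidate structural anomaly under this sketch; semantic and contextual anomalies need a learned or rule-based model on top of the representation.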
2. Methodological Approaches
2.1 Representation Learning and Embeddings
Techniques for anomaly detection in agentic trajectories often commence with contextual or task-grounded embedding. In "Trajectory Guard," trajectory steps and the associated task description are encoded via all-MiniLM-L6-v2 and processed through parallel Siamese towers (shared and task-specific GRU/MLP heads) (Advani, 2 Jan 2026). In multimodal and multi-agent domains, segmentations (e.g., fixed-length windows for GRADINGS), graph-based embeddings, or hyperspectral trajectory images are utilized (Dias et al., 2020, Rahman et al., 26 Mar 2026, Mbuya et al., 22 Sep 2025).
2.2 Sequence and Graph Modeling
Recurrent architectures (GRU, LSTM), sequence autoencoders, Transformer-based encoder-decoders, and temporal convolutional networks capture stepwise dependencies, sequential validity, and time-local structure. For agentic plans, sequence-aware decoders or Siamese recurrent autoencoders permit the reconstruction-based assessment of ordering and composition (Advani, 2 Jan 2026, Wiederer et al., 2021, Lyu et al., 28 Jan 2026). Multi-agent and road-constrained environments leverage graph neural networks (GATs), graph-based positional encodings, and map-matching to enforce spatial and contextual consistency (Mbuya et al., 22 Sep 2025).
2.3 Density Estimation and Score Aggregation
Unsupervised detection frequently involves fitting a normal behavior density via normalizing flows, variational autoencoders, or kernel density estimators over latent representations (Dias et al., 2020, Wiederer et al., 2021, Hu et al., 2024). Segment-wise anomaly scores (e.g., negative log likelihoods or reconstruction errors) are aggregated per trajectory (mean, median, top-k, or max) to yield coherent anomaly judgments robust to variable-length sequences or subsequence outliers (Dias et al., 2020, Lyu et al., 28 Jan 2026). Hybrid objectives—combining contrastive losses (task-alignment) and reconstruction/autoencoding losses (structural validity)—dominate in action-sequence domains (Advani, 2 Jan 2026).
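The aggregation step can be sketched in a few lines. This is an illustrative helper under stated assumptions, not the procedure of any cited method: `aggregate_scores` and its parameters are hypothetical names, and the inputs are assumed to be per-segment scores where larger means more anomalous (for example, negative log-likelihoods).

```python
import statistics


def aggregate_scores(segment_scores, method="mean", k=3):
    """Collapse per-segment anomaly scores into one trajectory-level score.

    method: "mean", "median", "max", or "topk" (mean of the k largest scores).
    """
    scores = list(segment_scores)
    if not scores:
        raise ValueError("segment_scores must not be empty")
    if method == "mean":
        return statistics.fmean(scores)
    if method == "median":
        return statistics.median(scores)
    if method == "max":
        return max(scores)
    if method == "topk":
        if k < 1:
            raise ValueError("k must be at least 1")
        # Slicing tolerates k larger than the number of segments.
        top = sorted(scores, reverse=True)[:k]
        return statistics.fmean(top)
    raise ValueError(f"unknown method: {method!r}")
```

The choice of aggregator is a trade-off: `max` and `topk` stay sensitive to a single bad segment in a long trajectory, while `mean` and `median` dilute short outlying subsequences but are less affected by isolated noisy segments.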
2.4 Hierarchical and Context-Aware Modeling
Hierarchical models factor trajectory generation into high-level intent transitions (subgoals) and low-level subtrajectory realization. In IHiD, intention congruence is enforced by Inverse Q-Learning over a Markov Decision Process of subgoals, while a diffusion model generates and scores subtrajectory reconstructions conditioned on subgoal pairs (Wang et al., 21 Sep 2025). Context-aware designs inject agent identity and semantic/geographic POI clusters into the encoding process, resulting in significantly higher anomaly detection precision and recall (Hu et al., 2024). In long-horizon and multi-modal observation regimes, Cyclic Factorized Transformers (CFT) leverage the dual periodicity of routines across intra-day and inter-day axes to induce structure in Hyperspectral Trajectory Images (Rahman et al., 26 Mar 2026).
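To illustrate what the intra-day and inter-day axes refer to, the sketch below folds a timestamped event stream into a day-by-slot grid. It is a deliberately simplified stand-in and not the Hyperspectral Trajectory Image construction of the cited work; the function name `fold_into_day_grid`, the UTC day boundary, and the last-value-wins rule are all assumptions made for illustration.

```python
def fold_into_day_grid(events, slots_per_day=24):
    """Fold (unix_seconds, value) events into {day_index: [slot values]}.

    Rows index days (inter-day axis); columns index time-of-day slots
    (intra-day axis). Empty slots hold None; a later event in the same
    slot overwrites an earlier one. Days are split at UTC midnight.
    """
    seconds_per_day = 86400
    if slots_per_day < 1 or seconds_per_day % slots_per_day != 0:
        raise ValueError("slots_per_day must evenly divide 86400")
    seconds_per_slot = seconds_per_day // slots_per_day

    grid = {}
    for ts, value in events:
        day, second_of_day = divmod(int(ts), seconds_per_day)
        slot = second_of_day // seconds_per_slot
        row = grid.setdefault(day, [None] * slots_per_day)
        row[slot] = value
    return grid
```

In such a layout, a routine that repeats daily appears as similar values down a column, which is the kind of regularity a model factorized over the two axes can exploit.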
2.5 Physics-Informed and Tool-Augmented Methods
Domain priors are injected through physics-based losses or kinematic regularization (e.g., enforcing 2D bicycle model dynamics in Pi-DPM) (Sharma et al., 8 Jun 2025), or explicit tool calls as in inspector-agent frameworks for visual anomaly detection (Tan et al., 20 May 2026). Physics-informed regularizers and tool-based feedback loops reduce false positives stemming from spurious deviations, ensuring that anomalies correspond to genuine violations rather than idiosyncratic sampling.
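A much simpler relative of these physics-informed losses is a hard feasibility filter on speed and acceleration. The sketch below is not the bicycle-model regularizer used in Pi-DPM; it assumes planar positions in metres sampled at a fixed interval `dt`, and the function name and bounds are hypothetical.

```python
import math


def infeasible_segments(points, dt, max_speed, max_accel):
    """Return sorted indices of segments that break speed/acceleration bounds.

    points: list of (x, y) positions in metres, sampled every dt seconds.
    Segment i connects points[i] and points[i + 1].
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    # Average speed over each segment, in metres per second.
    speeds = [math.dist(p, q) / dt for p, q in zip(points, points[1:])]

    flagged = set()
    for i, v in enumerate(speeds):
        if v > max_speed:
            flagged.add(i)
    # Change in speed between consecutive segments approximates acceleration.
    for i in range(1, len(speeds)):
        if abs(speeds[i] - speeds[i - 1]) / dt > max_accel:
            flagged.add(i)
    return sorted(flagged)
```

A filter of this kind only catches gross violations such as GPS jumps; a learned model with a kinematic loss term can additionally penalize trajectories that are individually within bounds but jointly implausible.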
3. Benchmarks, Metrics, and Performance
Benchmarks span synthetic perturbations, real-world audits, human movement and industrial inspection, intelligent transportation (Porto taxis, Beijing GeoLife), and multi-agent simulated traffic (Advani, 2 Jan 2026, Pathak et al., 6 Nov 2025, Ghoreishi et al., 2023, Mbuya et al., 22 Sep 2025).
Key metrics include:
- Precision, recall, and F1-score for the anomaly class (balanced and imbalanced regimes).
- Area under Precision-Recall Curve (AUC-PR) and mean Intersection-over-Union (mIoU) in segmentation/localization settings (Rahman et al., 26 Mar 2026).
- Latency: inference time per trajectory/sample to assess suitability for production deployment (Advani, 2 Jan 2026).
- Physical risk alignment: rank correlation (Spearman/Kendall) between anomaly scores and surrogate safety measures (e.g., Time-to-Collision) (Lyu et al., 28 Jan 2026).
- Recall on imbalanced or external test benchmarks, since a missed anomaly frequently costs more than a false alarm (Advani, 2 Jan 2026).
Empirical results show:
- F1-scores up to 0.94 and recall up to 0.92 on real-world agent trajectory fault datasets with Trajectory Guard (Advani, 2 Jan 2026).
- Pi-DPM achieving F1 = 0.98 on urban/maritime benchmarks, with large reductions in RMSE and false positive rates relative to VAE, DiffTraj, and other generative models (Sharma et al., 8 Jun 2025).
- IHiD achieving F1 gains of up to 30 points over state-of-the-art baselines on route-switching and long-term detour anomalies via hierarchical modeling (Wang et al., 21 Sep 2025).
- TITAnD (HTI + CFT) combining agent-level AUC-PR = 0.98 with 11–75× computational speedup over plain Transformers on multi-month dense/sparse datasets (Rahman et al., 26 Mar 2026).
4. Practical Algorithms and Deployment
Table: Core methods and their distinguishing features
| Approach | Key Model/Mechanism | Notable Domain or Strength |
|---|---|---|
| Trajectory Guard (Advani, 2 Jan 2026) | Siamese GRU autoencoder + hybrid loss | LLM action sequences, real-time gating |
| GRADINGS (Dias et al., 2020) | Normalizing flows on trajectory windows | GPS trajectory scoring (AUROC 0.91) |
| Pi-DPM (Sharma et al., 8 Jun 2025) | Diffusion generative model + physics | Maritime/urban, low FPR, physical priors |
| IHiD (Wang et al., 21 Sep 2025) | Hierarchical: IQL + diffusion | Long-time, subgoal intent, complex anomaly |
| GETAD (Mbuya et al., 22 Sep 2025) | GAT + Transformer + confidence-weighted NLL | Road-constrained, spatial structure |
| TITAnD (Rahman et al., 26 Mar 2026) | HTI + Cyclic Factorized Transformer | Multi-month, sparse-dense unified, high AUC |
| IndusAgent (Tan et al., 20 May 2026) | LLM+visual with agentic tools + RL | Industrial inspection, tool aggregation |
Deployment considerations include (i) latency (e.g., 32 ms per sample for LLM action-sequence pipelines, enabling real-time gating (Advani, 2 Jan 2026)), (ii) scalability with batch inference and efficient vectorization (Advani, 2 Jan 2026, Rahman et al., 26 Mar 2026), and (iii) synchronization with agent planning/execution pipelines for pre-emptive anomaly retargeting or escalation to human-in-the-loop review (Tan et al., 20 May 2026). Generic pipelines consist of (1) embedding/contextualization, (2) sequence or graph/cyclic encoding, (3) loss-based or density-based scoring, and (4) thresholding calibrated on balanced or control datasets.
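Step (4) of the generic pipeline, thresholding calibrated on a control dataset, can be sketched as an empirical-quantile rule. This is one common choice rather than the calibration used by any cited system; `calibrate_threshold`, `gate`, and the `target_fpr` parameter are hypothetical names, and the control scores are assumed to come from trajectories known to be normal.

```python
import math


def calibrate_threshold(normal_scores, target_fpr=0.05):
    """Pick a threshold so that at most about target_fpr of control
    (normal) scores lie strictly above it."""
    if not normal_scores:
        raise ValueError("normal_scores must not be empty")
    if not 0.0 < target_fpr < 1.0:
        raise ValueError("target_fpr must be in (0, 1)")
    ordered = sorted(normal_scores)
    # Position of the empirical (1 - target_fpr) quantile, clamped to range.
    idx = math.ceil((1.0 - target_fpr) * len(ordered)) - 1
    idx = max(0, min(len(ordered) - 1, idx))
    return ordered[idx]


def gate(score, threshold):
    """Route a scored trajectory: escalate for review or let it proceed."""
    return "escalate" if score > threshold else "allow"
```

For example, calibrating on held-out normal trajectories and calling `gate(score, threshold)` before an agent executes its plan gives a pre-emptive check; the threshold should be re-estimated when the distribution of normal behavior shifts.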
5. Failure Modes, Limitations, and Open Challenges
Current pipelines display strengths in explicit structure or density violations but may miss semantic or context-dependent anomalies (e.g., missing salient detail in Multi-Agentic LLM outputs when path statistics resemble normal traces (Pathak et al., 6 Nov 2025)). Path-dependent or subtly adversarial drifts often require feature enrichment, e.g., by contextual embeddings or semantic alignment metrics, to be detectable at scale. Domain transfer, online adaptation (e.g., for rapid concept drift), and scalable unsupervised methods remain open avenues for research (Pathak et al., 6 Nov 2025, Zorriassatine et al., 21 Apr 2026).
Reliance on graph embeddings or hand-engineered features may limit portability unless domain-specific topologies are encoded (e.g., in GETAD (Mbuya et al., 22 Sep 2025)). Labeling bottlenecks, especially in rare event settings (e.g., human falls or near-collisions (Zorriassatine et al., 21 Apr 2026, Lyu et al., 28 Jan 2026)), necessitate semi-supervised, unsupervised, or feedback-driven adaptation.
6. Emerging Directions and Future Scope
Emerging paradigms include:
- Structure-aware vision formulations (HTI) for cross-modal, cross-density transfer (Rahman et al., 26 Mar 2026).
- Agentic RL pipelines augmented with tool orchestration, enabling dynamic context assembly and targeted inspection (Tan et al., 20 May 2026).
- Physics-informed priors, hierarchical intent modeling, and attention factorization to achieve both sharp detection and scalable deployment (Sharma et al., 8 Jun 2025, Wang et al., 21 Sep 2025, Rahman et al., 26 Mar 2026).
- Online and sequential update architectures for real-time anomaly flagging and closed-loop safety management (Zorriassatine et al., 21 Apr 2026).
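As a minimal illustration of the online-update idea in the last item, the sketch below maintains a running mean and variance with Welford's algorithm and flags scores by z-score. It is a generic streaming baseline, not an architecture from the cited works; the class name, the warm-up length, and the decision to fold flagged values into the running statistics are assumptions.

```python
import math


class StreamingZScore:
    """Flag values that lie far from a running mean, updated one at a time."""

    def __init__(self, threshold=3.0, warmup=10):
        if warmup < 2:
            raise ValueError("warmup must be at least 2")
        self.threshold = threshold
        self.warmup = warmup
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Return True if x is anomalous, then absorb x into the statistics."""
        flagged = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            flagged = std > 0 and abs(x - self.mean) / std > self.threshold
        # Welford's update: numerically stable single-pass mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return flagged
```

Absorbing every value lets the baseline follow gradual drift, at the cost of letting a burst of anomalies inflate the variance; skipping flagged values is the opposite trade-off.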
Generalization to new modalities (e.g., industrial, healthcare, networked agents), richer causal reasoning (via LLM augmentation or knowledge-base integration), and multiscale or heterogeneous agent populations constitute a critical research frontier. A plausible implication is that unified architectures leveraging context, physical priors, and hybrid loss design will become the dominant paradigm for scalable, robust anomaly detection in agentic AI systems.