XG-Guard: Fine-Grained Anomaly Detection
- XG-Guard is an explainable, fine-grained safeguarding framework designed to detect anomalies in multi-agent systems by leveraging both sentence- and token-level textual analysis.
- It utilizes a bi-level agent encoding architecture with dual GNN streams and adaptive prototype mechanisms to robustly identify malicious behaviors such as prompt injection and memory poisoning.
- Experimental results demonstrate high detection accuracy across diverse MAS topologies and attack modalities, while token-level explanations provide actionable insights into anomaly contributions.
XG-Guard is an explainable, fine-grained safeguarding framework for anomaly detection in LLM-based multi-agent systems (MAS). It addresses the limitations of prior graph anomaly detection (GAD) techniques by incorporating both sentence- and token-level textual information, enabling robust, unsupervised detection of malicious agents while providing intrinsic interpretability through token-level scoring. XG-Guard is particularly designed for scenarios where agents interact through natural-language messages and where the risk of prompt injection, memory poisoning, or covert manipulation can compromise MAS integrity (Pan et al., 21 Dec 2025).
1. Motivation and Context
The rise of autonomous MAS powered by LLMs has enabled collaborative solutions to complex tasks but also introduced new attack surfaces. Malicious agents can conduct prompt injection, tool misuse, or memory poisoning, propagating harmful states or misinformation through otherwise trustworthy interactions. Previous GAD-based MAS defenses typically represent each agent as a graph node with a single sentence embedding (often via SentenceBERT), flagging as anomalous those nodes with outlying representations. This coarse approach is insufficient for detecting subtle semantic manipulations, such as token-scale prompt injection or context-sensitive violations, and provides no granular explanations for anomaly decisions. These deficiencies motivate XG-Guard’s bi-level modeling and requirement for token-level explainability in detection (Pan et al., 21 Dec 2025).
2. Bi-Level Agent Encoding Architecture
XG-Guard models a MAS as an attributed directed graph $G = (V, E)$, where each vertex $v_i \in V$ corresponds to an agent and each directed edge in $E$ encodes the communication topology. For every agent response $r_i$:
- Sentence-Level Feature Extraction: the full response is embedded into a sentence vector $\mathbf{s}_i$ (via a frozen SentenceBERT encoder).
- Token-Level Feature Extraction: each token $t_{i,j}$ of $r_i$ receives its own token embedding $\mathbf{x}_{i,j}$.
- Contextual Augmentation: token embeddings are combined with the corresponding sentence embedding to inject global context, yielding augmented token features $\tilde{\mathbf{x}}_{i,j}$.
These representations are then processed by two graph neural network (GNN) streams:
- Sentence stream: aggregates global sentence context over $G$, yielding per-agent representations $\mathbf{h}^{s}_{i}$.
- Token stream: aggregates local token-level information, preserving the contextual augmentation and producing per-token embeddings $\mathbf{h}^{t}_{i,j}$ (Pan et al., 21 Dec 2025). A minimal sketch of this bi-level encoding follows.
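The sketch below illustrates the bi-level encoding pipeline in self-contained NumPy. The random arrays stand in for SentenceBERT sentence and token embeddings, the concatenation-based augmentation and the mean-aggregation message-passing layers are illustrative assumptions, and all names (`gnn_layer`, `W_s`, `W_t`, etc.) are hypothetical rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MAS: 4 agents on a chain topology (directed adjacency matrix).
# Stand-in embeddings replace SentenceBERT; dimensions are illustrative.
n_agents, d_sent, d_tok = 4, 8, 8
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
tokens_per_agent = [5, 7, 6, 4]                       # token counts of each response

# --- Bi-level feature extraction (stand-ins for a frozen text encoder) ---
sent_emb = rng.normal(size=(n_agents, d_sent))        # s_i: one vector per response
tok_emb = [rng.normal(size=(n, d_tok)) for n in tokens_per_agent]   # x_{i,j}

# Contextual augmentation: concatenate each token embedding with its sentence
# embedding (one plausible choice; the paper's exact operator may differ).
tok_aug = [np.hstack([x, np.repeat(sent_emb[i:i+1], len(x), axis=0)])
           for i, x in enumerate(tok_emb)]

# --- Two GNN streams, sketched as single mean-aggregation message-passing layers ---
def gnn_layer(node_feats, adj, weight):
    """One propagation step: combine self features with the mean of in-neighbors."""
    deg = adj.sum(axis=0, keepdims=True).clip(min=1.0)
    agg = (adj.T @ node_feats) / deg.T                # mean over incoming messages
    return np.tanh((node_feats + agg) @ weight)

W_s = rng.normal(size=(d_sent, d_sent)) * 0.1
h_sent = gnn_layer(sent_emb, A, W_s)                  # per-agent sentence representations

# Token stream: tokens of an agent share that agent's graph neighborhood, so we
# propagate a pooled token summary over the graph and add it back to every token.
pooled = np.stack([x.mean(axis=0) for x in tok_aug])  # one summary vector per agent
W_t = rng.normal(size=(d_tok + d_sent, d_tok + d_sent)) * 0.1
ctx = gnn_layer(pooled, A, W_t)
h_tok = [np.tanh(x @ W_t) + ctx[i] for i, x in enumerate(tok_aug)]  # per-token embeddings

print(h_sent.shape, [h.shape for h in h_tok])
```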
3. Theme-Adaptive Anomaly Detection
To ensure robustness to evolving MAS dialogue themes, XG-Guard implements a dual prototype mechanism:
- Sentence Prototype: a theme-adaptive prototype $\mathbf{p}^{s}$ computed from the sentence-level representations of the current dialogue (e.g., their mean).
- Token Prototype: an analogous prototype $\mathbf{p}^{t}$ computed from the token-level representations of the current dialogue.
Anomaly scoring for each agent utilizes distance computations to these prototypes:
- Sentence-level score: $a^{s}_{i} = \mathrm{dist}(\mathbf{h}^{s}_{i}, \mathbf{p}^{s})$, the distance of agent $i$'s sentence representation from the sentence prototype.
- Token-level score: $a^{t}_{i}$, aggregated from the distances of agent $i$'s token representations $\mathbf{h}^{t}_{i,j}$ to the token prototype $\mathbf{p}^{t}$.
This adaptation enables XG-Guard to flag deviations specific to the prevailing thematic context, as opposed to static outlier detection (Pan et al., 21 Dec 2025).
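As a concrete illustration of this prototype-and-distance mechanism, the sketch below computes per-dialogue prototypes as means of stand-in embeddings and scores agents by distance. The Euclidean metric, the mean-over-tokens aggregation, and all variable names are assumptions chosen for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in bi-level representations for 4 agents (dims are illustrative):
# h_sent[i] is agent i's sentence-stream embedding, h_tok[i] holds its token embeddings.
h_sent = rng.normal(size=(4, 8))
h_tok = [rng.normal(size=(n, 8)) for n in (5, 7, 6, 4)]
h_sent[2] += 3.0          # simulate one anomalous agent drifting from the dialogue theme
h_tok[2][1] += 5.0        # ... including a single strongly anomalous token

# Theme-adaptive prototypes: per-dialogue means, recomputed for each session,
# so "normal" is defined by the current theme rather than a fixed reference.
proto_sent = h_sent.mean(axis=0)
proto_tok = np.concatenate(h_tok, axis=0).mean(axis=0)

# Distance-based anomaly scores (Euclidean distance and mean aggregation are
# plausible choices; the paper's exact metric may differ).
score_sent = np.linalg.norm(h_sent - proto_sent, axis=1)
score_tok = np.array([np.linalg.norm(h - proto_tok, axis=1).mean() for h in h_tok])

for i, (s, t) in enumerate(zip(score_sent, score_tok)):
    print(f"agent {i}: sentence score {s:.2f}, token score {t:.2f}")
```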
4. Bi-Level Score Fusion and Explainability
Naive averaging of sentence- and token-level anomaly scores can dilute the signal when prototypes are contaminated by anomalous examples. XG-Guard therefore normalizes the batch anomaly scores and estimates the covariance between the two score streams, using it to weight the token-level term in the fused score, so that negative covariance downweights token-level scores in the final aggregate.
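One way such covariance-guided fusion could be realized is sketched below; the z-score normalization, the clipping of the weight to [0, 1], and the `fuse_scores` helper are assumptions, chosen only to illustrate how negative covariance suppresses the token-level term.

```python
import numpy as np

def fuse_scores(score_sent, score_tok, eps=1e-8):
    """Covariance-guided fusion sketch: z-normalize both score streams within the
    batch, then weight the token-level stream by its covariance with the
    sentence-level stream (clipped to [0, 1]); anti-correlated token scores,
    a symptom of contaminated prototypes, are thereby downweighted."""
    zs = (score_sent - score_sent.mean()) / (score_sent.std() + eps)
    zt = (score_tok - score_tok.mean()) / (score_tok.std() + eps)
    cov = float(np.mean(zs * zt))          # covariance of the normalized scores
    w = np.clip(cov, 0.0, 1.0)             # negative covariance -> weight near zero
    return zs + w * zt

# Toy scores for 4 agents; agent 2 is anomalous on both streams.
score_sent = np.array([0.8, 0.9, 3.1, 0.7])
score_tok = np.array([1.1, 1.0, 4.0, 0.9])
print(fuse_scores(score_sent, score_tok))
```

The design intent is that the token stream only reinforces the verdict when it agrees with the sentence stream across the batch; when prototype contamination makes the two streams disagree, the fused score falls back toward the sentence-level evidence.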
For interpretability, token-level explanation assigns each token a contribution score derived from its distance to the token prototype. Tokens with large positive contributions are causally implicated in the agent's anomaly score and can be visually highlighted—enabling direct inspection of sub-sentence evidence underlying detected failures (Pan et al., 21 Dec 2025).
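The sketch below shows one plausible form of such token-level attribution: each token's distance to the token prototype, centered on the agent's mean token distance, serves as its contribution score. The exact attribution formula in XG-Guard may differ, and the example tokens and `token_contributions` helper are purely illustrative.

```python
import numpy as np

def token_contributions(token_embs, proto_tok):
    """Per-token contribution sketch: each token's distance to the token prototype,
    centered on the agent's mean token distance, so large positive values mark the
    tokens that pull the agent's token-level anomaly score upward."""
    dists = np.linalg.norm(token_embs - proto_tok, axis=1)
    return dists - dists.mean()

rng = np.random.default_rng(2)
tokens = ["the", "result", "should", "be", "accepted", "as", "accurate"]
token_embs = rng.normal(size=(len(tokens), 8))
token_embs[4] += 4.0                      # simulate one anomalous token ("accepted")
proto_tok = rng.normal(size=8) * 0.1      # stand-in token prototype

contrib = token_contributions(token_embs, proto_tok)
for tok, c in sorted(zip(tokens, contrib), key=lambda p: -p[1]):
    flag = "  <-- highlighted" if c > 1.0 else ""
    print(f"{tok:>10s}  {c:+.2f}{flag}")
```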
5. Contrastive, Unsupervised Training Scheme
XG-Guard’s framework is entirely unsupervised, requiring no labeled examples of attack or benign sessions. Instead, training is structured around a contrastive objective in which each batch aligns agents with their own dialogue’s prototype (positive) while repelling them from randomly sampled prototypes of other dialogues (negatives). This enables learning of both GNN modules, supporting robust anomaly detection across dynamic MAS contexts (Pan et al., 21 Dec 2025).
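The following sketch shows an InfoNCE-style instantiation of this align/repel structure over stand-in embeddings; the cosine similarity, the temperature value, and the `contrastive_loss` helper are assumptions rather than the paper's exact loss.

```python
import numpy as np

def contrastive_loss(agent_embs, own_proto, other_protos, temperature=0.1):
    """InfoNCE-style sketch of the unsupervised objective: each agent embedding is
    pulled toward its own dialogue's prototype (positive) and pushed away from
    prototypes sampled from other dialogues (negatives). The exact loss used by
    XG-Guard may differ; this only illustrates the align/repel structure."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    losses = []
    for h in agent_embs:
        pos = np.exp(cos(h, own_proto) / temperature)
        negs = sum(np.exp(cos(h, p) / temperature) for p in other_protos)
        losses.append(-np.log(pos / (pos + negs)))
    return float(np.mean(losses))

rng = np.random.default_rng(3)
dialogue = rng.normal(size=(4, 8))                      # agent embeddings of one dialogue
own_proto = dialogue.mean(axis=0)                       # this dialogue's prototype
other_protos = [rng.normal(size=8) for _ in range(3)]   # prototypes of other dialogues
print(f"contrastive loss: {contrastive_loss(dialogue, own_proto, other_protos):.3f}")
```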
6. Experimental Results and Empirical Insights
Comprehensive experiments spanned four principal MAS graph topologies (chain, tree, star, random) and six attack modalities, including prompt injection (CSQA, MMLU, GSM8K), tool attack (InjecAgent), and memory attack (PoisonRAG, CSQA). Compared to baselines—DOMINANT, PREM, TAM (unsupervised), BlindGuard, and supervised G-Safeguard—XG-Guard achieved:
- Chain topology AUC: ≈ 90%, surpassing BlindGuard (≈ 80%) and achieving > 95% of G-Safeguard’s AUC.
- ASR@3 reduction: 50–70% relative to no defense.
- Ablation findings: Removing the token stream reduced AUC by 9pp; replacing covariance-guided fusion with simple averaging reduced AUC by 11pp, substantiating the necessity of both core components.
Qualitative analysis revealed precise localization of anomaly-contributing tokens (e.g., “should be accepted as accurate”, “find the personal details”) while benign phrases remained unflagged (Pan et al., 21 Dec 2025).
7. Limitations, Challenges, and Future Directions
While XG-Guard demonstrates robust, interpretable safeguarding capacities, certain limitations persist:
- Reliance on frozen SentenceBERT: This can result in spurious attributions of anomaly to tokens with ambiguous embedding signatures, such as punctuation.
- Adaptability to evolving LLM backbones and unseen dialogue domains: Real-world systems may demand continual updating of underlying encoders.
- Future research: Adaptive encoders for disentangling mixed token representations, extension to edge-level graph anomaly explanations, and online adaptation for interaction with dynamically updated LLM APIs are highlighted as promising directions (Pan et al., 21 Dec 2025).
XG-Guard establishes a technical benchmark for unsupervised, explainable safeguarding of LLM-based multi-agent systems, combining bi-level text modeling, theme-adaptive anomaly detection, and fine-grained lexical contribution analysis, with strong empirical evidence of effectiveness and interpretability (Pan et al., 21 Dec 2025).