Security Copilot Phishing Triage Agent

Updated 3 July 2026

Security Copilot Phishing Triage Agent is an ML-based system that ingests and triages phishing alerts using one-hot encoding and PCA-derived features in a production SOC setting.
It employs a Random Forest classifier with a 0.9 confidence threshold and cosine similarity-based incident retrieval to ensure high analyst trust and contextual accuracy.
Integrated within Microsoft Defender XDR, the system couples triage with investigation and remediation, leveraging continuous human feedback for adaptive retraining.

Security Copilot Phishing Triage Agent denotes an instantiation of the triage component of Microsoft Copilot for Security Guided Response (CGR) for phishing-related events. In the synthesized design, it is a geo-distributed ML service that ingests phishing alerts, encodes them via one-hot encoding and PCA into a 40-dimensional vector $\phi(x)$, applies a classifier trained on analyst-labeled incidents, filters predictions by a $0.9$ confidence threshold, and augments analysts with similar-incident lookups via IncidentHash and cosine similarity. Within the broader CGR framework, triage is one of three coordinated tasks (investigation, triaging, and remediation) and is integrated into Microsoft Defender XDR as part of a guided-response workflow for SOC analysts (Freitas et al., 2024).

1. Functional scope within guided response

Security operation centers contend with a constant stream of security incidents, ranging from straightforward to highly complex. CGR was developed as an industry-scale ML architecture that guides security analysts across three key tasks: investigation, providing essential historical context by identifying similar incidents; triaging to ascertain the nature of the incident—whether it is a true positive, false positive, or benign positive; and remediation, recommending tailored containment actions. In the phishing-focused instantiation, the emphasis is the triage component that classifies incoming phishing-related events into True-Positive (TP), False-Positive (FP), and Benign/Informational (BP) (Freitas et al., 2024).

CGR is integrated into the Microsoft Defender XDR product and deployed worldwide, generating millions of recommendations across thousands of customers. The phishing triage agent therefore belongs to a production SOC setting rather than to an isolated URL-classification benchmark. This suggests that its primary design objective is not merely standalone detection accuracy, but analyst guidance under operational constraints such as trust calibration, queue load, privacy boundaries, and downstream remediation (Freitas et al., 2024).

2. Ingestion, feature construction, and incident representation

Microsoft Defender XDR continuously streams alerts, including email-based alerts flagged for potential phishing, into Azure Data Lake Storage (ADLS). Every 15 minutes, the Inference Pipeline pulls the latest batch of alerts and applies the same feature-engineering pipeline used at training time. The categorical features are OrgId, DetectorId, ProductId, Category, and Severity. The numerical and engineered features comprise 67 total signals, including URL reputation score, domain-age, sender-IP threat score, antispam filter direction, attachment metadata, email header anomalies, message routing hops, suspicion level, and LastVerdict. One-hot encoding is applied after grouping rare IDs, followed by PCA projection to $k=40$ dimensions capturing $95\%$ variance (Freitas et al., 2024).
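The rare-ID grouping and one-hot step described above can be illustrated with a minimal sketch. It is not the production pipeline: the `__RARE__` bucket label, the `min_count` cutoff, and the sample detector IDs are all hypothetical, and the subsequent PCA projection to 40 dimensions is omitted.

```python
from collections import Counter

RARE = "__RARE__"  # hypothetical bucket label for infrequent IDs


def group_rare(values, min_count):
    """Replace values seen fewer than min_count times with a shared bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else RARE for v in values]


def one_hot(values):
    """Return (vocabulary, rows); each row is a 0/1 indicator list."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    rows = []
    for v in values:
        row = [0] * len(vocab)
        row[index[v]] = 1
        rows.append(row)
    return vocab, rows


# Hypothetical DetectorId column: D7 and D9 each occur once and are bucketed.
detector_ids = ["D1", "D1", "D2", "D1", "D7", "D2", "D9"]
grouped = group_rare(detector_ids, min_count=2)
vocab, encoded = one_hot(grouped)
```

Grouping before encoding keeps the one-hot width bounded when a column such as OrgId or DetectorId has a long tail of infrequent values.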

Alerts are then aggregated into incidents by IncidentId; numerical features are summed, and categorical embeddings are averaged or majority-voted. The phishing-specific signals built during feature construction cover email header anomalies such as SPF/DKIM failure flags and “From” versus envelope-sender mismatches; URL reputation from static blacklists and machine-learning–derived risk scores; attachment metadata including file type, embedded macros, and hash-based threat family labels; network indicators such as sender IP threat score, geolocation, and ASN; and antispam engine outputs including AntispamDirection, SuspicionLevel, and LastVerdict (Freitas et al., 2024).
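As a rough illustration of the aggregation rule, the sketch below groups alert records by IncidentId, sums a numerical feature, and majority-votes a categorical one. The field names and values are hypothetical, and averaging of categorical embeddings is not shown.

```python
from collections import Counter, defaultdict


def aggregate_incidents(alerts, numeric_keys, categorical_keys):
    """Group alerts by IncidentId; sum numerics, majority-vote categoricals."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["IncidentId"]].append(alert)

    incidents = {}
    for incident_id, members in grouped.items():
        row = {}
        for key in numeric_keys:
            row[key] = sum(a[key] for a in members)
        for key in categorical_keys:
            # most_common orders equal counts by first occurrence
            row[key] = Counter(a[key] for a in members).most_common(1)[0][0]
        incidents[incident_id] = row
    return incidents


# Hypothetical alert records; real alerts carry many more signals.
alerts = [
    {"IncidentId": 1, "SenderIpThreatScore": 0.5, "Severity": "High"},
    {"IncidentId": 1, "SenderIpThreatScore": 0.25, "Severity": "High"},
    {"IncidentId": 1, "SenderIpThreatScore": 0.25, "Severity": "Low"},
    {"IncidentId": 2, "SenderIpThreatScore": 0.125, "Severity": "Low"},
]
incidents = aggregate_incidents(alerts, ["SenderIpThreatScore"], ["Severity"])
```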

This representation is incident-centric rather than message-centric. A plausible implication is that the agent is designed to exploit cross-alert regularities within a single incident boundary, rather than treating each email artifact as an independent sample.

3. Classification logic, confidence control, and feedback

The preprocessed incident vectors $\phi(x) \in \mathbb{R}^{40}$ are fed to the triage model $f$. For each incident $x$, the model outputs class scores

$f(x) = \operatorname{softmax}(W\,\phi(x) + b) \in \Delta^3,$

where $\phi(x)$ is the PCA embedding, $W \in \mathbb{R}^{3 \times 40}$, and $b \in \mathbb{R}^{3}$. The three classes are TP, FP, and BP. Confidence is computed as $\max_{c} f_c(x)$, and only predictions with confidence $\geq 0.9$ are surfaced in order to ensure high analyst trust. The feature representation is specified as

$\phi(x) = \operatorname{PCA}_{40}\big(\operatorname{OneHot}(x)\big) \in \mathbb{R}^{40}.$

The training objective is multi-class cross-entropy,

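$\mathcal{L}(W, b) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c \in \{\mathrm{TP}, \mathrm{FP}, \mathrm{BP}\}} y_{i,c} \log f_c(x_i),$

where $y_{i,c}$ is the one-hot analyst label of incident $x_i$ and $N$ is the number of training incidents.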

The design description also specifies the model type as a supervised classifier using Scikit-learn’s Random Forest, chosen over deep nets for tabular efficiency on CPU clusters. Hyperparameters are tuned by grid search on a stratified train/validation/test split, and inference-time thresholding enforces the precision target via the $0.9$ confidence threshold (Freitas et al., 2024).
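The inference-time gate can be sketched as follows, assuming confidence is taken to be the largest class probability. The raw scores are hypothetical. With a Random Forest the per-class probabilities would come from the model directly, so the softmax here only stands in for whatever produces them.

```python
import math

CLASSES = ("TP", "FP", "BP")


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def triage(scores, threshold=0.9):
    """Return (label, confidence); label is None when below the threshold."""
    probs = softmax(scores)
    confidence = max(probs)
    label = CLASSES[probs.index(confidence)]
    return (label if confidence >= threshold else None), confidence


# Hypothetical raw scores for two incidents.
confident_label, confident_score = triage([4.0, 0.5, 0.2])
withheld_label, withheld_score = triage([1.0, 0.8, 0.2])
```

The first incident clears the 0.9 threshold and is surfaced as TP. The second is withheld and falls back to ordinary analyst review.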

The agent is explicitly human-in-the-loop. Analysts confirm or override the CGR suggestion directly in Defender XDR, and confirmed labels re-enter the weekly Train Pipeline, closing the loop for continuous learning. Production guidance further specifies explainability through top-3 feature importances for each prediction—for example, “High URL reputation score” or “SPF failure”—as well as weekly retraining using the latest confirmed labels (Freitas et al., 2024).

The operational emphasis on thresholding and analyst override differentiates triage from unconstrained batch classification. This suggests that the surfaced prediction set is deliberately smaller than the total incident population in exchange for higher trust and lower analyst-dispute cost.

4. Historical-context retrieval and analyst augmentation

CGR’s investigation sub-skill retrieves up to 5 prior incidents to show historical context. Matching proceeds in three steps. First, the system performs an exact IncidentHash match, defined as the SHA1 of the sorted DetectorId list. Second, it computes cosine similarity on PCA embeddings,

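$\operatorname{sim}(x_i, x_j) = \frac{\phi(x_i)^{\top} \phi(x_j)}{\lVert \phi(x_i) \rVert \, \lVert \phi(x_j) \rVert}.$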

Third, it returns the top-$k$ candidates with $k = 5$, prioritizing exact matches and then those with the highest cosine similarity (Freitas et al., 2024).
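The three matching steps can be sketched with the standard library alone. Several details are assumptions: the comma separator used before hashing, the dictionary layout of an incident, and the 2-dimensional embeddings that stand in for the 40-dimensional PCA vectors.

```python
import hashlib
import math


def incident_hash(detector_ids):
    """SHA-1 hex digest of the sorted DetectorId list (separator assumed)."""
    joined = ",".join(sorted(detector_ids))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, history, k=5):
    """Rank exact hash matches first, then by cosine similarity; keep top k."""
    q_hash = incident_hash(query["detectors"])
    scored = []
    for past in history:
        exact = incident_hash(past["detectors"]) == q_hash
        sim = cosine(query["embedding"], past["embedding"])
        scored.append((exact, sim, past["id"]))
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [incident_id for _, _, incident_id in scored[:k]]


# Hypothetical incidents.
query = {"detectors": ["D2", "D1"], "embedding": [1.0, 0.0]}
history = [
    {"id": "A", "detectors": ["D3"], "embedding": [1.0, 0.0]},
    {"id": "B", "detectors": ["D1", "D2"], "embedding": [0.0, 1.0]},
    {"id": "C", "detectors": ["D4"], "embedding": [1.0, 1.0]},
]
top_two = retrieve(query, history, k=2)
```

Incident B ranks first despite its low cosine similarity because its detector set matches exactly. Incident A follows as the nearest neighbor in embedding space.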

This retrieval mechanism functions as an evidence-reuse layer for phishing triage. Rather than asking analysts to interpret a class label in isolation, the agent can present historically similar incidents, thereby supplying case-based context for why an incident is likely TP, FP, or BP. The same production guidance states that triage decisions can be combined with Copilot’s language-model incident summaries to give a unified analyst experience (Freitas et al., 2024).

A common misconception is to treat phishing triage agents as only verdict generators. In the CGR formulation, triage is materially linked to investigation through nearest-neighbor retrieval and to remediation through guided containment recommendations. The system therefore operates as a guided-response service rather than as a single-pass detector.

5. GUIDE dataset and quantitative performance

The principal data resource for this line of work is GUIDE, described as the largest public collection of real-world security incidents. In the phishing-triage configuration, the dataset contains 1 M incidents and 1.6 M alerts, with 13 M evidence rows across 33 entity types such as URLs, IPs, files, and emails. It includes 1 M triage-labeled incidents with TP, FP, and BP labels, as well as 26 k alert-level remediation labels. Ground truth is derived from SOC analyst grades stored in Defender XDR; incidents with multiple alerts use majority voting, with ties mapped to TP. Researchers can reproduce triage-model benchmarks using the 70/30 train/test split on Kaggle (Freitas et al., 2024).
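The labeling rule for multi-alert incidents can be illustrated with a short sketch. It assumes that any tie for first place maps to TP, including a tie between FP and BP, and that each incident has at least one graded alert. The source does not spell out either detail.

```python
from collections import Counter


def incident_label(alert_grades):
    """Majority vote over alert grades; a tie for first place maps to TP."""
    counts = Counter(alert_grades).most_common()
    top_count = counts[0][1]
    leaders = [label for label, n in counts if n == top_count]
    return leaders[0] if len(leaders) == 1 else "TP"


clear_majority = incident_label(["FP", "FP", "BP"])
tie_case = incident_label(["FP", "BP"])
single_alert = incident_label(["BP"])
```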

On offline triage evaluation, macro-averaged over 12 regions, precision, recall, and F1 are reported at the model’s maximum macro-F1 point. When the operating point is shifted to enforce the precision target, the system covers only part of the phishing incident population. Error analysis attributes the bulk of the remaining errors to TP/BP or FP/BP confusions, with only a small share being critical TP versus FP swaps. ROC/AUC is not the primary operational metric because precision–recall is used for skewed classes, although high per-region AUCs are also reported (Freitas et al., 2024).

These results formalize the trade-off between coverage and analyst trust. The agent is not presented as covering all phishing incidents at the strict precision target. Instead, it operates as a high-confidence triage layer over a larger alert stream.
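The trade-off can be made concrete with a toy calculation. The predictions below are invented for illustration and are not drawn from GUIDE. Precision is simplified to the fraction of surfaced predictions that are correct, rather than a macro-averaged per-class figure.

```python
def precision_and_coverage(predictions, threshold):
    """predictions: list of (confidence, predicted_label, true_label)."""
    surfaced = [p for p in predictions if p[0] >= threshold]
    coverage = len(surfaced) / len(predictions)
    if not surfaced:
        return None, coverage
    correct = sum(1 for _, pred, truth in surfaced if pred == truth)
    return correct / len(surfaced), coverage


# Invented predictions: (confidence, predicted, true).
predictions = [
    (0.97, "TP", "TP"),
    (0.93, "FP", "FP"),
    (0.91, "BP", "BP"),
    (0.85, "TP", "FP"),
    (0.70, "FP", "FP"),
    (0.60, "BP", "TP"),
    (0.55, "TP", "TP"),
    (0.50, "FP", "BP"),
]
strict = precision_and_coverage(predictions, threshold=0.9)
loose = precision_and_coverage(predictions, threshold=0.5)
```

In this toy data, raising the threshold from 0.5 to 0.9 lifts precision on surfaced items from 0.625 to 1.0 while coverage drops from 1.0 to 0.375.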

6. Deployment practice and analyst-facing impact

Production deployment guidance specifies high-confidence thresholding to minimize analyst distrust, adaptive retraining with weekly confirmed labels to adapt to new phishing tactics and new detectors, regional isolation through geo-replicated data and models to satisfy privacy and regulatory constraints, and monitoring of triage coverage percentage together with per-class precision/recall drift to trigger out-of-cycle retraining if performance falls below defined SLAs (Freitas et al., 2024).
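A monitoring check of this kind might look like the sketch below. The metric names and SLA floors are hypothetical, since the source does not state the thresholds used in production.

```python
def breached_slas(metrics, sla_floors):
    """Return the sorted names of metrics that fell below their SLA floor.

    A metric missing from the report is treated as 0.0, so it counts as a breach.
    """
    return sorted(
        name for name, floor in sla_floors.items()
        if metrics.get(name, 0.0) < floor
    )


# Hypothetical floors and one region's weekly metrics.
sla_floors = {"coverage": 0.30, "precision_TP": 0.90, "recall_TP": 0.40}
weekly_metrics = {"coverage": 0.35, "precision_TP": 0.88, "recall_TP": 0.45}

breaches = breached_slas(weekly_metrics, sla_floors)
retrain_now = bool(breaches)  # any breach triggers out-of-cycle retraining
```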

A separate randomized controlled trial evaluated the Microsoft Security Copilot Phishing Triage Agent in the analyst workflow. The study recruited 167 professional security analysts across three arms: Control, Aware, and Blind. Each analyst triaged a forced-choice queue of 25 emails composed of 4 real malicious, 19 real benign, 1 synthetic false positive, and 1 synthetic false negative. Under corpus ground truth, Aware versus Control produced a 6.53× increase in true positives per minute together with an improvement in F1. Behavioral analysis further showed that agent-augmented analysts spent more time on malicious emails, and the study reported no significant increase in blind acceptance of agent true positives or synthetic false positives (Bono, 17 Nov 2025).

The RCT isolates two workflow mechanisms: queue prioritization and verdict explanations. In the “resolve-benign” protocol, analysts only review items flagged malicious by the agent. This suggests that the practical value of the phishing triage agent lies not only in per-item classification quality, but also in reordering analyst attention.

7. Relation to adjacent phishing-agent paradigms

Recent phishing-agent research situates the Security Copilot formulation within a broader design space. Interactive URL forensics systems such as TraceScope decouple a sandboxed operator agent, an immutable evidence bundle, and an adjudicator agent that verifies a MITRE ATT&CK–style checklist, and report precision, recall, and F1 on 708 reachable URLs (Zhang et al., 23 Apr 2026). Multimodal webpage systems such as PhishAgent combine offline and online knowledge bases with MLLMs and report accuracy, F1, and per-sample latency on TR-OP (Cao et al., 2024). A two-tiered agentic design, URL-first and then multimodal, reports accuracy, precision, recall, and F1 for GPT-4o mini, while processing more websites per \$100 than the always-multimodal alternative (Trad et al., 2024).

Other strands emphasize conversational semantics, debate, or prompt robustness. Cyri performs local email analysis with Meta-Llama-3.1-8B and reports final classification accuracy, precision, recall, and F1 on an 840-email dataset (Torre et al., 9 Feb 2025). Debate-structured systems report that mixed-agent configurations such as GPT-4–LLaMA-2–GPT-4 outperform homogeneous ensembles across multiple phishing email datasets (Nguyen et al., 27 Mar 2025), while PhishDebate reports recall, precision, accuracy, and F1 on real-world phishing website datasets (Li et al., 18 Jun 2025). By contrast, prompt-robustness work argues that prompt-model interaction is a first-order security variable: a single model’s phishing bypass rate can vary widely depending on configuration, and optimized prompts can achieve high recall at a low false positive rate while creating brittle single-signal dependence; that work introduces Safetility as a deployability-aware metric and argues that closing the adversarial gap likely requires tool augmentation with external ground truth (Litvak, 26 Mar 2026).

Taken together, these systems indicate that “phishing triage agent” now names a family of architectures rather than a single algorithmic recipe. This suggests three recurrent axes of variation: incident-level tabular triage versus webpage-level multimodal analysis, single-pass classification versus interactive evidence collection, and prompt-only reasoning versus reasoning grounded in tool outputs and analyst feedback. Within that space, the Security Copilot Phishing Triage Agent is most precisely characterized as a SOC-oriented, incident-centric, high-confidence guided-response service anchored in analyst-labeled incidents and continuous human feedback (Freitas et al., 2024).