CERT r5.2 and r6.2 Datasets Overview
- CERT r5.2 and r6.2 datasets are synthetically generated corpora that simulate insider threat scenarios through detailed activity logs and segmented sub-sessions.
- They enable rigorous benchmarking of ITD models by supporting evaluation of graph-based and temporal deep learning frameworks with controlled threat windows.
- The datasets’ structured preprocessing, explicit and implicit graph constructs, and tailored evaluation metrics significantly enhance anomaly detection research.
The CERT r5.2 and r6.2 datasets are synthetically generated corpora designed to facilitate research in insider threat detection (ITD). Emphasizing real-world complexity via fine-grained activity logs, these datasets underpin benchmarking and methodological studies in behavioral modeling, anomaly detection, and sequence-based forensics. The datasets have achieved prominence in evaluating advanced computational frameworks—most notably, graph-based and temporal deep learning models—as evidenced by their use in recent state-of-the-art detection systems such as the GCN and Bi-LSTM framework with explicit and implicit graph representations (Yumlembam et al., 20 Dec 2025).
1. Dataset Composition and Preprocessing
CERT r5.2
CERT r5.2 comprises log traces from 10 randomly selected users. The training and validation window spans all of 2010, while the test partition covers six months (January–June 2011). User activity is segmented into "sessions," each beginning at logon and terminating at logoff (or carried over to the next logon when no logoff event is recorded). Sessions are divided into "sub-sessions" of up to 50 activities, with smaller fragments (<5 activities) merged into their immediate neighbors.
Each sub-session yields 50 masked sequences (masking is applied sequentially to each activity in turn). The dataset statistics are: 146,721 benign and 84 malicious masked sub-sequences for training; 17,429 benign and 33 malicious sub-sequences for testing. The positive (malicious insider) ratio in training is approximately 0.057% (computed as $84/(146{,}721+84)$).
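The segmentation and masking protocol above can be sketched as follows; the constants and the merge direction (a trailing fragment folded into the preceding chunk) are illustrative assumptions, as is the `MASK_CODE` placeholder:

```python
from typing import List

SUB_SESSION_LEN = 50   # maximum activities per sub-session
MIN_FRAGMENT = 5       # fragments shorter than this are merged with a neighbor
MASK_CODE = 0          # placeholder code for the masked position (assumed)

def split_sub_sessions(session: List[int]) -> List[List[int]]:
    """Split one logon-to-logoff session into sub-sessions of up to
    SUB_SESSION_LEN activities, merging a trailing fragment (<MIN_FRAGMENT
    activities) into the preceding sub-session."""
    chunks = [session[i:i + SUB_SESSION_LEN]
              for i in range(0, len(session), SUB_SESSION_LEN)]
    if len(chunks) > 1 and len(chunks[-1]) < MIN_FRAGMENT:
        chunks[-2].extend(chunks.pop())
    return chunks

def masked_sequences(sub_session: List[int]) -> List[List[int]]:
    """One masked copy per activity: position i is replaced by MASK_CODE."""
    return [sub_session[:i] + [MASK_CODE] + sub_session[i + 1:]
            for i in range(len(sub_session))]
```

A session of 104 activities, for example, splits into chunks of 50, 50, and 4; the 4-activity fragment is then absorbed by its neighbor.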
CERT r6.2
CERT r6.2 includes all 29 synthetic "insider" users, with user-specific windows covering the threat periods: either July–September 2010 or January–April/October–December 2010. Post-filtering, the data are split randomly into 70% train and 30% test sets at the sub-sequence level. Training comprises 172,970 benign and 339 malicious sub-sequences; testing comprises 74,145 benign and 131 malicious sub-sequences. The positive ratio is approximately 0.196% ($339/(172{,}970+339)$).
2. Feature Representation and Activity Encoding
Each raw activity, defined by its type and timestamp, is encoded to an integer code $c \in \{1, \dots, 8\}$, labeling {Logon, Logoff, Email, HTTP, File Open, File Write, Connect, Disconnect}.
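A minimal sketch of the type-to-code mapping; the specific assignment of codes 1–8 to the listed types (in this order) is an assumption:

```python
# Integer encoding of the eight activity types (order/assignment assumed).
ACTIVITY_CODES = {
    "Logon": 1, "Logoff": 2, "Email": 3, "HTTP": 4,
    "File Open": 5, "File Write": 6, "Connect": 7, "Disconnect": 8,
}

def encode(activity_type: str) -> int:
    """Map a raw activity type to its integer code."""
    return ACTIVITY_CODES[activity_type]
```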
For every activity, a fixed-length feature vector is extracted, comprising modality-dependent entries:
- Supervisor PC access (binary)
- Assigned PC access (binary)
- After-hours elapsed time (continuous)
- Weekend access (binary)
- File events: flags for removable media transfer, operation types, file indicators, flag-word counts
- Email events: recipient counts, attachment sizes by type, flag-word counts
- HTTP events: URLs, interaction types, flag-word counts
- Device events: device connect/disconnect status
Preprocessing involves z-score normalization ($z = (x - \mu)/\sigma$) and one-hot encoding of the masked activity code for loss calculation.
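Both preprocessing steps can be sketched as follows, assuming column-wise statistics over the feature matrix and activity codes in 1–8:

```python
import numpy as np

def zscore(X: np.ndarray) -> np.ndarray:
    """Column-wise z-score normalization, z = (x - mu) / sigma."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant columns
    return (X - mu) / sigma

def one_hot(code: int, num_codes: int = 8) -> np.ndarray:
    """One-hot target for the masked activity code (codes assumed 1..8)."""
    v = np.zeros(num_codes)
    v[code - 1] = 1.0
    return v
```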
3. Graph Construction Protocols
Explicit Graph
Nodes correspond to the activity codes of a sub-session. The adjacency matrix $A$ is constructed such that:
- $A_{ij} = 1$ when $|i - j| = 1$ (bi-directional sequential connectivity) or $c_i = c_j$ (same-type edges, e.g., Email–Email); otherwise $A_{ij} = 0$.
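A minimal sketch of this explicit adjacency rule (sequential neighbors plus same-type edges), assuming binary entries and no self-loops:

```python
import numpy as np

def explicit_adjacency(codes):
    """Binary adjacency over a sub-session's activity codes:
    edge when positions are sequential neighbors (|i - j| = 1) or the
    activity types match (codes[i] == codes[j], i != j).
    Self-loops are omitted here (assumption)."""
    n = len(codes)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j and (abs(i - j) == 1 or codes[i] == codes[j]):
                A[i, j] = 1
    return A
```

By construction the matrix is symmetric, matching the bi-directional connectivity described above.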
Implicit Graph via Gumbel-Softmax
Node embeddings $e_i$ are computed from the activity feature vectors, and the unnormalized pairwise similarity between $e_i$ and $e_j$ supplies the edge logit $s_{ij}$.
The discrete, differentiable adjacency is sampled using a Gumbel logistic formulation; in the standard binary-concrete form,
$\tilde{A}_{ij} = \sigma\big((s_{ij} + \log u - \log(1-u))/\tau\big), \quad u \sim \mathrm{Uniform}(0,1),$
where $\log u - \log(1-u)$ is logistic noise (a difference of two Gumbel variates), $\tau$ is the temperature, and $\sigma$ denotes the logistic sigmoid. The paper implements this sigmoid-based variant in place of the canonical softmax.
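A sketch of the sampling step, assuming the standard logistic-noise (binary-concrete) parameterization; the paper's exact noise form is an assumption here:

```python
import numpy as np

def gumbel_sigmoid_adjacency(S, tau=1.0, rng=None):
    """Relaxed binary adjacency from pairwise logits S:
    add logistic noise L = log u - log(1 - u) (a difference of two Gumbel
    variates) and squash elementwise: A_ij = sigmoid((S_ij + L_ij) / tau).
    Lower tau pushes entries toward {0, 1}."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-8, 1 - 1e-8, size=np.shape(S))  # avoid log(0)
    noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(np.asarray(S) + noise) / tau))
```

Because the sigmoid is applied elementwise, the sample stays differentiable with respect to the logits while approximating a discrete edge decision.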
Hyperparameters comprise the sub-session size (50 activities), the merge threshold (5 activities), the embedding dimension, the GCN hidden dimension, and the number of attention heads; the Gumbel temperature is unspecified.
4. Experimental Protocols and Splitting Strategies
Train/test splits follow log chronology (r5.2) and user-specific insider threat windows (r6.2):
- CERT r5.2 training: all data from 2010; testing: January–June 2011.
- CERT r6.2: user-specific threat windows filtered; 70%/30% random split of masked sub-sequences.
Sub-sessions are assigned binary labels (benign/insider), and anomaly detection is conducted post hoc using model-derived probability scores. At test time, the predicted probability $p$ of each masked activity is thresholded:
- Anomaly label: 1 if $p < \theta$.
Threshold selection fixes a target false positive rate ($0.05$ for r5.2, $0.09$ for r6.2). For each scenario, the maximal threshold $\theta$ satisfying $\mathrm{FPR}(\theta) \leq$ target is determined.
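The FPR-targeted threshold search can be sketched as follows, assuming distinct benign scores and the low-probability-is-anomalous convention:

```python
import numpy as np

def threshold_for_fpr(benign_scores, target_fpr):
    """Maximal threshold theta such that flagging benign sub-sequences
    with predicted probability p < theta keeps FPR <= target_fpr."""
    s = np.sort(np.asarray(benign_scores))
    n = len(s)
    k = int(np.floor(target_fpr * n))  # maximum allowed false positives
    if k >= n:
        return float("inf")
    return float(s[k])  # p < s[k] flags exactly the k lowest benign scores
```

Raising the threshold any further would flag a (k+1)-th benign score and push the FPR over the target, so s[k] is the maximal admissible choice.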
5. Evaluation Metrics and Benchmark Results
The principal quantitative evaluation employs the following metrics:
- AUC (area under the ROC curve)
- Detection Rate / True Positive Rate (TPR)
- False Positive Rate (FPR)
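Given binary anomaly flags, the detection rate and FPR reduce to simple counts; a minimal sketch:

```python
import numpy as np

def detection_metrics(labels, flagged):
    """Detection rate (TPR) and FPR from binary ground truth and
    binary anomaly flags."""
    labels = np.asarray(labels, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    tpr = (flagged & labels).sum() / max(labels.sum(), 1)
    fpr = (flagged & ~labels).sum() / max((~labels).sum(), 1)
    return float(tpr), float(fpr)
```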
The table below summarizes the main results for the proposed GCN+Bi-LSTM+Attention model:
| Dataset | AUC (%) | Detection Rate (TPR) | False Positive Rate (FPR) |
|---|---|---|---|
| CERT r5.2 | 98.62 | 100% | 0.05 |
| CERT r6.2 | 88.48 | 80.15% | 0.15 |
These results demonstrate markedly stronger performance on r5.2 compared to r6.2, reflecting the greater complexity and challenge presented by r6.2’s multivariate insider behaviors.
6. Contextual Significance and Implications
The CERT r5.2 and r6.2 datasets enable controlled yet challenging evaluation of ITD models, especially those leveraging sequential and graph-structural information. The sub-session, masked-sequence, and synthetic threat window protocols facilitate fine-grained anomaly detection that simulates realistic organizational settings. These datasets are pivotal for validating models that incorporate explicit rules and implicit behavioral dependencies, as with the explicit and implicit graphs constructed in (Yumlembam et al., 20 Dec 2025).
A plausible implication is that the disparity in detection performance between r5.2 and r6.2 signifies the sensitivity of learned models to dataset complexity and the distribution of insider activities, thereby motivating ongoing enhancements in threat modeling strategies. The framework’s reliance on both explicit and implicit graphs further highlights how latent behavioral relationships—discovered via differentiable discrete adjacency—may alleviate noise inherent in hand-crafted structural representations.