CERT r5.2 and r6.2 Datasets Overview
- CERT r5.2 and r6.2 datasets are synthetically generated corpora that simulate insider threat scenarios through detailed activity logs and segmented sub-sessions.
- They enable rigorous benchmarking of ITD models by supporting evaluation of graph-based and temporal deep learning frameworks with controlled threat windows.
- The datasets’ structured preprocessing, explicit and implicit graph constructs, and tailored evaluation metrics significantly enhance anomaly detection research.
The CERT r5.2 and r6.2 datasets are synthetically generated corpora designed to facilitate research in insider threat detection (ITD). Emphasizing real-world complexity via fine-grained activity logs, these datasets underpin benchmarking and methodological studies in behavioral modeling, anomaly detection, and sequence-based forensics. The datasets have achieved prominence in evaluating advanced computational frameworks—most notably, graph-based and temporal deep learning models—as evidenced by their use in recent state-of-the-art detection systems such as the GCN and Bi-LSTM framework with explicit and implicit graph representations (Yumlembam et al., 20 Dec 2025).
1. Dataset Composition and Preprocessing
CERT r5.2
CERT r5.2 comprises log traces from 10 randomly selected users. The training and validation window spans all of 2010, while the test partition covers six months (January–June 2011). User activity is segmented into "sessions," each beginning at logon and terminating at logoff (or carried over to the next logon when no logoff event is recorded). Sessions are divided into "sub-sessions" of up to 50 activities, with smaller fragments (<5 activities) merged into their immediate neighbors.
Each sub-session yields 50 masked sequences (masking is applied sequentially to each activity in turn). The dataset statistics are: 146,721 benign and 84 malicious masked sub-sequences for training; 17,429 benign and 33 malicious sub-sequences for testing. The positive (malicious insider) ratio in training is approximately 0.057% (computed as $84/(146{,}721+84)$).
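The segmentation and masking protocol above can be sketched as follows; the constants and the merge direction (a trailing fragment folded into the preceding chunk) are illustrative assumptions, as is the `MASK_CODE` placeholder:

```python
from typing import List

SUB_SESSION_LEN = 50   # maximum activities per sub-session
MIN_FRAGMENT = 5       # fragments shorter than this are merged with a neighbor
MASK_CODE = 0          # placeholder code for the masked position (assumed)

def split_sub_sessions(session: List[int]) -> List[List[int]]:
    """Split one logon-to-logoff session into sub-sessions of up to
    SUB_SESSION_LEN activities, merging a trailing fragment (<MIN_FRAGMENT
    activities) into the preceding sub-session."""
    chunks = [session[i:i + SUB_SESSION_LEN]
              for i in range(0, len(session), SUB_SESSION_LEN)]
    if len(chunks) > 1 and len(chunks[-1]) < MIN_FRAGMENT:
        chunks[-2].extend(chunks.pop())
    return chunks

def masked_sequences(sub_session: List[int]) -> List[List[int]]:
    """One masked copy per activity: position i is replaced by MASK_CODE."""
    return [sub_session[:i] + [MASK_CODE] + sub_session[i + 1:]
            for i in range(len(sub_session))]
```

A session of 104 activities, for example, splits into chunks of 50, 50, and 4; the 4-activity fragment is then absorbed by its neighbor.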
CERT r6.2
CERT r6.2 includes all 29 synthetic "insider" users, with user-specific windows covering the threat periods: either July–September 2010 or January–April/October–December 2010. Post-filtering, the data are split randomly into 70% train and 30% test sets at the sub-sequence level. Training comprises 172,970 benign and 339 malicious sub-sequences; testing comprises 74,145 benign and 131 malicious sub-sequences. The positive ratio is approximately 0.196% ($339/(172{,}970+339)$).
2. Feature Representation and Activity Encoding
Each raw activity, defined by its type and timestamp, is encoded to an integer code $c \in \{1, \dots, 8\}$, labeling {Logon, Logoff, Email, HTTP, File Open, File Write, Connect, Disconnect}.
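A minimal sketch of the type-to-code mapping; the specific assignment of codes 1–8 to the listed types (in this order) is an assumption:

```python
# Integer encoding of the eight activity types (order/assignment assumed).
ACTIVITY_CODES = {
    "Logon": 1, "Logoff": 2, "Email": 3, "HTTP": 4,
    "File Open": 5, "File Write": 6, "Connect": 7, "Disconnect": 8,
}

def encode(activity_type: str) -> int:
    """Map a raw activity type to its integer code."""
    return ACTIVITY_CODES[activity_type]
```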
For every activity, a fixed-length feature vector is extracted, comprising modality-dependent entries:
- Supervisor PC access (binary)
- Assigned PC access (binary)
- After-hours elapsed time (continuous)
- Weekend access (binary)
- File events: flags for removable media transfer, operation types, file indicators, flag-word counts
- Email events: recipient counts, attachment sizes by type, flag-word counts
- HTTP events: URLs, interaction types, flag-word counts
- Device events: device connect/disconnect status
Preprocessing involves z-score normalization ($z = (x - \mu)/\sigma$) and one-hot encoding of the masked activity code for loss calculation.
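Both preprocessing steps can be sketched as follows, assuming column-wise statistics over the feature matrix and activity codes in 1–8:

```python
import numpy as np

def zscore(X: np.ndarray) -> np.ndarray:
    """Column-wise z-score normalization, z = (x - mu) / sigma."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant columns
    return (X - mu) / sigma

def one_hot(code: int, num_codes: int = 8) -> np.ndarray:
    """One-hot target for the masked activity code (codes assumed 1..8)."""
    v = np.zeros(num_codes)
    v[code - 1] = 1.0
    return v
```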
3. Graph Construction Protocols
Explicit Graph
Nodes correspond to the activity codes of a sub-session. The adjacency matrix $A$ is constructed such that:
- $A_{ij} = 1$ when $|i - j| = 1$ (bi-directional sequential connectivity) or $c_i = c_j$ (same-type edges, e.g., Email–Email); otherwise $A_{ij} = 0$.
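A minimal sketch of this explicit adjacency rule (sequential neighbors plus same-type edges), assuming binary entries and no self-loops:

```python
import numpy as np

def explicit_adjacency(codes):
    """Binary adjacency over a sub-session's activity codes:
    edge when positions are sequential neighbors (|i - j| = 1) or the
    activity types match (codes[i] == codes[j], i != j).
    Self-loops are omitted here (assumption)."""
    n = len(codes)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j and (abs(i - j) == 1 or codes[i] == codes[j]):
                A[i, j] = 1
    return A
```

By construction the matrix is symmetric, matching the bi-directional connectivity described above.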
Implicit Graph via Gumbel-Softmax
Node embeddings $e_i$ are computed from the activity feature vectors, and the unnormalized pairwise similarity between $e_i$ and $e_j$ supplies the edge logit $s_{ij}$.
The discrete, differentiable adjacency is sampled using a Gumbel logistic formulation; in the standard binary-concrete form,
$\tilde{A}_{ij} = \sigma\big((s_{ij} + \log u - \log(1-u))/\tau\big), \quad u \sim \mathrm{Uniform}(0,1),$
where $\log u - \log(1-u)$ is logistic noise (a difference of two Gumbel variates), $\tau$ is the temperature, and $\sigma$ denotes the logistic sigmoid. The paper implements this sigmoid-based variant in place of the canonical softmax.
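A sketch of the sampling step, assuming the standard logistic-noise (binary-concrete) parameterization; the paper's exact noise form is an assumption here:

```python
import numpy as np

def gumbel_sigmoid_adjacency(S, tau=1.0, rng=None):
    """Relaxed binary adjacency from pairwise logits S:
    add logistic noise L = log u - log(1 - u) (a difference of two Gumbel
    variates) and squash elementwise: A_ij = sigmoid((S_ij + L_ij) / tau).
    Lower tau pushes entries toward {0, 1}."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-8, 1 - 1e-8, size=np.shape(S))  # avoid log(0)
    noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(np.asarray(S) + noise) / tau))
```

Because the sigmoid is applied elementwise, the sample stays differentiable with respect to the logits while approximating a discrete edge decision.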
Hyperparameters comprise the sub-session size (50 activities), the merge threshold (5 activities), the embedding dimension, the GCN hidden dimension, and the number of attention heads; the Gumbel temperature is unspecified.
4. Experimental Protocols and Splitting Strategies
Train/test splits follow log chronology (r5.2) and user-specific insider threat windows (r6.2):
- CERT r5.2 training: all data from 2010; testing: January–June 2011.
- CERT r6.2: user-specific threat windows filtered; 70%/30% random split of masked sub-sequences.
Sub-sessions are assigned binary labels (benign/insider), and anomaly detection is conducted post hoc using model-derived probability scores. At test time, the predicted probability $p$ of each masked activity is thresholded:
- Anomaly label: 1 if $p < \theta$.
Threshold selection fixes a target false positive rate ($0.05$ for r5.2, $0.09$ for r6.2). For each scenario, the maximal threshold $\theta$ satisfying $\mathrm{FPR}(\theta) \leq$ target is determined.
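The FPR-targeted threshold search can be sketched as follows, assuming distinct benign scores and the low-probability-is-anomalous convention:

```python
import numpy as np

def threshold_for_fpr(benign_scores, target_fpr):
    """Maximal threshold theta such that flagging benign sub-sequences
    with predicted probability p < theta keeps FPR <= target_fpr."""
    s = np.sort(np.asarray(benign_scores))
    n = len(s)
    k = int(np.floor(target_fpr * n))  # maximum allowed false positives
    if k >= n:
        return float("inf")
    return float(s[k])  # p < s[k] flags exactly the k lowest benign scores
```

Raising the threshold any further would flag a (k+1)-th benign score and push the FPR over the target, so s[k] is the maximal admissible choice.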
5. Evaluation Metrics and Benchmark Results
The principal quantitative evaluation employs the following metrics:
- AUC (area under the ROC curve)
- Detection Rate / True Positive Rate (TPR)
- False Positive Rate (FPR)
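Given binary anomaly flags, the detection rate and FPR reduce to simple counts; a minimal sketch:

```python
import numpy as np

def detection_metrics(labels, flagged):
    """Detection rate (TPR) and FPR from binary ground truth and
    binary anomaly flags."""
    labels = np.asarray(labels, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    tpr = (flagged & labels).sum() / max(labels.sum(), 1)
    fpr = (flagged & ~labels).sum() / max((~labels).sum(), 1)
    return float(tpr), float(fpr)
```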
The table below summarizes the main results for the proposed GCN+Bi-LSTM+Attention model:
| Dataset | AUC (%) | Detection Rate (TPR) | False Positive Rate (FPR) |
|---|---|---|---|
| CERT r5.2 | 98.62 | 100% | 0.05 |
| CERT r6.2 | 88.48 | 80.15% | 0.15 |
These results demonstrate markedly stronger performance on r5.2 compared to r6.2, reflecting the greater complexity and challenge presented by r6.2’s multivariate insider behaviors.
6. Contextual Significance and Implications
The CERT r5.2 and r6.2 datasets enable controlled yet challenging evaluation of ITD models, especially those leveraging sequential and graph-structural information. The sub-session, masked-sequence, and synthetic threat window protocols facilitate fine-grained anomaly detection that simulates realistic organizational settings. These datasets are pivotal for validating models that incorporate explicit rules and implicit behavioral dependencies, as with the explicit and implicit graphs constructed in (Yumlembam et al., 20 Dec 2025).
A plausible implication is that the disparity in detection performance between r5.2 and r6.2 signifies the sensitivity of learned models to dataset complexity and the distribution of insider activities, thereby motivating ongoing enhancements in threat modeling strategies. The framework’s reliance on both explicit and implicit graphs further highlights how latent behavioral relationships—discovered via differentiable discrete adjacency—may alleviate noise inherent in hand-crafted structural representations.