
CERT Insider Threat Dataset Overview

Updated 17 January 2026
  • CERT Insider Threat Dataset is a synthetic benchmark produced by CMU’s CERT Division that simulates comprehensive enterprise logs with annotated insider threat scenarios.
  • It offers multi-modal log streams—such as logon, file, email, device, and HTTP events—enabling feature engineering via frequency vectors, one-hot encoding, and graph-based models.
  • The dataset supports diverse detection methods including CNNs, Transformers, and autoencoders while challenging models with extreme class imbalance and synthetic data limitations.

The CERT Insider Threat Dataset is a synthetic insider-threat benchmark published by Carnegie Mellon University’s CERT Division to support empirical evaluation and methodological innovation in behavioral anomaly detection, scenario analysis, and adversarial classification within enterprise environments. It simulates organizational activity logs for thousands of employees over multi-month intervals and injects ground-truth insider threat scenarios (including sabotage, data exfiltration, and espionage). The corpus has evolved through several releases (notably r4.2, r5.2, and r6.2), each characterized by specific user counts, log modalities, threat narratives, and schema granularity. Because the data are simulated rather than drawn from privacy-constrained production environments, the dataset enables rigorous machine learning and statistical analysis of both individual and collective user behaviors under complex, highly imbalanced class conditions.

1. Dataset Structure and Scope

CERT Insider Threat releases (r4.2, r5.2, r6.2), documented across numerous studies (G et al., 2019, Elbasheer et al., 30 Jun 2025, Pantelidis et al., 2021, Tuor et al., 2017, Noever, 2019), simulate user activity over 17–18 months for 1,000–4,000 distinct employees. Principal log streams include:

  • logon.csv: Session-level logon/logoff events, with categorical session type, per-user, per-computer, with timestamp granularity.
  • file.csv: File operations (read/write/copy/delete, indicated by multi-bit flags), optionally tagged for removable media access and decoy file interaction.
  • device.csv: Connect/disconnect external-device events, mainly USB.
  • email.csv: Email send/receive records, including sender/recipient meta-data, activity type, attachment and content presence.
  • http.csv: Web browsing (URL, category, bytes transferred).
  • Ancillary logs: psychometric assessments, user LDAP/group membership, monthly employment rosters, explicit adversary narrative metadata (Pantelidis et al., 2021, Noever, 2019).
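A minimal loading sketch for these streams with pandas (the column names follow the r4.2 schema but should be verified against the downloaded release; the inline CSV strings stand in for the real `logon.csv` and `device.csv` files):

```python
# Load two of the CERT log streams and merge them into one
# chronologically sorted multi-modal event table.
import io

import pandas as pd

# Stand-in CSV content; in practice use pd.read_csv("logon.csv") etc.
logon_csv = io.StringIO(
    "id,date,user,pc,activity\n"
    "{X1},01/02/2010 07:55:00,AAM0658,PC-1234,Logon\n"
    "{X2},01/02/2010 17:02:00,AAM0658,PC-1234,Logoff\n"
)
device_csv = io.StringIO(
    "id,date,user,pc,activity\n"
    "{X3},01/02/2010 09:10:00,AAM0658,PC-1234,Connect\n"
)

logon = pd.read_csv(logon_csv, parse_dates=["date"])
device = pd.read_csv(device_csv, parse_dates=["date"])
logon["stream"] = "logon"
device["stream"] = "device"

# One chronologically sorted multi-modal event table.
events = (
    pd.concat([logon, device], ignore_index=True)
    .sort_values("date")
    .reset_index(drop=True)
)
print(events[["date", "user", "stream", "activity"]])
```

Keeping one interleaved event table rather than per-stream tables simplifies the per-user/day aggregation performed downstream.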

The scale is substantial. r4.2 comprises 3,320,452 log lines over 501 days for 1,000 users (G et al., 2019). r6.2 totals 135.1 million events for 4,000 users over 516 weekdays (Tuor et al., 2017, Paul et al., 2020).

Malicious events (scenarios) are rare: e.g., 30 of 1,000 users in r4.2 are assigned malicious narratives, yielding user-day anomaly rates of 0.4 %–3 % and event-wise anomaly rates as low as 10⁻⁶ (G et al., 2019, Garchery et al., 2020). Ground truth is line-level or day-level, annotated within “answer” files for each scenario.
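Day-level label assignment from an answer file can be sketched as follows (the answer-file layout here, with `user` and `day` columns, is a simplified assumption for illustration; actual releases enumerate the malicious raw-log lines per scenario):

```python
# Join a set of known-malicious user-days onto the aggregated
# user-day table, producing binary labels for supervised training.
import pandas as pd

# Hypothetical malicious user-days extracted from an answer file.
answers = pd.DataFrame(
    {"user": ["AAM0658"], "day": [pd.Timestamp("2010-01-02")]}
)

# All aggregated user-days produced from the raw logs.
user_days = pd.DataFrame(
    {
        "user": ["AAM0658", "AAM0658", "BBB0001"],
        "day": pd.to_datetime(["2010-01-02", "2010-01-03", "2010-01-02"]),
    }
)

# Flag a user-day as malicious (1) when it appears in the answer set.
labeled = user_days.merge(
    answers.assign(label=1), on=["user", "day"], how="left"
).fillna({"label": 0})
labeled["label"] = labeled["label"].astype(int)
print(labeled)
```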

2. Feature Engineering and Aggregation

Raw log events are aggregated into structured feature representations prior to modeling.

  • Frequency-based vectors: Aggregation of per-user/day activity frequencies across L1–L9 (logon), E1–E5 (email), F1 (file), D1–D3 (device), and H1–H3 (http), yielding 20-dimensional daily vectors (G et al., 2019).
  • Categorical and count features: One-hot encoding of activity types yields, e.g., 50-dimensional binary vectors matching the autoencoder input layer (Pantelidis et al., 2021). Aggregated session/day counts are preferred for class balance and computational tractability (Sarraf, 10 Jan 2026).
  • Temporal and contextual statistics: z-score logon-time abnormality; probability of USB usage; employment status flags; psychometric clustering; z-normalization across user-months (Hall et al., 2019, Noever, 2019).
  • Graph-based edge-attribute sequences: Modeling inter-entity (user→device, user→domain, sender→recipient) directed edges with time-sinusoidal encodings and embedding-bag pooling for recipients/text (Garchery et al., 2020).
  • Hybrid multi-modal features: Sentiment scores via AFINN; domain and keyword TF–IDF; event-types mapped to risk taxonomies (Noever, 2019).

Normalizations employ min–max scaling, z-score, and, in advanced pipelines, principal components analysis (PCA) for dimensionality reduction (capturing >95 % variance) (Sarraf, 10 Jan 2026).
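A sketch of the normalization-plus-PCA step with scikit-learn, on synthetic data: passing a float in (0, 1) as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction.

```python
# Min-max scale the features, then reduce dimensionality with PCA
# while retaining at least 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 3] = X[:, 0] * 2.0          # redundant feature PCA can compress away

X_scaled = MinMaxScaler().fit_transform(X)
pca = PCA(n_components=0.95)     # keep components until >=95% variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```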

3. Labeling Strategy and Class Imbalance

CERT datasets provide scenario-derived ground truth with explicit mapping of user-days/events to malicious/benign labels.

  • User/day labels: Each record, after aggregation, is flagged “malicious” or “non-malicious” per scenario narrative (G et al., 2019, Hall et al., 2019). r5.2 and r6.2 extend this to five threat categories (WikiLeaks exfiltration, career jump, admin betrayal, lateral movement, Dropbox upload) (Rastogi et al., 2021, Paul et al., 2020, Garchery et al., 2020).
  • Event-level anomaly rates: In r6.2, the overall anomalous event rate is ≈10⁻⁶; on user-days, ≈10⁻⁵ (Garchery et al., 2020).
  • Sampling and balancing tactics: To address severe imbalance (e.g., a 241:1 non-malicious-to-malicious ratio in r4.2), methods include random under-sampling, synthetic minority oversampling (SMOTE, k = 3 neighbors), and spreading negative samples to ensure sufficient positive coverage (G et al., 2019, Sarraf, 10 Jan 2026, Hall et al., 2019).
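A hand-rolled illustration of the SMOTE-style interpolation with k = 3 neighbors (a simplified stand-in written for clarity; production pipelines would use imbalanced-learn's `SMOTE`):

```python
# Each synthetic minority sample is a random interpolation between a
# minority point and one of its k nearest minority neighbours.
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of X_min[i], excluding itself
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]
        j = rng.choice(nbrs)
        lam = rng.random()
        # point on the segment between X_min[i] and X_min[j]
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=5)
print(X_new.shape)  # (5, 2)
```

Because each synthetic point lies on a segment between two minority points, oversampling stays inside the minority class's convex hull rather than inventing arbitrary feature values.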

4. Modeling Methodologies

CERT datasets have been the basis for diverse modeling architectures.

  • Image-based CNN classification: Aggregated 20-dim daily vectors encoded as 32×32 grayscale images; transfer learning with VGG16, MobileNet, ResNet; binary softmax classification (G et al., 2019). Test-set precision/recall exceeds 99 %.
  • User-based sequence modeling (UBS): Session-level vectors (35 features) for 501 days×9 sessions/user (shape [4509,F]); fed to multi-layer Transformer encoder (6 layers, 512 d_model, attention heads=8) with MSE reconstruction loss (Elbasheer et al., 30 Jun 2025). Reconstruction errors are scored via OCSVM, LOF, iForest, achieving recall ≈ 99.43 % and AUROC ≈ 95 %.
  • Deep Autoencoder/Variational AE: One-hot encoded binary vectors (e.g., 50-dim, no scaling beyond one-hot) used for unsupervised outlier detection (Pantelidis et al., 2021). VAE outperforms AE in detection accuracy.
  • LSTM and sequence workflows: Per-user/control-flow modeled as sequences of log-keys, categorized; prediction on next-token, anomaly detection via MSE and deviation from expected sequence (Rastogi et al., 2021).
  • Attributed graph edges (ADSAGE): Per-event embeddings (source, target, time, numeric/text), edge-level anomaly detection via FFNN or RNN sequence scoring; recall-at-budget evaluation (Garchery et al., 2020).
  • Meta-ensemble classifiers: Stack-ensemble (ANN, NB, SVM, RF, LR, AdaBoost) with probability voting, confusion-matrix analysis, and ROC metrics (AUC ≈ 0.988) (Hall et al., 2019, Sarraf, 10 Jan 2026, Noever, 2019).
  • LLM log analysis (RedChronos): Textual prompt construction from session events; LLM fusion by weighted voting (weights computed from test-set precision, detection rate, accuracy, FPR); semantic prompt evolution via genetic algorithm with LLM mutation (Li et al., 4 Mar 2025). Automated LLM dispatch achieves precision 0.933, detection rate 0.987, FPR 0.022, and accuracy 0.979 on r4.2.
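The two-stage scoring idea in the UBS-style pipeline (reconstruction error first, then an outlier detector over the errors) can be illustrated with scikit-learn's Isolation Forest; the error values below are synthetic stand-ins for real per-day reconstruction errors:

```python
# Score per-day reconstruction errors with an Isolation Forest and
# flag the most isolated days as anomalous.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Simulated errors: mostly small, with three large values (anomalies).
errors = np.concatenate([rng.normal(0.05, 0.01, 500), [0.9, 0.8, 0.95]])
X = errors.reshape(-1, 1)

iforest = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = iforest.predict(X)          # -1 = anomaly, +1 = normal
print(np.where(flags == -1)[0])     # indices flagged anomalous
```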

5. Evaluation Protocols and Performance Metrics

Evaluation protocols in the literature commonly include:

  • Hold-out and cross-validation splits: Common splits are 75/25 %, 80/20 %, or stratified/temporal (e.g., train=days 1–418, test=419–516) (G et al., 2019, Tuor et al., 2017, Sarraf, 10 Jan 2026).
  • Recall at budget (CR-k): Popular for operational deployments; e.g., LSTM-Diag and ADSAGE recover ≈90 % of threats by reviewing top 250–400 user-days (96 % recall at top 1,000, near 99 % at 4,000) (Tuor et al., 2017, Garchery et al., 2020).
  • Confusion matrix, ROC, and AUC: Meta-ensemble classifier on r4.2 in July achieves accuracy 96.2 %, TPR 0.78, FPR 0.03, AUC 0.988 (Hall et al., 2019).
  • False-positive and false-negative rates: Model comparison (UBS-Transformer+iForest, OCSVM) achieves FPR <6 % and recall >99 %; CNN image-based approach yields precision=recall=99.32 % (Elbasheer et al., 30 Jun 2025, G et al., 2019).
  • Percentiles and score decomposition: In r6.2, flagged threat user-days fall, on average, at the 95.53rd percentile of anomaly scores (Tuor et al., 2017).
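Recall at budget can be computed in a few lines (the scores and labels below are synthetic, and `recall_at_budget` is an illustrative helper, not a library function):

```python
# Rank user-days by anomaly score and measure how many true threats
# fall within the top k an analyst can afford to review.
import numpy as np

def recall_at_budget(scores, labels, k):
    order = np.argsort(scores)[::-1]        # highest score first
    top_k = order[:k]
    return labels[top_k].sum() / labels.sum()

rng = np.random.default_rng(0)
scores = rng.random(1000)
labels = np.zeros(1000, dtype=int)
threats = [3, 77, 512]
labels[threats] = 1
scores[threats] += 1.0                      # threats score high here

print(recall_at_budget(scores, labels, k=10))  # → 1.0
```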

6. Dataset Limitations and Challenges

  • Synthetic nature: All data are simulated; thus the diversity of threat types, concept drift, background user behavior, and timing structures may lack full real-world representativeness (Pantelidis et al., 2021, Sarraf, 10 Jan 2026).
  • Extreme class imbalance: Typical anomaly rates demand aggressive sampling, synthetic oversampling (SMOTE), and cost-sensitive modeling (G et al., 2019, Sarraf, 10 Jan 2026, Noever, 2019).
  • Feature and dimensionality complexity: Releases present up to 830 features, requiring dimensionality reduction (PCA) and careful normalization to avoid the curse of dimensionality (Sarraf, 10 Jan 2026).
  • Labeling coarseness: r4.2/r5.2/r6.2 scenario “answer” files provide user/day-level or event-level labels, but no more nuanced or fuzzy threat scores.
  • Operational challenges: Deployment in live Security Operation Centers necessitates prompt evolution, query-aware voting, and continual adaptation to organizational TTP drift (Li et al., 4 Mar 2025).

7. Implications for Research and Practice

The CERT Insider Threat Dataset serves as the canonical benchmark for algorithmic, statistical, and deep learning approaches in insider threat detection. It is referenced across detection paradigms (unsupervised anomaly, supervised classification, graph sequence modeling, LLM-based log analysis) and forms the basis for comparative performance evaluation. Its synthetic structure, highly imbalanced class distribution, and availability of fine-grained logs allow researchers to prototype, validate, and extend state-of-the-art methods prior to real-world implementation. However, a plausible implication is that external validity in actual SOCs may be limited by the assumptions and scenario coverage of the synthetic corpus—a recurrent observation in the literature (Pantelidis et al., 2021, Li et al., 4 Mar 2025, Sarraf, 10 Jan 2026).

The dataset’s influence on methodology (sampling, label assignment, evaluation metric selection, feature engineering) is substantial, as shown by the wide spectrum of models validated against it, from transfer-learned CNNs (G et al., 2019) to Transformer-based sequential anomaly scoring (Elbasheer et al., 30 Jun 2025), autoencoders (Pantelidis et al., 2021), meta-classifiers (Hall et al., 2019), community-aware LSTM networks (Paul et al., 2020), and edge-level FFNNs (Garchery et al., 2020). This suggests that algorithmic fusion, continuous model adaptation, and interpretability-enhanced approaches will remain central to next-generation insider threat research anchored in the CERT corpus.
