SecureBERT: Cybersecurity Transformer Models

Updated 30 March 2026

SecureBERT is a series of encoder-only transformer models designed specifically for cybersecurity, featuring domain-focused pre-training and specialized tokenization.
Later variants (SecureBERT 2.0) adopt ModernBERT’s local/global attention and long-context processing, and the family has also been adapted to tasks such as intrusion detection.
Reported results show gains over baseline models on NER, semantic search, and classification benchmarks, which makes the family relevant to cyber threat intelligence and vulnerability analysis.

SecureBERT refers to a series of encoder-only, transformer-based LLMs specifically designed and pre-trained for cybersecurity applications, notably cyber threat intelligence (CTI), intrusion detection, and the analysis of technical cybersecurity and code corpora. The SecureBERT family comprises several generations, each extending BERT-style architectures for advanced domain adaptation, robust tokenization of security-specific vocabulary, and specialized downstream functionality. SecureBERT models are distinguished by their domain-focused pre-training regimes, innovative architectural enhancements, and demonstrated superiority in both language understanding and intrusion detection across multiple cybersecurity-relevant benchmarks (Aghaei et al., 2022, Bayer et al., 2022, Li et al., 2023, Aghaei et al., 30 Sep 2025).

1. Architectural Design and Domain Adaptation

SecureBERT adapts the RoBERTa-base architecture—twelve transformer encoder layers, hidden size 768, twelve self-attention heads—for domain-specific representation. Later variants (SecureBERT 2.0) replace vanilla transformer layers with ModernBERT, which implements extended local/global attention patterns and hierarchical encoding to permit long-context understanding for documents up to 1024 tokens (Aghaei et al., 30 Sep 2025).
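To make the stated dimensions concrete, the following minimal sketch estimates the parameter count of a RoBERTa-base-style encoder. The layer count, hidden size, head count, and vocabulary size come from the text; the feed-forward width (3072), the position-table size (514), and the single token-type embedding are assumptions taken from the standard RoBERTa-base configuration, not from the SecureBERT papers. The pooler and language-model head are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    num_layers: int = 12           # from the text
    hidden_size: int = 768         # from the text
    num_heads: int = 12            # from the text
    vocab_size: int = 50_265       # from the text (upper bound on vocabulary)
    intermediate_size: int = 3072  # assumption: standard RoBERTa-base value
    max_positions: int = 514       # assumption: standard RoBERTa-base value


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def layer_params(cfg: EncoderConfig) -> int:
    h, f = cfg.hidden_size, cfg.intermediate_size
    attention = 4 * linear_params(h, h)  # query, key, value, output projections
    feed_forward = linear_params(h, f) + linear_params(f, h)
    layer_norms = 2 * (2 * h)            # two LayerNorms, weight and bias each
    return attention + feed_forward + layer_norms


def embedding_params(cfg: EncoderConfig) -> int:
    h = cfg.hidden_size
    token = cfg.vocab_size * h
    position = cfg.max_positions * h
    token_type = 1 * h                   # assumption: one segment type
    layer_norm = 2 * h
    return token + position + token_type + layer_norm


def encoder_params(cfg: EncoderConfig) -> int:
    return embedding_params(cfg) + cfg.num_layers * layer_params(cfg)


cfg = EncoderConfig()
head_dim = cfg.hidden_size // cfg.num_heads  # per-head width
print(head_dim, layer_params(cfg), encoder_params(cfg))
```

Under these assumptions the embedding table accounts for roughly a third of the encoder's parameters, which is one reason vocabulary design matters for a model of this size.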

Pre-training relies on a large, heterogeneous corpus comprising cybersecurity reports, vulnerability databases, threat advisories, blogs, and in SecureBERT 2.0, extensive code corpora. Tokenization uses bespoke BPE or hybrid WordPiece strategies, preserving high-frequency cybersecurity terms as atomic tokens and mapping identifiers, operators, and structural indicators from code in a unified vocabulary of up to 50,265 tokens (Aghaei et al., 2022, Aghaei et al., 30 Sep 2025).
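The effect of keeping security terms atomic can be illustrated with a toy tokenizer. This is a minimal sketch, not the SecureBERT tokenizer: it uses a greedy longest-match split (in the spirit of WordPiece, without the `##` continuation prefix), and both vocabularies below are hypothetical.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: the word is unknown.
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabularies for illustration only.
GENERAL_VOCAB = {"ran", "som", "ware", "mal", "exp", "loit"}
DOMAIN_VOCAB = GENERAL_VOCAB | {"ransomware", "malware", "exploit"}

for term in ["ransomware", "malware", "exploit"]:
    print(term, greedy_subword_split(term, GENERAL_VOCAB),
          greedy_subword_split(term, DOMAIN_VOCAB))
```

With the general vocabulary each term is broken into fragments that carry little security meaning; with the domain vocabulary each term maps to a single token and therefore a single embedding.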

Domain specificity is achieved by training on up to 13.6 billion domain tokens and over 50 million code tokens. SecureBERT’s embedding initialization includes explicit Gaussian noise injection for domain robustness and to mitigate overfitting on relatively small, in-domain corpora (Aghaei et al., 2022, Li et al., 2023).
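Gaussian noise injection can be sketched as adding zero-mean noise to an existing weight matrix before continued training. The text does not give the noise scale or the exact weights it is applied to, so `sigma`, the seed, and the toy matrix below are assumptions for illustration.

```python
import random


def add_gaussian_noise(weights, sigma, seed=0):
    """Return a copy of `weights` with N(0, sigma^2) noise added to each entry."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]


# Toy 2 x 3 "embedding" matrix; values are illustrative only.
embeddings = [[0.10, -0.20, 0.30],
              [0.05, 0.00, -0.15]]

noisy = add_gaussian_noise(embeddings, sigma=0.01)  # sigma is a hypothetical value
unchanged = add_gaussian_noise(embeddings, sigma=0.0)
print(noisy)
```

A small `sigma` perturbs the starting point without discarding what the weights already encode, which is the intuition behind using noise as a regularizer on small in-domain corpora.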

2. Pre-training Objectives and Fine-tuning Protocols

The fundamental pre-training objective is Masked Language Modeling (MLM), with masking tailored to prioritize nouns or verbs in text, and contiguous identifiers or operators in code (Aghaei et al., 2022, Aghaei et al., 30 Sep 2025). The loss function for MLM is:

$\mathcal{L}_{\mathrm{MLM}} = -\mathbb{E}_{x\sim D}\sum_{i\in M(x)}\log P_\theta(x_i \mid x_{\setminus M(x)})$

where $M(x)$ is the set of masked positions.
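As a worked instance of this loss for a single sequence, the sketch below sums the negative log-probabilities that a model assigns to the true tokens at the masked positions. The sentence, the masked positions, and the probabilities are hypothetical values chosen for easy arithmetic.

```python
import math


def mlm_loss(true_token_probs):
    """Negative log-likelihood summed over the masked positions of one sequence."""
    return -sum(math.log(p) for p in true_token_probs)


tokens = ["attackers", "exploit", "the", "vulnerability"]
masked_positions = [1, 3]  # M(x): a verb and a noun are masked

# Hypothetical model probabilities for the true token at each masked position.
true_token_prob = {1: 0.5, 3: 0.25}

loss = mlm_loss([true_token_prob[i] for i in masked_positions])
print(loss)  # equals ln(8), since -(ln 0.5 + ln 0.25) = ln 8
```

The expectation over $x\sim D$ in the formula corresponds to averaging this per-sequence quantity over training batches.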

Fine-tuning either updates all parameters (standard practice) or, where efficient adaptation is required, uses parameter-efficient adapters such as Low-Rank Adaptation (LoRA). LoRA expresses the adapted attention weights as $W = W_0 + UV$ with $U\in\mathbb{R}^{d\times r}$, $V\in\mathbb{R}^{r\times d}$, and $r\ll d$, where $W_0$ stays frozen and only $U$ and $V$ are trained. This adds $\sim$0.57% parameter overhead for a 7B-parameter base but enables rapid adaptation and head swapping for task transfer (Li et al., 2023).
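The low-rank update can be checked on a toy example. The sketch below uses plain Python lists with $d=3$ and $r=1$; all matrix values are hypothetical. It shows that applying the frozen weight and the adapter separately gives the same output as the merged weight $W_0 + UV$, and it counts the extra parameters of one adapted $d\times d$ matrix.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(x, w0, u, v):
    """Compute x W0 + (x U) V without ever forming the d x d update."""
    return matadd(matmul(x, w0), matmul(matmul(x, u), v))


def lora_extra_params(d, r):
    """Trainable parameters added for one d x d weight: U is d x r, V is r x d."""
    return 2 * d * r


# Toy values (d = 3, r = 1), chosen so the arithmetic is exact.
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weight
u = [[1.0], [0.0], [2.0]]                                  # trainable, d x r
v = [[0.5, 0.0, -1.0]]                                     # trainable, r x d
x = [[1.0, 2.0, 3.0]]                                      # one input row

adapter_output = lora_forward(x, w0, u, v)
merged_output = matmul(x, matadd(w0, matmul(u, v)))
print(adapter_output, merged_output)
print(lora_extra_params(768, 8), 768 * 768)
```

For a single 768 x 768 matrix with a hypothetical rank of 8, the adapter adds 12,288 parameters against 589,824 frozen ones. The overall overhead reported in the text depends on which matrices are adapted and on the rank used, neither of which this sketch tries to reproduce.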

3. Evaluation Tasks and Empirical Results

SecureBERT and its successors are evaluated on masked token prediction, domain-specific NER, classification, semantic search, and intrusion detection tasks:

Masked Language Modeling: In the cybersecurity domain, SecureBERT achieves superior top-1 accuracy for masked nouns (35% vs. 22% for RoBERTa-base and 28% for SciBERT), and for verbs (similar gap), with top-5 and top-10 accuracy consistently higher (Aghaei et al., 2022).
Named Entity Recognition (NER): Fine-tuned SecureBERT attains the highest F1 scores (86.65%) among tested models (RoBERTa-base: 86.2%, SciBERT: 84.49%). SecureBERT 2.0 improves further (F1=0.9450), with balanced precision and recall on CyNER (Aghaei et al., 30 Sep 2025).
Document Semantic Search: Bi- and cross-encoder versions of SecureBERT 2.0 achieve Recall@1 of 88.7–88.8% and mAP of 92.1–92.4%, outperforming AttackBERT (R@1=74.4%) and all-MiniLM-L6-v2 (R@1=77.9%) (Aghaei et al., 30 Sep 2025).
Classification: On security alert classification, SecureBERT is superior (F1=0.8883, MS Exchange CTI F1=0.8869 vs. BERT and CyBERT baselines) (Bayer et al., 2022).
Intrusion Detection (CAN-SecureBERT): On the CAN intrusion classification task, CAN-SecureBERT achieves test balanced accuracy and F1 of 0.999991, precision/recall of 0.999991, and a false alarm rate of $3.5\times 10^{-6}$, reducing FAR by roughly 170x compared to MTH-IDS (Li et al., 2023).

Model	Balanced accuracy (BA)	Precision (PREC)	Detection rate (DR)	False alarm rate (FAR)	F1
MTH-IDS	0.999990	—	0.999990	$6.00 \times 10^{-4}$	0.999990
CAN-SecureBERT (10%)	0.999991	0.999991	0.999991	$3.50 \times 10^{-6}$	0.999991
CAN-LLAMA2 (10%)	0.999993	0.999993	0.999993	[…]	0.999993
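The columns in this table can be related to confusion-matrix counts as in the sketch below. It uses the standard binary definitions, which is an assumption: the cited work evaluates several attack classes and may aggregate differently. The counts are hypothetical; only the two FAR values in the last lines come from the table.

```python
def detection_metrics(tp, fp, tn, fn):
    """Binary intrusion-detection metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    detection_rate = tp / (tp + fn)    # recall on the attack class
    false_alarm_rate = fp / (fp + tn)  # share of normal traffic flagged
    specificity = tn / (tn + fp)
    balanced_accuracy = (detection_rate + specificity) / 2
    f1 = 2 * precision * detection_rate / (precision + detection_rate)
    return {
        "BA": balanced_accuracy,
        "PREC": precision,
        "DR": detection_rate,
        "FAR": false_alarm_rate,
        "F1": f1,
    }


# Hypothetical counts, chosen for readable arithmetic.
m = detection_metrics(tp=90, fp=5, tn=95, fn=10)
print(m)

# FAR reduction implied by the table: MTH-IDS vs. CAN-SecureBERT.
far_reduction = 6.00e-4 / 3.50e-6
print(round(far_reduction, 1))
```

The ratio comes out at about 171, consistent with the roughly 170x reduction quoted in the text.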

4. Datasets and Preprocessing

Corpora for SecureBERT pre-training aggregate threat reports (MITRE, NIST), blogs, vulnerability databases (NVD), arXiv cryptography/security papers, security-relevant Twitter streams, cybersecurity dialogues, and annotated code vulnerability samples (Aghaei et al., 2022, Bayer et al., 2022, Aghaei et al., 30 Sep 2025). The CAN-SecureBERT model adapts SecureBERT to raw CAN message streams from the Car Hacking Dataset (Hyundai YF Sonata) under diverse attack types, employing minimal preprocessing aside from tokenization (Li et al., 2023).

For CAN intrusion, dataset balancing uses 1% or 10% of training points, with normal:attack ratios of 1:10, and evaluation on five classes (Normal, DoS, Fuzzy, RPM spoof, Gear spoof).

Attack Type	Total	Normal	Injected
DoS	3,665,771	3,078,250	587,521
Fuzzy	3,838,860	3,347,013	491,847
Spoofing Gear	4,443,142	3,845,890	597,252
Spoofing RPM	4,621,702	3,966,805	654,897
Normal	988,987	988,872	NA
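A quick consistency check on the four attack captures in this table: normal and injected counts should add up to the total, and the injected share shows how imbalanced each capture is. The counts are copied from the table; the attack-free row is left out because it has no injected count.

```python
# (total, normal, injected) message counts per attack capture, from the table.
CAR_HACKING_COUNTS = {
    "DoS": (3_665_771, 3_078_250, 587_521),
    "Fuzzy": (3_838_860, 3_347_013, 491_847),
    "Spoofing Gear": (4_443_142, 3_845_890, 597_252),
    "Spoofing RPM": (4_621_702, 3_966_805, 654_897),
}


def injected_share(total, injected):
    """Fraction of messages in a capture that are attack injections."""
    return injected / total


consistent = {name: normal + injected == total
              for name, (total, normal, injected) in CAR_HACKING_COUNTS.items()}
shares = {name: injected_share(total, injected)
          for name, (total, normal, injected) in CAR_HACKING_COUNTS.items()}

for name in CAR_HACKING_COUNTS:
    print(f"{name}: consistent={consistent[name]}, injected share={shares[name]:.3f}")
```

In every capture the injected messages are a minority of the traffic, which is why some form of class balancing is applied before training.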

5. Model Implementation, Hyperparameters, and Practical Integration

SecureBERT models employ AdamW optimization (learning rates from […] to […]), batch sizes from 4 to 64, and weight decay of 0.01. Training hardware includes multi-GPU setups (NVIDIA V100, RTX 3090) (Aghaei et al., 2022, Li et al., 2023). Open-source codebases and datasets are maintained under permissive licensing, facilitating adoption in enterprise SOCs and research.
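For readers unfamiliar with AdamW, the sketch below performs one update on a single scalar parameter with decoupled weight decay. Only the weight decay of 0.01 comes from the text; the learning rate, the beta and epsilon values, and the starting parameter and gradient are hypothetical and chosen so the arithmetic is easy to follow.

```python
import math


def adamw_step(theta, grad, m, v, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * weight_decay * theta  # decay applied directly to the weight
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Hypothetical values: lr = 0.1 is far larger than typical fine-tuning rates.
theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1, lr=0.1)
print(theta)
```

The decay term shrinks the weight independently of the gradient statistics, which is what distinguishes AdamW from Adam with an L2 penalty added to the loss.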

SecureBERT 2.0’s architecture and robust pretraining have enabled integration into incident triage systems (high-throughput semantic retrieval), vulnerability assessment pipelines (automated code scoring), NER-driven report analysis for SIEM/SOAR, and retrieval-augmented generation for LLMs (Aghaei et al., 30 Sep 2025).
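The retrieval use case can be sketched as ranking documents by cosine similarity between embedding vectors and scoring the ranking with Recall@1. The three-dimensional vectors and document names below are hypothetical stand-ins for encoder outputs; no SecureBERT model is loaded.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Document ids sorted from most to least similar to the query."""
    return sorted(doc_vecs,
                  key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                  reverse=True)


def recall_at_1(queries, doc_vecs):
    """Fraction of queries whose top-ranked document is the relevant one."""
    hits = sum(1 for vec, relevant_id in queries
               if rank_documents(vec, doc_vecs)[0] == relevant_id)
    return hits / len(queries)


# Hypothetical document and query embeddings.
DOCS = {
    "phishing_report": [1.0, 0.0, 0.0],
    "ransomware_advisory": [0.0, 1.0, 0.0],
    "cve_writeup": [0.0, 0.0, 1.0],
}
QUERIES = [
    ([0.9, 0.1, 0.0], "phishing_report"),
    ([0.1, 0.8, 0.1], "ransomware_advisory"),
    ([0.6, 0.0, 0.4], "cve_writeup"),  # closest to the wrong document
]

score = recall_at_1(QUERIES, DOCS)
print(score)
```

Two of the three toy queries retrieve their relevant document first, so Recall@1 is 2/3. In a bi-encoder setup the document vectors can be computed once and stored, which is what makes high-throughput retrieval practical.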

6. Limitations and Prospects

SecureBERT demonstrates leading performance, but the larger models in these comparisons are computationally expensive (e.g., CAN-LLAMA2 processes 14 messages/s vs. 900 messages/s for SecureBERT), and long-context or multi-modal tasks require substantial VRAM. No statistical confidence intervals are reported for the metrics. A plausible implication is that model compression and quantization need further work before sub-millisecond inference is feasible in real-time environments.
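The throughput figures can be converted into per-message latency with a short calculation. This sketch assumes messages are processed one at a time with no batching or pipelining, which the text does not specify.

```python
def latency_ms(messages_per_second):
    """Average time per message in milliseconds, assuming sequential processing."""
    return 1000.0 / messages_per_second


secure_bert_ms = latency_ms(900)  # throughput from the text
can_llama2_ms = latency_ms(14)    # throughput from the text
speed_ratio = 900 / 14

print(round(secure_bert_ms, 2), round(can_llama2_ms, 2), round(speed_ratio, 1))
```

Under that assumption SecureBERT needs about 1.1 ms per message and CAN-LLAMA2 about 71 ms, so even the smaller model sits just above a sub-millisecond budget, which corresponds to more than 1,000 messages/s.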

Future research directions explicitly outlined include compression via quantization and pruning, hybrid adapter–bit-pruning pipelines, and extension to multi-modal ICS sources (e.g., combining CAN signals with ECU voltage traces). The modularity in SecureBERT 2.0 (adapters, code-aware objectives, retrieval heads) supports continued adaptation for new cyber defense paradigms (Li et al., 2023, Aghaei et al., 30 Sep 2025).

SecureBERT has been critically compared to other domain-adapted transformers, such as CySecBERT and CyBERT. Empirical evaluation indicates that SecureBERT’s more comprehensive dataset construction, tokenizer customization, and noise-regularized adaptation yield marked improvements in intrinsic embedding quality, clustering, NER, and domain classification—with only minor general-language performance tradeoffs (SuperGLUE Δ=–0.054) (Bayer et al., 2022). The evolution from SecureBERT to SecureBERT 2.0 further strengthens semantic search (R@1 +15pp), NER (F1 +0.21), and code vulnerability analysis (accuracy +0.028) relative to both prior SecureBERT and code-oriented baselines (Aghaei et al., 30 Sep 2025).