BERTector: Scalable IDS with BERT & LoRA
- BERTector is a scalable intrusion detection system that uses a traffic-aware NSS-Tokenizer to convert complex network flows into unified token sequences.
- It employs joint-dataset supervised fine-tuning on multiple benchmark datasets to improve cross-domain generalization and robust attack detection.
- By integrating LoRA, the framework achieves efficient parameter updates, significantly reducing trainable parameters while maintaining high accuracy under adversarial conditions.
BERTector is a scalable intrusion detection system (IDS) framework designed to address generalization and robustness challenges posed by heterogeneous network traffic and diverse attack patterns. Building on the BERT architecture, BERTector introduces three core innovations: the NSS-Tokenizer for protocol-semantic tokenization, joint-dataset supervised fine-tuning, and low-rank adaptation (LoRA) for efficient training. Evaluated across multiple benchmark datasets with rigorous adversarial robustness assessments, BERTector establishes a unified solution for high-performance IDS in complex and dynamic environments (Hu et al., 14 Aug 2025).
1. NSS-Tokenizer: Traffic-Aware Semantic Tokenization
Standard BERT tokenizers (such as WordPiece) are suboptimal for network traffic, as they fragment structured protocols, generate spurious subwords near punctuation, and yield long, noisy sequences. The NSS-Tokenizer is purpose-built to process network flows by identifying semantic boundaries between protocol fields and producing compact, meaningful token sequences.
Given a flow composed of string or numeric field values $x = (f_1, \dots, f_n)$, the NSS-Tokenizer inserts a special, out-of-vocabulary delimiter (e.g., ⊞) between fields, producing $s = f_1 \boxplus f_2 \boxplus \dots \boxplus f_n$. Character-level splitting around these delimiters yields an initial token list $T = (t_1, \dots, t_m)$. All token sequences are then truncated or padded to a uniform length $L$, and each token is mapped to an integer ID via a vocabulary lookup, yielding a fixed-length representation for every flow. This approach allows flows with heterogeneous fields from different datasets to be concatenated using a single delimiter, bypassing the need for manual feature alignment or explicit padding of numeric vectors. The model observes a unified sequence format for all samples, regardless of the original field count.
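The pipeline above can be sketched in a few lines of Python. The example flows, the character-level vocabulary construction, and the PAD/UNK id assignments are illustrative assumptions, not the paper's exact specification:

```python
# Minimal sketch of an NSS-Tokenizer-style pipeline (field values, vocab
# construction, and PAD/UNK ids are assumptions, not the paper's spec).

DELIM = "\u229e"  # ⊞ : out-of-vocabulary field delimiter
PAD_ID, UNK_ID = 0, 1

def build_vocab(flows):
    """Map every character seen in the flows (plus the delimiter) to an id."""
    chars = {DELIM}
    for flow in flows:
        for field in flow:
            chars.update(str(field))
    return {c: i + 2 for i, c in enumerate(sorted(chars))}  # 0/1 reserved

def nss_tokenize(flow, vocab, max_len):
    """Join fields with the delimiter, split char-wise, pad/truncate to max_len."""
    text = DELIM.join(str(field) for field in flow)
    ids = [vocab.get(c, UNK_ID) for c in text][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

# Two toy flows with heterogeneous fields share one fixed-length format.
flows = [("tcp", "http", 181, "SF"), ("udp", "domain_u", 0, "SF")]
vocab = build_vocab(flows)
seq = nss_tokenize(flows[0], vocab, max_len=24)
```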
2. Joint-Dataset Supervised Fine-Tuning
To mitigate dataset bias and improve cross-domain generalization, BERTector employs supervised fine-tuning on a hybrid dataset (MIX) assembled from four canonical sources: KDD-99, NSL-KDD, UNSW-NB15, and X-IIoTID, each contributing 100,000 sampled flows. All records are preprocessed into NetFlow-style tuples and tokenized with NSS-Tokenizer to uniform sequence length.
Class balancing is achieved via equal representation from each dataset; however, no per-category over- or under-sampling is applied. For the MIX set, every source's 100,000 samples are split 80/20 for train and validation, with separate test sets holding 10,000 previously unseen flows per source.
Supervised training uses the standard label-weighted cross-entropy loss on attack class labels with L2 regularization over all trainable parameters $\theta$:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} w_{y_i} \log p_\theta(y_i \mid x_i) + \lambda \lVert \theta \rVert_2^2$$

No explicit domain-adaptation loss is included; implicit regularization stems from dropout ($p = 0.1$), early stopping, and weight decay.
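The label-weighted cross-entropy with L2 regularization described above can be computed directly; the class weights, probabilities, and $\lambda$ below are illustrative values, not the paper's:

```python
import math

# Sketch of label-weighted cross-entropy plus L2 regularization
# (class weights, probabilities, and lam are illustrative, not the paper's).

def weighted_ce_l2(probs, labels, class_weights, params, lam=1e-4):
    """L = -(1/N) * sum_i w_{y_i} * log p_i(y_i) + lam * ||theta||_2^2."""
    n = len(labels)
    ce = -sum(class_weights[y] * math.log(p[y]) for p, y in zip(probs, labels)) / n
    l2 = lam * sum(w * w for w in params)
    return ce + l2

# Two samples, two classes, uniform class weights, a tiny parameter vector.
probs = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1]
loss = weighted_ce_l2(probs, labels, class_weights=[1.0, 1.0], params=[0.1, -0.2])
```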
3. Low-Rank Adaptation (LoRA) for Efficient Fine-Tuning
BERTector integrates LoRA to enable parameter-efficient and memory-conscious training. Rather than updating all ~110M BERT parameters per deployment, LoRA restricts learnable updates to low-rank matrices within each projection and feed-forward layer. For a layer with frozen weight $W_0 \in \mathbb{R}^{d \times k}$, LoRA models the update as $\Delta W = BA$, with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and rank $r \ll \min(d, k)$. At inference, the effective weight is $W = W_0 + BA$. Only $B$ and $A$ are optimized; $W_0$ is fixed. This reduces trainable parameters by approximately two orders of magnitude, accelerates training on joint data, and supports frequent redeployment with minimal computational overhead.
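The LoRA update can be sketched with toy matrices; the dimensions and values below are illustrative, not BERT's actual layer shapes:

```python
# Tiny LoRA sketch: freeze W0 and train only low-rank factors B (d x r)
# and A (r x k). Dimensions and values are toy illustrations.

def matmul(X, Y):
    """Plain-Python matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W0, B, A):
    """W = W0 + B @ A (merged at inference; only B and A are trained)."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W0, delta)]

d, k, r = 4, 4, 1  # toy layer with a rank-1 update
W0 = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # frozen
B = [[1.0], [0.0], [0.0], [0.0]]   # d x r, trainable
A = [[0.0, 0.5, 0.0, 0.0]]         # r x k, trainable
W = lora_effective_weight(W0, B, A)

# Trainable-parameter count: r*(d+k) for LoRA vs d*k for a full update.
full, lora = d * k, r * (d + k)
```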
4. Model Architecture and Training Regimen
BERTector adopts the standard BERT-base backbone with customizations for intrusion detection. Tokenized flows of fixed length $L$ are mapped via learned token and positional embeddings, then processed by twelve Transformer encoder layers (hidden size 768, 12 attention heads, feed-forward size 3072). The [CLS] token representation is pooled and passed through a dense layer for $C$-class softmax prediction. LoRA is applied to all projection and FFN weights.
Training uses Adam with L2 weight decay, a constant learning rate, batch size 64, dropout of 0.1, and up to 10 epochs with early stopping (patience 3). No learning-rate scheduler or warmup is reported, and gradients are not clipped.
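The early-stopping rule above (stop after 3 epochs without validation-loss improvement, capped at 10 epochs) can be sketched as follows; the validation-loss curve is fabricated for illustration:

```python
# Sketch of early stopping with patience = 3 and max 10 epochs, as described
# above. The validation-loss curve is made up for illustration.

def train_with_early_stopping(val_losses, patience=3, max_epochs=10):
    """Return the (1-indexed) stopping epoch and the best loss seen."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, since_best = loss, 0  # improvement: reset the counter
        else:
            since_best += 1
        if since_best >= patience:
            return epoch, best
    return min(len(val_losses), max_epochs), best

# Loss improves through epoch 3, then stagnates for 3 epochs -> stop at 6.
stop_epoch, best_loss = train_with_early_stopping(
    [0.9, 0.7, 0.65, 0.66, 0.67, 0.68, 0.5, 0.4])
```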
5. Benchmarking, Cross-Domain Generalization, and Robustness
Evaluation protocols include single-dataset training/testing (using each dataset independently) and joint-dataset (MIX) training with zero-shot testing on all four constituent test sets.
Key Metrics
- Accuracy: $\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN}$
- Precision: $P = \dfrac{TP}{TP + FP}$
- Recall: $R = \dfrac{TP}{TP + FN}$
- F1-score: $F_1 = \dfrac{2PR}{P + R}$
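The four metrics above follow directly from confusion-matrix counts; the counts used here are illustrative, not the paper's results:

```python
# The four evaluation metrics computed from confusion-matrix counts
# (TP, FP, FN, TN). The counts below are illustrative only.

def ids_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) for binary detection."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = ids_metrics(tp=90, fp=5, fn=10, tn=95)
```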
Exemplary single-dataset results are reported for NSL-KDD.
For MIX-trained BERTector, cross-dataset test accuracies are:
| Test Set | NSL-KDD | KDD-99 | UNSW-NB15 | X-IIoTID |
|---|---|---|---|---|
| Accuracy | 0.9903 | 0.9887 | 0.9610 | 0.9987 |
Adversarial Robustness
BERTector's resilience to additive noise was measured by perturbing NSL-KDD test flows with Poisson, uniform, Gaussian, and Laplace noise. Under Poisson noise, BERTector maintains accuracy well above classical ML and DL baselines, which drop below 0.83. With stronger uniform, Gaussian, or Laplace perturbations, BERTector continues to outperform alternatives, with accuracy remaining at 0.73–0.77 and F1 ≥ 0.78.
6. Interpretation, Impact, and Limitations
BERTector delivers generalization advantages through its traffic-aware NSS-Tokenizer, robust cross-domain representations from joint-dataset fine-tuning, and parameter-efficient adaptation via LoRA. These attributes collectively provide superior transfer and robustness properties over both conventional classifiers and single-dataset BERT approaches.
Notable limitations include the lack of treatment for online or streaming adaptation scenarios and the absence of explicit mechanisms for handling extreme class imbalance within rare attack categories. Future research directions may include incorporating dynamic domain-discrepancy losses or unsupervised anomaly streams to further improve detection of zero-day attacks.
7. Summary Table: Core Technical Components
| Component | Function | Key Innovations |
|---|---|---|
| NSS-Tokenizer | Tokenizes network flows for input to BERTector | Semantic protocol boundary detection, heterogeneity-agnostic |
| Joint-Dataset SFT | Multi-source supervised fine-tuning | Unified MIX dataset, cross-domain generalization |
| LoRA | Efficient parameter adaptation | Low-rank updates, two-order parameter reduction, fast redeployment |
These elements establish BERTector as a unified, high-performance IDS framework with demonstrated accuracy, cross-dataset generalization, and adversarial robustness (Hu et al., 14 Aug 2025).