BERT-flow: BERT-based Network Intrusion Detection
- The paper introduces a novel method that transforms network flows into contextual embeddings using a modified BERT architecture for enhanced domain adaptation.
- It tokenizes and discretizes core NetFlow features and constructs 768-dimensional representations without positional encodings to suit unordered flow sequences.
- Experimental results on CIDDS datasets demonstrate significant gains in accuracy and F1-score, validating its robust performance in zero-shot settings.
BERT-flow is a network intrusion detection method that leverages the Bidirectional Encoder Representations from Transformers (BERT) framework to encode sequences of network flows into contextualized embeddings, targeting robust domain adaptation in Network Intrusion Detection Systems (NIDS). It frames flow-sequence modeling analogously to natural language: an unsupervised language-modeling pre-training phase followed by fine-tuning yields flow-contextual representations that generalize across different network environments.
1. Flow Preprocessing and Feature Tokenization
BERT-flow operates on NetFlow records, performing an explicit feature selection and tokenization process. The approach discards IP addresses and timestamps, retaining six core features:
- Duration
- Transport protocol (Proto)
- Source port (Src Pt)
- Destination port (Dst Pt)
- Number of packets (Packets)
- Number of bytes (Bytes)
Continuous features (Duration, Ports, Packets, Bytes) are discretized into integer bins. For example, Duration is binned into intervals such as [0.001, 0.002, ..., 100, ∞], and Ports into [50, 60, ..., 60000, ∞]. If flag features are utilized, they are mapped to integers in [0, 63]. Each individual flow is thus represented as a 6-tuple of integers (x₁, …, x₆). Each integer xⱼ is embedded via a learned table Eⱼ into a 128-dimensional vector; the six vectors are concatenated to yield a 768-dimensional flow embedding e = [E₁(x₁); …; E₆(x₆)] ∈ ℝ⁷⁶⁸.
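The binning and embed-then-concatenate step can be sketched in PyTorch as follows. The bin edges below are illustrative placeholders (the paper's grids are finer, as described above), and the vocabulary sizes are assumptions:

```python
import torch
import torch.nn as nn

# Illustrative bin edges; the paper uses finer grids, e.g. Duration over
# [0.001, 0.002, ..., 100, inf] and ports over [50, 60, ..., 60000, inf].
DURATION_BINS = torch.tensor([0.001, 0.01, 0.1, 1.0, 10.0, 100.0])

def discretize(value: float, bins: torch.Tensor) -> int:
    """Map a continuous value to an integer bin index (last bin is open-ended)."""
    return int(torch.bucketize(torch.tensor(value), bins).item())

class FlowEmbedding(nn.Module):
    """Embed each of the six discretized flow features into 128 dimensions
    and concatenate them into one 768-dimensional flow vector."""
    def __init__(self, vocab_sizes, dim_per_feature=128):
        super().__init__()
        # One learned lookup table per feature (Duration, Proto, Src Pt, ...).
        self.tables = nn.ModuleList(
            nn.Embedding(v, dim_per_feature) for v in vocab_sizes
        )

    def forward(self, tokens):             # tokens: (batch, seq, 6) integer ids
        parts = [tab(tokens[..., i]) for i, tab in enumerate(self.tables)]
        return torch.cat(parts, dim=-1)    # (batch, seq, 6 * 128) = (..., 768)
```

Vocabulary sizes per feature depend on the chosen bin grids; passing six sizes to `FlowEmbedding` produces per-flow vectors of 6 × 128 = 768 dimensions.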
2. Embedding Layer Design and Sequence Construction
In contrast to standard BERT, BERT-flow uses neither positional encodings nor special tokens ("[CLS]", "[SEP]"), because flow sequences lack a strict grammatical order. The input to the model is a sequence of L flow embeddings stacked into a matrix X ∈ ℝ^(L×768). During training, the sequence length is L = 128; for evaluation, L = 1024.
3. BERT Encoder Architecture
BERT-flow modifies the canonical transformer configuration, using a single encoder layer with a single attention head and a hidden size of 768. The self-attention mechanism on input X operates according to:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where Q = XW_Q, K = XW_K, V = XW_V.
Each transformer block applies the standard residual-and-normalization pattern:

H = LayerNorm(X + Attention(Q, K, V)), Z = LayerNorm(H + FFN(H)).
The output Z contains one contextualized representation hᵢ ∈ ℝ⁷⁶⁸ per input flow.
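The encoder configuration above (one layer, one head, hidden size 768, no positional encodings) can be sketched with PyTorch's built-in transformer layer. The feed-forward width is an assumption, as the paper's value is not given here:

```python
import torch
import torch.nn as nn

class BertFlowEncoder(nn.Module):
    """Single transformer encoder layer with one attention head and hidden
    size 768, applied directly to flow embeddings. No positional encodings
    are added, matching the unordered-flow design. d_ff is an assumption."""
    def __init__(self, d_model=768, n_heads=1, d_ff=2048):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):                  # x: (batch, seq_len, 768)
        return self.encoder(x)             # contextualized per-flow vectors
```

Because no positional information is injected, the encoder is permutation-equivariant over the flow sequence, which is consistent with treating flows as an unordered neighborhood.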
4. Classification Head and Training Losses
Each final per-flow embedding hᵢ is passed through a shared linear projection and softmax activation to yield per-flow label predictions:

ŷᵢ = softmax(W hᵢ + b).
Fine-tuning minimizes the binary (or multi-class) cross-entropy loss:

L_CE = −(1/L) Σᵢ Σ_c yᵢ,c log ŷᵢ,c.
Pre-training uses Masked Language Modeling (MLM) applied to benign flows: randomly selected flows in a sequence are masked, and the model is trained to predict their original feature tokens from the surrounding context:

L_MLM = −Σ_{i ∈ M} log p(xᵢ | X_{∖M}), where M is the set of masked positions.
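Both losses can be sketched as follows. The masking rate, vocabulary size, and prediction-head shapes are assumptions for illustration; the shared classification head and per-flow cross-entropy follow the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_classes = 768, 2

# Shared linear head producing per-flow class logits (binary case).
head = nn.Linear(d_model, n_classes)

h = torch.randn(4, 128, d_model)           # contextual per-flow embeddings
labels = torch.randint(0, n_classes, (4, 128))

# Fine-tuning: cross-entropy over per-flow predictions.
logits = head(h)                           # (batch, seq, n_classes)
ce_loss = F.cross_entropy(logits.view(-1, n_classes), labels.view(-1))

# Pre-training (MLM sketch): predict the original token ids of masked flows.
# vocab_size and the 15% masking rate are assumptions, not paper values.
vocab_size = 1000
mlm_head = nn.Linear(d_model, vocab_size)
masked_pos = torch.rand(4, 128) < 0.15     # boolean mask over flow positions
targets = torch.randint(0, vocab_size, (4, 128))
mlm_logits = mlm_head(h)
mlm_loss = F.cross_entropy(mlm_logits[masked_pos], targets[masked_pos])
```

In practice each of the six discretized features would get its own MLM prediction head; the single head here keeps the sketch compact.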
The training schedule:
- Pre-train BERT on unlabeled benign flow sequences (MLM), 400 iterations.
- Freeze BERT weights, train classification head, 1100 iterations.
- Unfreeze full model, jointly fine-tune on labeled data, 400 iterations.
5. Domain Adaptation Approach
BERT-flow achieves domain adaptation by leveraging unsupervised MLM pre-training exclusively on benign flows, followed by fine-tuning with labeled source-domain sequences. Unlike adversarial or multi-task approaches, no explicit domain confusion or adaptation losses are used. By learning to contextually predict network flows from their local neighborhood, the method enables the extraction of generalized, transferable flow representations. Zero-shot deployment is performed by directly applying the trained model to target domain flows without further adaptation.
6. Experimental Methodology and Results
BERT-flow is evaluated on the CIDDS-001 OpenStack (internal, source domain), CIDDS-001 External Server (external), and CIDDS-002 OpenStack datasets. Key aspects of the experimental setup:
- Sequence length: 128 (train), 1024 (test)
- Batch size: 512
- Optimizer: Adam, learning rate
- Metrics: Accuracy, F1-score, Precision, Recall
- Baselines: Energy-based Flow Classifier (EFC), Decision Tree, KNN, SVM, Naive Bayes, AdaBoost, Random Forest, plain MLP
Quantitative results, with source training on internal CIDDS-001, are summarized below:
| Test Domain | Method | Accuracy | F1-score |
|---|---|---|---|
| CIDDS-001 External | BERT-flow | 0.9078 | 0.9311 |
| CIDDS-001 External | EFC | 0.8659 | 0.9044 |
| CIDDS-002 | BERT-flow | 0.9913 | 0.8578 |
| CIDDS-002 | EFC | 0.9084 | 0.3317 |
Performance gains are preserved even when training on a balanced subset of the source domain. These results indicate significantly superior zero-shot domain adaptation compared to classical ML and the EFC baseline (Nguyen et al., 2023).
7. Context and Implications
BERT-flow demonstrates that contextual embedding of network flows—treating sequences analogously to natural language text—enables more robust and generalizable detection of intrusions across domains. The method’s pre-training regime allows it to learn general sequence-level patterns beyond the capacity of traditional classifiers, addressing the widely observed poor domain adaptation of conventional ML-based NIDS. A plausible implication is that the incorporation of flow context and pre-trained sequence models provides a new methodological avenue for future intrusion detection research, emphasizing unsupervised contextual signal extraction and minimizing the need for extensive target domain supervision.