BERT-flow: BERT-based Network Intrusion Detection
- The paper introduces a novel method that transforms network flows into contextual embeddings using a modified BERT architecture for enhanced domain adaptation.
- It tokenizes and discretizes core NetFlow features and constructs 768-dimensional representations without positional encodings to suit unordered flow sequences.
- Experimental results on CIDDS datasets demonstrate significant gains in accuracy and F1-score, validating its robust performance in zero-shot settings.
BERT-flow is a network intrusion detection method that leverages the Bidirectional Encoder Representations from Transformers (BERT) framework to encode sequences of network flows into contextualized embeddings, targeting robust domain adaptation in Network Intrusion Detection Systems (NIDS). It frames flow-sequence modeling analogously to natural language: an unsupervised language-modeling pre-training phase followed by fine-tuning yields flow-contextual representations that generalize across different network environments.
1. Flow Preprocessing and Feature Tokenization
BERT-flow operates on NetFlow records, performing an explicit feature selection and tokenization process. The approach discards IP addresses and timestamps, retaining six core features:
- Duration
- Transport protocol (Proto)
- Source port (Src Pt)
- Destination port (Dst Pt)
- Number of packets (Packets)
- Number of bytes (Bytes)
Continuous features (Duration, Ports, Packets, Bytes) are discretized into integer bins. For example, Duration is binned into intervals such as [0.001, 0.002, ..., 100, ∞], and Ports into [50, 60, ..., 60000, ∞]. If flag features are utilized, they are mapped to integers in [0, 63]. Each individual flow is thus represented as a 6-tuple of integers (x₁, …, x₆). Each integer xⱼ is embedded via a learned table Eⱼ into a 128-dimensional vector; the six vectors are concatenated to yield a 768-dimensional flow embedding e = [E₁(x₁); …; E₆(x₆)] ∈ ℝ⁷⁶⁸.
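The binning and embed-then-concatenate step can be sketched in PyTorch as follows. The bin edges below are illustrative placeholders (the paper's grids are finer, as described above), and the vocabulary sizes are assumptions:

```python
import torch
import torch.nn as nn

# Illustrative bin edges; the paper uses finer grids, e.g. Duration over
# [0.001, 0.002, ..., 100, inf] and ports over [50, 60, ..., 60000, inf].
DURATION_BINS = torch.tensor([0.001, 0.01, 0.1, 1.0, 10.0, 100.0])

def discretize(value: float, bins: torch.Tensor) -> int:
    """Map a continuous value to an integer bin index (last bin is open-ended)."""
    return int(torch.bucketize(torch.tensor(value), bins).item())

class FlowEmbedding(nn.Module):
    """Embed each of the six discretized flow features into 128 dimensions
    and concatenate them into one 768-dimensional flow vector."""
    def __init__(self, vocab_sizes, dim_per_feature=128):
        super().__init__()
        # One learned lookup table per feature (Duration, Proto, Src Pt, ...).
        self.tables = nn.ModuleList(
            nn.Embedding(v, dim_per_feature) for v in vocab_sizes
        )

    def forward(self, tokens):             # tokens: (batch, seq, 6) integer ids
        parts = [tab(tokens[..., i]) for i, tab in enumerate(self.tables)]
        return torch.cat(parts, dim=-1)    # (batch, seq, 6 * 128) = (..., 768)
```

Vocabulary sizes per feature depend on the chosen bin grids; passing six sizes to `FlowEmbedding` produces per-flow vectors of 6 × 128 = 768 dimensions.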
2. Embedding Layer Design and Sequence Construction
In contrast to standard BERT, BERT-flow uses neither positional encodings nor special tokens ("[CLS]", "[SEP]"), because flow sequences lack a strict grammatical order. The input to the model is a sequence of L flow embeddings stacked into a matrix X ∈ ℝ^(L×768). During training, the sequence length is L = 128; for evaluation, L = 1024.
3. BERT Encoder Architecture
BERT-flow modifies the canonical transformer configuration, using a single encoder layer with a single attention head and a hidden size of 768. The self-attention mechanism on input X operates according to:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where Q = XW_Q, K = XW_K, V = XW_V.
Each transformer block applies the standard residual-and-normalization pattern:

H = LayerNorm(X + Attention(Q, K, V)), Z = LayerNorm(H + FFN(H)).
The output Z contains one contextualized representation hᵢ ∈ ℝ⁷⁶⁸ per input flow.
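The encoder configuration above (one layer, one head, hidden size 768, no positional encodings) can be sketched with PyTorch's built-in transformer layer. The feed-forward width is an assumption, as the paper's value is not given here:

```python
import torch
import torch.nn as nn

class BertFlowEncoder(nn.Module):
    """Single transformer encoder layer with one attention head and hidden
    size 768, applied directly to flow embeddings. No positional encodings
    are added, matching the unordered-flow design. d_ff is an assumption."""
    def __init__(self, d_model=768, n_heads=1, d_ff=2048):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):                  # x: (batch, seq_len, 768)
        return self.encoder(x)             # contextualized per-flow vectors
```

Because no positional information is injected, the encoder is permutation-equivariant over the flow sequence, which is consistent with treating flows as an unordered neighborhood.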
4. Classification Head and Training Losses
Each final per-flow embedding hᵢ is passed through a shared linear projection and softmax activation to yield per-flow label predictions:

ŷᵢ = softmax(W hᵢ + b).
Fine-tuning minimizes the binary (or multi-class) cross-entropy loss:

L_CE = −(1/L) Σᵢ Σ_c yᵢ,c log ŷᵢ,c.
Pre-training uses Masked Language Modeling (MLM) applied to benign flows: randomly selected flows in a sequence are masked, and the model is trained to predict their original feature tokens from the surrounding context:

L_MLM = −Σ_{i ∈ M} log p(xᵢ | X_{∖M}), where M is the set of masked positions.
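Both losses can be sketched as follows. The masking rate, vocabulary size, and prediction-head shapes are assumptions for illustration; the shared classification head and per-flow cross-entropy follow the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_classes = 768, 2

# Shared linear head producing per-flow class logits (binary case).
head = nn.Linear(d_model, n_classes)

h = torch.randn(4, 128, d_model)           # contextual per-flow embeddings
labels = torch.randint(0, n_classes, (4, 128))

# Fine-tuning: cross-entropy over per-flow predictions.
logits = head(h)                           # (batch, seq, n_classes)
ce_loss = F.cross_entropy(logits.view(-1, n_classes), labels.view(-1))

# Pre-training (MLM sketch): predict the original token ids of masked flows.
# vocab_size and the 15% masking rate are assumptions, not paper values.
vocab_size = 1000
mlm_head = nn.Linear(d_model, vocab_size)
masked_pos = torch.rand(4, 128) < 0.15     # boolean mask over flow positions
targets = torch.randint(0, vocab_size, (4, 128))
mlm_logits = mlm_head(h)
mlm_loss = F.cross_entropy(mlm_logits[masked_pos], targets[masked_pos])
```

In practice each of the six discretized features would get its own MLM prediction head; the single head here keeps the sketch compact.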
The training schedule:
- Pre-train BERT on unlabeled benign flow sequences (MLM), 400 iterations.
- Freeze BERT weights, train classification head, 1100 iterations.
- Unfreeze full model, jointly fine-tune on labeled data, 400 iterations.
5. Domain Adaptation Approach
BERT-flow achieves domain adaptation by leveraging unsupervised MLM pre-training exclusively on benign flows, followed by fine-tuning with labeled source-domain sequences. Unlike adversarial or multi-task approaches, no explicit domain confusion or adaptation losses are used. By learning to contextually predict network flows from their local neighborhood, the method enables the extraction of generalized, transferable flow representations. Zero-shot deployment is performed by directly applying the trained model to target domain flows without further adaptation.
6. Experimental Methodology and Results
BERT-flow is evaluated on the CIDDS-001 OpenStack (internal, source domain), CIDDS-001 External Server (external), and CIDDS-002 OpenStack datasets. Key aspects of the experimental setup:
- Sequence length: 128 (train), 1024 (test)
- Batch size: 512
- Optimizer: Adam, learning rate
- Metrics: Accuracy, F1-score, Precision, Recall
- Baselines: Energy-based Flow Classifier (EFC), Decision Tree, KNN, SVM, Naive Bayes, AdaBoost, Random Forest, plain MLP
Quantitative results, with source training on internal CIDDS-001, are summarized below:
| Test Domain | Method | Accuracy | F1-score |
|---|---|---|---|
| CIDDS-001 External | BERT-flow | 0.9078 | 0.9311 |
| CIDDS-001 External | EFC | 0.8659 | 0.9044 |
| CIDDS-002 | BERT-flow | 0.9913 | 0.8578 |
| CIDDS-002 | EFC | 0.9084 | 0.3317 |
Performance gains are preserved even when training on a balanced subset of the source domain. These results indicate significantly superior zero-shot domain adaptation compared to classical ML and the EFC baseline (Nguyen et al., 2023).
7. Context and Implications
BERT-flow demonstrates that contextual embedding of network flows—treating sequences analogously to natural language text—enables more robust and generalizable detection of intrusions across domains. The method’s pre-training regime allows it to learn general sequence-level patterns beyond the capacity of traditional classifiers, addressing the widely observed poor domain adaptation of conventional ML-based NIDS. A plausible implication is that the incorporation of flow context and pre-trained sequence models provides a new methodological avenue for future intrusion detection research, emphasizing unsupervised contextual signal extraction and minimizing the need for extensive target domain supervision.