BERT-flow: BERT-based Network Intrusion Detection

Updated 21 February 2026
  • The paper introduces a novel method that transforms network flows into contextual embeddings using a modified BERT architecture for enhanced domain adaptation.
  • It tokenizes and discretizes core NetFlow features and constructs 768-dimensional representations without positional encodings to suit unordered flow sequences.
  • Experimental results on CIDDS datasets demonstrate significant gains in accuracy and F1-score, validating its robust performance in zero-shot settings.

BERT-flow is a network intrusion detection method that leverages the Bidirectional Encoder Representations from Transformers (BERT) framework to encode sequences of network flows into contextualized embeddings for robust domain adaptation in Network Intrusion Detection Systems (NIDS). It treats flow sequences analogously to natural-language sequences, learning flow-contextual representations that generalize across network environments through an unsupervised language-modeling pre-training phase and subsequent fine-tuning.

1. Flow Preprocessing and Feature Tokenization

BERT-flow operates on NetFlow records, performing an explicit feature selection and tokenization process. The approach discards IP addresses and timestamps, retaining six core features:

  1. Duration
  2. Transport protocol (Proto)
  3. Source port (Src Pt)
  4. Destination port (Dst Pt)
  5. Number of packets (Packets)
  6. Number of bytes (Bytes)

Continuous features (Duration, Ports, Packets, Bytes) are discretized into integer bins. For example, Duration is binned into intervals such as [0.001, 0.002, ..., 100, ∞], and Ports into [50, 60, ..., 60000, ∞]. If flag features are utilized, they are mapped to integers in [0, 63]. Each individual flow $i$ is represented as a 6-tuple of integers $x_i = (x_i^1, x_i^2, \ldots, x_i^6)$. Each $x_i^j$ is embedded via a learned table $E_j$ into a 128-dimensional vector; these are concatenated to yield a 768-dimensional embedding:

$$e_i = \operatorname{concat}(E_1(x_i^1), E_2(x_i^2), \ldots, E_6(x_i^6)) \in \mathbb{R}^{768}$$
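
The tokenize-and-embed pipeline can be sketched in NumPy. The bin edges, protocol vocabulary, and random embedding tables below are illustrative stand-ins: the paper's actual grids are finer, and real tables $E_j$ would be learned rather than sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bin edges (assumptions; the paper's grids are much finer).
duration_bins = np.array([0.001, 0.01, 0.1, 1.0, 10.0, 100.0])
port_bins = np.arange(50, 60001, 50)  # [50, 100, ..., 60000]

def tokenize_flow(duration, proto, src_pt, dst_pt, packets, bytes_):
    """Map one NetFlow record to a 6-tuple of integer tokens."""
    proto_ids = {"TCP": 0, "UDP": 1, "ICMP": 2}  # hypothetical vocabulary
    return (
        int(np.digitize(duration, duration_bins)),
        proto_ids[proto],
        int(np.digitize(src_pt, port_bins)),
        int(np.digitize(dst_pt, port_bins)),
        int(np.digitize(packets, [1, 10, 100, 1000])),
        int(np.digitize(bytes_, [64, 1024, 65536, 1 << 20])),
    )

# One 128-dim embedding table per feature (random stand-ins for learned E_j).
vocab_sizes = [len(duration_bins) + 1, 3, len(port_bins) + 1,
               len(port_bins) + 1, 5, 5]
tables = [rng.normal(size=(v, 128)) for v in vocab_sizes]

def embed_flow(tokens):
    """e_i = concat(E_1(x_i^1), ..., E_6(x_i^6)) in R^768."""
    return np.concatenate([tables[j][t] for j, t in enumerate(tokens)])

tokens = tokenize_flow(0.42, "TCP", 51234, 443, 12, 2048)
e_i = embed_flow(tokens)
assert e_i.shape == (768,)
```

Stacking $N$ such 768-dimensional vectors row-wise then produces the model input sequence.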

2. Embedding Layer Design and Sequence Construction

In contrast to standard BERT, BERT-flow does not use positional encodings or special tokens ("[CLS]", "[SEP]") because flow sequences lack a strict grammatical order. The input to the model is a sequence of $N$ flow embeddings, stacked into

$$z^{(0)} = [\,e_1;\, e_2;\, \ldots;\, e_N\,] \in \mathbb{R}^{N \times 768}$$

During training, the sequence length $N$ is set to 128; for evaluation, $N = 1024$.

3. BERT Encoder Architecture

BERT-flow modifies the canonical transformer configuration, using a single encoder layer ($L = 1$) and a single attention head ($H = 1$), with a hidden size of $d = 768$. The self-attention mechanism on input $Z^{(\ell-1)}$ operates according to:

$$Q = Z^{(\ell-1)} W_Q,\quad K = Z^{(\ell-1)} W_K,\quad V = Z^{(\ell-1)} W_V$$

$$\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Each transformer block yields:

$$A^{(\ell)} = \operatorname{LayerNorm}\!\left(Z^{(\ell-1)} + \operatorname{Attention}(Q,K,V)\right)$$

$$Z^{(\ell)} = \operatorname{LayerNorm}\!\left(A^{(\ell)} + \operatorname{MLP}(A^{(\ell)})\right)$$

The output $Z^{(L)}$ contains contextualized per-flow representations $h_i \in \mathbb{R}^{768}$.
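
A minimal NumPy sketch of one such encoder block follows. For brevity it omits the learned LayerNorm gain/bias and the attention output projection, and uses ReLU in place of BERT's GELU; the MLP width $d_{ff} = 3072$ is the conventional BERT choice, not a value confirmed by the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Parameter-free LayerNorm (learned gain/bias omitted for brevity).
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(
        x.var(axis=-1, keepdims=True) + eps)

def encoder_layer(z, W_q, W_k, W_v, W1, b1, W2, b2):
    """One transformer block with a single attention head (L=1, H=1, d=768)."""
    d_k = W_k.shape[1]
    q, k, v = z @ W_q, z @ W_k, z @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_k)) @ v
    a = layer_norm(z + attn)                       # A = LayerNorm(Z + Attention)
    mlp = np.maximum(a @ W1 + b1, 0.0) @ W2 + b2   # two-layer MLP, ReLU here
    return layer_norm(a + mlp)                     # Z' = LayerNorm(A + MLP(A))

rng = np.random.default_rng(0)
N, d, d_ff = 8, 768, 3072
z = rng.normal(size=(N, d))                        # z^(0): N flow embeddings
W_q, W_k, W_v = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))
W1, b1 = rng.normal(scale=0.02, size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d)), np.zeros(d)

h = encoder_layer(z, W_q, W_k, W_v, W1, b1, W2, b2)
assert h.shape == (N, d)  # one contextualized 768-d vector per flow
```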

4. Classification Head and Training Losses

Each final per-flow embedding hih_i is passed through a shared linear projection and softmax activation to yield per-flow label predictions:

$$\hat{y}_i = \operatorname{softmax}(W h_i + b) \in \mathbb{R}^2$$

Fine-tuning minimizes the binary (or multi-class) cross-entropy loss:

$$L_{CE} = -\sum_{i=1}^{N} \sum_{c \in \{\text{benign},\,\text{malicious}\}} y_{i,c} \log \hat{y}_{i,c}$$
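
A NumPy sketch of the head and loss, using random stand-ins for the contextualized embeddings $h_i$ and the projection $(W, b)$:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d = 4, 768
H = rng.normal(size=(N, d))               # contextualized embeddings h_1..h_N
W, b = rng.normal(scale=0.02, size=(2, d)), np.zeros(2)

y_hat = softmax(H @ W.T + b)              # per-flow class probabilities
y = np.eye(2)[[0, 1, 0, 0]]               # one-hot labels: benign=0, malicious=1

ce = -np.sum(y * np.log(y_hat))           # L_CE summed over flows and classes
assert y_hat.shape == (N, 2) and np.allclose(y_hat.sum(axis=1), 1.0)
```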

Pre-training uses Masked Language Modeling (MLM) applied to benign flows:

$$L_{MLM} = -\sum_{\text{masked positions}} \log P(x_{\text{masked}} \mid \text{context})$$
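
The objective can be illustrated on a toy token sequence. The vocabulary size, the dedicated mask token, and the random "model" logits below are assumptions for illustration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 50                                   # toy vocabulary of feature tokens
MASK = V                                 # extra id used as the mask token
seq = rng.integers(0, V, size=16)        # a benign token sequence

mask_pos = rng.choice(len(seq), size=3, replace=False)
targets = seq[mask_pos].copy()           # the tokens the model must recover
corrupted = seq.copy()
corrupted[mask_pos] = MASK               # replace masked positions

# Stand-in model output: per-position logits over the vocabulary.
logits = rng.normal(size=(len(seq), V))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# L_MLM = -sum over masked positions of log P(x_masked | context)
l_mlm = -log_probs[mask_pos, targets].sum()
assert l_mlm > 0
```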

The training schedule:

  • Pre-train BERT on unlabeled benign flow sequences (MLM), 400 iterations.
  • Freeze BERT weights, train classification head, 1100 iterations.
  • Unfreeze full model, jointly fine-tune on labeled data, 400 iterations.
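
The schedule above can be captured as a small driver loop; the phase names and the per-step body are hypothetical placeholders, while the iteration counts and freeze/unfreeze pattern follow the description above.

```python
# Hypothetical three-phase training driver (step body elided).
phases = [
    {"name": "mlm_pretrain",   "iters": 400,  "trainable": ["bert"],         "loss": "MLM"},
    {"name": "head_only",      "iters": 1100, "trainable": ["head"],         "loss": "CE"},
    {"name": "joint_finetune", "iters": 400,  "trainable": ["bert", "head"], "loss": "CE"},
]

total_steps = 0
for phase in phases:
    for _ in range(phase["iters"]):
        # one optimizer step updating only the parameters in phase["trainable"]
        total_steps += 1

assert total_steps == 1900
```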

5. Domain Adaptation Approach

BERT-flow achieves domain adaptation by leveraging unsupervised MLM pre-training exclusively on benign flows, followed by fine-tuning with labeled source-domain sequences. Unlike adversarial or multi-task approaches, no explicit domain confusion or adaptation losses are used. By learning to contextually predict network flows from their local neighborhood, the method enables the extraction of generalized, transferable flow representations. Zero-shot deployment is performed by directly applying the trained model to target domain flows without further adaptation.

6. Experimental Methodology and Results

BERT-flow is evaluated on the CIDDS-001 OpenStack (internal, source domain), CIDDS-001 External Server (external), and CIDDS-002 OpenStack datasets. Key aspects of the experimental setup:

  • Sequence length: 128 (train), 1024 (test)
  • Batch size: 512
  • Optimizer: Adam, learning rate $1 \times 10^{-5}$
  • Metrics: Accuracy, F1-score, Precision, Recall
  • Baselines: Energy-based Flow Classifier (EFC), Decision Tree, KNN, SVM, Naive Bayes, AdaBoost, Random Forest, plain MLP

Quantitative results, with source training on internal CIDDS-001, are summarized below:

Test Domain          Method      Accuracy   F1-score
CIDDS-001 External   BERT-flow   0.9078     0.9311
CIDDS-001 External   EFC         0.8659     0.9044
CIDDS-002            BERT-flow   0.9913     0.8578
CIDDS-002            EFC         0.9084     0.3317

Performance gains are preserved even when training on a balanced subset of the source domain. These results indicate significantly superior zero-shot domain adaptation compared to classical ML and the EFC baseline (Nguyen et al., 2023).

7. Context and Implications

BERT-flow demonstrates that contextual embedding of network flows—treating sequences analogously to natural language text—enables more robust and generalizable detection of intrusions across domains. The method’s pre-training regime allows it to learn general sequence-level patterns beyond the capacity of traditional classifiers, addressing the widely observed poor domain adaptation of conventional ML-based NIDS. A plausible implication is that the incorporation of flow context and pre-trained sequence models provides a new methodological avenue for future intrusion detection research, emphasizing unsupervised contextual signal extraction and minimizing the need for extensive target domain supervision.
