HERA Flow Exporter

Updated 24 November 2025

HERA Flow Exporter is a modular pipeline that converts raw PCAP files into high-fidelity flow datasets using a robust five-tuple aggregation process with Argus.
It employs timeout-based flow separation and customizable feature extraction, ensuring bidirectional accuracy and reproducible results as validated on benchmark datasets.
The exporter produces labelled and unlabelled CSVs that support advanced machine-learning pipelines for network intrusion detection and telemetry research.

HERA Flow Exporter is a modular pipeline for transforming raw network packet captures (PCAPs) into high-fidelity, flow-oriented datasets with fully user-customizable features and integrated ground-truth labelling. By orchestrating Argus as its flow aggregation backend, HERA provides bidirectional, timeout-based flow separation, robust 5-tuple clustering, high throughput, and reproducibility. Its output (labelled or unlabelled CSVs and summary statistics) forms the basis for machine-learning pipelines in network intrusion detection and network telemetry research. Evaluation on widely used datasets, including UNSW-NB15 and CIC-IDS2017, demonstrates significant improvements in both flow consistency and model performance compared to prior exporters such as CICFlowMeter and Zeek (Pinto et al., 13 Jan 2025, Pinto et al., 2024).

1. Architectural Overview and Workflow

The HERA (Holistic nEtwork featuRes Aggregator) pipeline is structured as a sequence of orchestrated components:

Packet Ingestion: Supports any external PCAP capture tool. Input is one or more libpcap files.
Flow Aggregation: Invokes the Argus binary to aggregate packets into flows using standard 5-tuple keys—(Source IP, Destination IP, Source Port, Destination Port, Protocol)—and timeout logic.
Feature Extraction: Consumes Argus flow files with user-specified field selection via Argus clients (ra, racluster) and computes additional statistical or derived features in Python.
Dataset Labelling: Optionally merges flows with a user-provided ground-truth CSV, matching via time intervals, 5-tuple, and protocol.
Export: Writes flows to a native Argus .hera file, unlabelled and labelled CSVs (UTF-8, with explicit fieldsets), and a human-readable summary text file.

The canonical data path is:

PCAP files → Argus subprocess (via HERA) → .hera flow file → ra/racluster + feature computation → unlabelled CSV → (labelling step) → labelled CSV + summary.txt

Configuration options (via CLI or config file) permit customization of flow timeouts, field selection, output formats, and labelling behavior.
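
The data path above could be driven from Python roughly as follows. This is a minimal sketch under stated assumptions, not HERA's actual code: the function names `build_commands` and `run_pipeline`, the output directory layout, and the choice of fields are hypothetical. The flags used are the standard Argus options for reading input (`-r`), writing output (`-w`), suppressing name resolution (`-n`), setting a field separator (`-c`), and selecting fields (`-s`).

```python
import subprocess
from pathlib import Path


def build_commands(pcap, out_dir, fields):
    """Return the argv lists for the aggregation and field-export steps."""
    pcap = Path(pcap)
    flow_file = Path(out_dir) / (pcap.stem + ".hera")
    # Step 1: argus reads the PCAP (-r) and writes binary flow records (-w).
    argus_cmd = ["argus", "-r", str(pcap), "-w", str(flow_file)]
    # Step 2: ra reads the flow file, keeps numeric addresses (-n), uses a
    # comma as field separator (-c) and prints only the chosen fields (-s).
    ra_cmd = ["ra", "-r", str(flow_file), "-n", "-c", ",", "-s", *fields]
    return argus_cmd, ra_cmd


def run_pipeline(pcap, out_dir, fields):
    """Run both steps and write the unlabelled CSV (requires Argus installed)."""
    argus_cmd, ra_cmd = build_commands(pcap, out_dir, fields)
    subprocess.run(argus_cmd, check=True)
    csv_path = Path(out_dir) / (Path(pcap).stem + ".csv")
    with open(csv_path, "w", encoding="utf-8") as out:
        subprocess.run(ra_cmd, check=True, stdout=out)
    return csv_path


# Only build (not run) the commands, so the example works without Argus.
argus_cmd, ra_cmd = build_commands("capture.pcap", "out", ["stime", "saddr", "daddr"])
print(" ".join(argus_cmd))
print(" ".join(ra_cmd))
```

Separating command construction from execution keeps the invocation inspectable and testable on machines where Argus is not installed.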

2. Flow Generation and Key Algorithms

HERA delegates flow keying, separation, and bidirectionality to Argus. Core definitions and algorithmic elements include:

Flow Definition: All packets sharing a (SrcIP, DstIP, SrcPort, DstPort, Protocol) 5-tuple until inactive or active timeout.
Formal Metrics:
- Flow duration: $\Delta t = t_{\mathrm{last}} - t_{\mathrm{first}}$
- Inter-arrival times: $\mathrm{IAT}_i = t_i - t_{i-1}$ for $i=2,\dots,N_{\mathrm{pkts}}$
- Throughput: $\mathit{Throughput} = \frac{\sum_{i=1}^N \mathit{size}_i}{\Delta t}$
Flow Separation:
- Inactive timeout: default 15–60 s; a flow is closed when no new packet arrives within this window
- Active timeout: user-tunable (1–60 s); a flow that is still active is forcibly closed and reported once this period elapses
- Argus matches bi-directional flows, incorporates TCP sequence heuristics, and reorders out-of-order TCP packets.
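
The formal metrics listed above can be illustrated with a short Python sketch. The function name `flow_metrics` and its return structure are hypothetical; it assumes timestamps in seconds, sorted in ascending order, and packet sizes in bytes.

```python
def flow_metrics(timestamps, sizes):
    """Compute duration, inter-arrival times and throughput for one flow.

    timestamps: packet arrival times in seconds, ascending.
    sizes: packet sizes in bytes, in the same order.
    """
    # Flow duration: t_last - t_first.
    duration = timestamps[-1] - timestamps[0]
    # Inter-arrival times: t_i - t_{i-1} for i = 2..N.
    iats = [t - prev for prev, t in zip(timestamps, timestamps[1:])]
    total_bytes = sum(sizes)
    # A single-packet flow has zero duration, so throughput is undefined.
    throughput = total_bytes / duration if duration > 0 else None
    return {"duration": duration, "iats": iats, "throughput": throughput}


m = flow_metrics([0.0, 0.5, 2.0], [100, 300, 200])
print(m)  # duration 2.0 s, IATs [0.5, 1.5], throughput 300.0 bytes/s
```

Returning `None` for zero-duration flows is one possible convention; an exporter may instead report 0 or omit the field.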

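A simplified Python sketch of the flow export logic that HERA delegates to Argus is shown below. Packets are assumed to arrive in time order, and the timeouts are checked before a flow is updated, because after the update the last-packet time always equals the current time:

def export_flows(packets, inactive_timeout, active_timeout):
    active_flows = {}   # 5-tuple of a flow's first packet -> flow state
    finished = []
    for p in packets:
        now = p.time
        key = (p.srcIP, p.dstIP, p.srcPort, p.dstPort, p.proto)
        revkey = (p.dstIP, p.srcIP, p.dstPort, p.srcPort, p.proto)
        match = key if key in active_flows else revkey   # match either direction
        F = active_flows.get(match)
        if F is not None and (now - F["last_pkt_time"] > inactive_timeout
                              or now - F["start_time"] > active_timeout):
            finished.append(active_flows.pop(match))     # output and remove
            F = None
        if F is None:                                    # start a new flow
            F = active_flows[key] = {"key": key, "start_time": now, "pkts": 0, "bytes": 0}
        F["last_pkt_time"] = now                         # update flow with p
        F["pkts"] += 1
        F["bytes"] += p.size
    finished.extend(active_flows.values())               # flush remaining flows
    return finished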

(Pinto et al., 13 Jan 2025, Pinto et al., 2024)

3. Feature Extraction and Configurability

Feature extraction operates on Argus’s bidirectional flows with user-driven field selection. HERA supports approximately 100–130 raw Argus fields, plus HERA-specific computed features.

Basic Identifiers: FlowID, rank, stime, ltime, saddr, daddr, sport, dport, proto
Volume Metrics: bytes, sbytes, dbytes, pkts, spkts, dpkts
Timing Features: dur, runtime, idle, iat_min, iat_max, iat_mean, iat_std
TCP Specific: flgs, tcpopt, synack, ackpsh, rst, urg, win
Statistical Measures: meansz, stdsz, head_pkt_size, tail_pkt_size, pkt_len_min/max/mean/std
Flow-level Counts: Ssaddr, Sdaddr (number of concurrent flows with same service/IP, HERA-calculated)

Custom fieldsets are specified via command-line flags (e.g., --features unsw) or, in future work, JSON/YAML configs. Presets for “all,” “default,” “UNSW-NB15,” “BoT-IoT,” and “CIC-IDS2017” are provided. Post-processing in Python allows arbitrary features to be appended after the Argus export.
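
Preset-based field selection followed by Python post-processing might look like the sketch below. Everything named here is an assumption for illustration: the `PRESETS` contents, the `tiny` preset, the `select_features` function, and the derived `bytes_per_pkt` feature are hypothetical and do not reproduce HERA's actual presets or computed features.

```python
# Hypothetical presets; the real HERA presets hold different, longer field lists.
PRESETS = {
    "default": ["stime", "saddr", "daddr", "proto", "dur", "pkts", "bytes"],
    "tiny": ["saddr", "daddr", "proto", "bytes"],
}


def select_features(flows, preset, derived=None):
    """Keep only the preset's fields and append derived features.

    flows: list of dicts, one per flow, keyed by Argus field name.
    derived: mapping of new column name -> function(flow) -> value.
    """
    fields = PRESETS[preset]
    derived = derived or {}
    rows = []
    for flow in flows:
        row = {name: flow[name] for name in fields}
        # Derived features see the full flow record, not just the preset.
        for name, func in derived.items():
            row[name] = func(flow)
        rows.append(row)
    return rows


def bytes_per_packet(flow):
    """Hypothetical derived feature: mean bytes per packet."""
    return flow["bytes"] / flow["pkts"] if flow["pkts"] else 0.0


flows = [{"stime": 0.0, "saddr": "10.0.0.1", "daddr": "10.0.0.2",
          "proto": "tcp", "dur": 1.5, "pkts": 4, "bytes": 600}]
rows = select_features(flows, "tiny", {"bytes_per_pkt": bytes_per_packet})
print(rows[0])
```

Passing derived features as a name-to-function mapping lets the fieldset change without touching the selection logic.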

4. Labelling Integration and Output Formats

Ground-truth labelling uses CSV files that specify temporal intervals, protocol, and 5-tuple values. Each exported flow record is matched (by overlap in time, protocol, and 5-tuple components) against the provided ground truth and annotated with a class label; flows that match no entry are assigned “Benign.”
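
The matching rule described above can be sketched in Python as follows. This is an illustrative simplification, not HERA's implementation: the function name `label_flows` and the ground-truth column names (`start`, `end`, `label`) are hypothetical, the full 5-tuple is required to match exactly, and the first matching ground-truth row wins.

```python
TUPLE_FIELDS = ("saddr", "daddr", "sport", "dport", "proto")


def label_flows(flows, ground_truth, default_label="Benign"):
    """Annotate each flow with the label of the first matching ground-truth row."""
    labelled = []
    for flow in flows:
        label = default_label
        for gt in ground_truth:
            same_tuple = all(flow[k] == gt[k] for k in TUPLE_FIELDS)
            # Two closed intervals overlap when each starts before the other ends.
            overlaps = flow["stime"] <= gt["end"] and gt["start"] <= flow["ltime"]
            if same_tuple and overlaps:
                label = gt["label"]
                break
        labelled.append({**flow, "label": label})
    return labelled


attack = {"saddr": "10.0.0.5", "daddr": "10.0.0.9", "sport": 4444,
          "dport": 80, "proto": "tcp"}
ground_truth = [{**attack, "start": 100.0, "end": 200.0, "label": "Exploits"}]
flows = [
    {**attack, "stime": 150.0, "ltime": 160.0},   # inside the attack window
    {**attack, "stime": 300.0, "ltime": 310.0},   # same 5-tuple, outside it
]
labelled = label_flows(flows, ground_truth)
print([f["label"] for f in labelled])  # ['Exploits', 'Benign']
```

The nested loop is quadratic; for large captures an index on the 5-tuple or an interval join would be the more practical choice.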

Supported outputs:

.hera: Argus binary flow file
.csv: Unlabelled flow records; optionally ground-truth labelled
.txt: Summary statistics (e.g., total flows, benign/malicious ratios)
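
A summary file of the kind listed above could be produced from the label column as in this sketch. The exact layout of HERA's summary file is not specified in the source, so the function name `summarize` and the line format are assumptions.

```python
from collections import Counter


def summarize(labels, benign_label="Benign"):
    """Build a human-readable summary from a list of flow labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    benign = counts.get(benign_label, 0)
    lines = [f"Total flows: {total}"]
    if total:
        malicious = total - benign
        lines.append(f"Benign: {benign} ({benign / total:.2%})")
        lines.append(f"Malicious: {malicious} ({malicious / total:.2%})")
    # Per-class breakdown, sorted by label for stable output.
    for label, n in sorted(counts.items()):
        lines.append(f"  {label}: {n}")
    return "\n".join(lines)


summary = summarize(["Benign", "Benign", "DoS", "Benign"])
print(summary)
```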

HERA does not export NetFlow or IPFIX natively, but its intermediate files can be processed further. A CLI switch allows flows to be dropped based on management or orphan status. Control over output size and rotation is planned as a future extension (Pinto et al., 13 Jan 2025).

5. Comparative Analysis with Existing Flow Exporters

Comparative evaluation focuses on consistency, completeness, throughput, and integration:

Exporter	Flow Features	Bidirectional Accuracy	Throughput
HERA/Argus	~100–130 (custom)	Robust (5-tuple, TCP-seq)	>100K flows/sec
CICFlowMeter	~84 (fixed/partial)	Known TCP segmentation bug	<100K flows/sec
Zeek (Bro)	~30 (app protos)	Coarse for flows	Variable

HERA, via Argus, avoids known mis-segmentation (notably in CICFlowMeter), preserves bidirectionality and precise timing, and directly integrates labelling. Mixing exporters or field schemas is shown to induce “semantic drift,” resulting in inconsistent ML outcomes (Pinto et al., 2024).

6. Validation, Performance, and Practical Impact

Validation on UNSW-NB15 and CIC-IDS2017 datasets establishes ground-truth flow alignment and machine-learning effectiveness:

UNSW-NB15, Day 2:
- HERA: 777,096 flows (vs. 1,140,045 in the original dataset; the difference is attributed to Zeek post-processing)
- Labelling: merged via time interval and 5-tuple to assign attack/benign labels

Classifier Performance (selected features, UNSW-NB15):

Algorithm	Precision	Recall	F1-score	Accuracy
RandomForestClassifier	0.98	0.98	0.98	0.9894
LGBMClassifier	0.97	0.97	0.97	0.9890
XGBClassifier	0.97	0.97	0.97	0.9888

Unsupervised K-Means (silhouette scores):
- HERA’s CSV: 0.7258
- Researcher’s CSV: 0.6879
- Original UNSW-NB15: 0.5061
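
For reference, and assuming these figures are mean silhouette scores in the standard sense (the source does not state the distance metric or the number of clusters), the silhouette coefficient of a sample $i$ is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the smallest mean distance from $i$ to the points of any other cluster. The reported score is the average of $s(i)$ over all samples and lies in $[-1, 1]$; higher values indicate more compact, better separated clusters.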

Performance is governed by Argus throughput, routinely exceeding 100,000 flows/sec with sub-5% Python wrapper overhead and <200 MB RAM usage. Parallelization, chunked I/O, and caching are leveraged for scalability (Pinto et al., 13 Jan 2025, Pinto et al., 2024).

7. Best Practices and Recommended Use

Empirical findings show that consistent exporter usage and feature schema are foundational for robust ML-based intrusion detection. Recommendations include:

Use fixed feature schemas and exporters across all PCAPs to avoid semantic drift.
Select flow timeouts suited to the application: short for anomaly detection, longer for attack summarization.
Minimize exported features to reduce compute and avoid redundancy.
Integrate labelling at the point of flow export to ensure dataset-ground-truth alignment.
Benchmark end-to-end throughput; >100K flows/sec is feasible.
Validate class/feature distributions against published values to detect exporter or dataset anomalies.
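
Two of the recommendations above, a fixed feature schema and validation of class distributions, lend themselves to simple automated checks. The sketch below is one possible form; the function names `schema_mismatch` and `class_ratio_drift` and the example columns and counts are hypothetical.

```python
import csv
import io


def schema_mismatch(reference_csv, candidate_csv):
    """Compare the header rows of two open CSV files."""
    ref = next(csv.reader(reference_csv))
    cand = next(csv.reader(candidate_csv))
    return {
        "missing": [c for c in ref if c not in cand],
        "extra": [c for c in cand if c not in ref],
        # Same columns but in a different order also breaks positional consumers.
        "order_differs": ref != cand and sorted(ref) == sorted(cand),
    }


def class_ratio_drift(observed, published):
    """Absolute difference in class share between two label-count mappings."""
    obs_total = sum(observed.values())
    pub_total = sum(published.values())
    labels = set(observed) | set(published)
    return {
        label: abs(observed.get(label, 0) / obs_total
                   - published.get(label, 0) / pub_total)
        for label in labels
    }


reference = io.StringIO("stime,saddr,daddr,bytes\n")
candidate = io.StringIO("stime,saddr,bytes,pkts\n")
report = schema_mismatch(reference, candidate)
print(report)

drift = class_ratio_drift({"Benign": 90, "DoS": 10}, {"Benign": 80, "DoS": 20})
print(drift)
```

A non-empty `missing` or `extra` list signals that two CSVs should not be mixed in one training set, and a large drift value for any class suggests checking the exporter settings or the labelling step.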

A plausible implication is that tight integration of Argus-based aggregation, post-processing, and labelling—central to HERA’s design—produces reproducible, semantically consistent flow datasets, directly supporting advanced ML pipelines in network security research (Pinto et al., 13 Jan 2025, Pinto et al., 2024).