BATADAL Benchmark Data

Updated 23 December 2025

BATADAL benchmark data is a simulated multivariate time-series dataset from water distribution systems, featuring normal and adversarial states generated via an EPANET hydraulic model.
It includes 43 sensor streams recorded hourly over roughly 1.5 years, with annotated intervals of rare cyber-attacks; the benchmark study summarized here augments them with lag, difference, and rolling-window features.
Benchmark findings favor tree-based and stacked models: a stack of RF, XGBoost, and LSTM attains the highest AUC and precision, while XGBoost alone attains the highest F₁-score.

BATADAL benchmark data constitutes a simulated, multivariate time series derived from a medium-sized water distribution system (WDS) under Supervisory Control and Data Acquisition (SCADA) supervision. Designed explicitly for cyber-attack detection research, the dataset reproduces both normal and adversarial operational states using an EPANET hydraulic model, offering 43 physical sensor streams with precisely annotated attack intervals. Its primary role is to serve as a robust testbed for developing and rigorously benchmarking detection algorithms that address rare, subtle, and temporally correlated cyber-physical threats to critical infrastructure.

1. Simulation Environment and Data Structure

The BATADAL dataset was generated by simulating the WDS hydraulics and controls in EPANET. The modeled system encompasses pumps, tanks, valves, and junctions, and the sensor outputs reflect the simulated hydraulic dynamics. Core observed variables include:

Level sensors (7): L_T1–L_T7 (tank/reservoir levels)
Flow sensors (12): F_PU1–F_PU11 (pump flows), F_V2 (valve flow)
Pressure sensors (12): P_J269, P_J280, P_J289, P_J300, P_J302, P_J306, P_J307, P_J317, P_J415, P_J422, P_J256, P_J14
Pump and valve settings (12): S_PU1–S_PU11 (pump speed/status), S_V2 (valve status)

Data are chronologically indexed in hourly increments over 12,938 samples (≈1.5 years). Each row in the merged dataset contains the datetime, 43 sensor readings, and a binary Attack_Flag label (0 = normal, 1 = attack).
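As a rough check of the stated time span and row layout, the sketch below builds an hourly index of 12,938 timestamps and counts the fields in one row. The start timestamp and the assumption of a single contiguous index are illustrative only and are not taken from the dataset.

```python
from datetime import datetime, timedelta

N_SAMPLES = 12_938   # hourly rows reported for the merged dataset
N_SENSORS = 43       # sensor readings per row

# Hypothetical start time, used only to illustrate an hourly index.
start = datetime(2014, 1, 6, 0)
index = [start + timedelta(hours=h) for h in range(N_SAMPLES)]

span = index[-1] - index[0]
years = span.total_seconds() / (365.25 * 24 * 3600)

# One row = datetime + 43 sensor readings + Attack_Flag
columns_per_row = 1 + N_SENSORS + 1

print(f"span: {span.days} days (~{years:.2f} years)")
print(f"columns per row: {columns_per_row}")
```

Under these assumptions the span comes out slightly below 1.5 years, consistent with the "≈1.5 years" figure quoted above.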

2. Attack Design and Operational Scenarios

The dataset authors embedded a diverse set of cyber-physical attack types by directly manipulating the EPANET model:

Valve closures: Sudden or ramped closure of specific valves (e.g., V2)
Pump manipulations: Speed reductions or shutdowns (PUx pumps)
Sensor tampering: Introduction of time-varying additive biases to level, flow, or pressure sensors
Flow diversions/leaks: Imposed false demand profiles to simulate water theft or leak scenarios

Attack periods are interspersed among normal operation, with attacks comprising only 488 out of 12,938 samples (3.77%). The attack intervals vary from a single hour to multi-day episodes, with stealthy attacks implemented via slow ramping (e.g., pump speed adjusted by 1–2% per hour) and subtle, noise-level sensor biases. Many attack campaigns are multi-stage and spatially distributed across the system zones.

3. Data Characteristics and Analytical Challenges

Temporal and Multivariate Aspects

Sampling: Hourly over 12,938 timestamps (N), with 43 numeric sensor variables (D) per sample serving as predictors, plus the datetime index and the binary target.
Temporal structure: Pronounced autocorrelation at 1–3 hour lags in time series, notably in pump-related variables. Physical coupling (e.g., between S_PUx and F_PUx) is confirmed by Pearson correlations up to |ρ| ≈ 0.9.
Multicollinearity: Sensor clusters, particularly among spatially proximate pressure junctions, exhibit high redundancy.
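The temporal and cross-sensor statistics mentioned above can be computed as in the following minimal sketch. It uses a synthetic pump speed and flow pair rather than BATADAL data; the daily cycle, the noise levels, and the factor linking flow to speed are all hypothetical values chosen for illustration.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def autocorr(x, lag):
    """Lag-k autocorrelation: correlation of the series with itself shifted by k."""
    return pearson(x[:-lag], x[lag:])

random.seed(0)
hours = range(24 * 30)  # 30 days of hourly samples

# Synthetic stand-ins for an S_PUx / F_PUx pair (not real BATADAL values).
speed = [1.0 + 0.3 * math.sin(2 * math.pi * t / 24) + random.gauss(0, 0.02)
         for t in hours]
flow = [80.0 * s + random.gauss(0, 3.0) for s in speed]

for lag in (1, 2, 3):
    print(f"autocorr(speed, lag={lag}) = {autocorr(speed, lag):.3f}")
print(f"pearson(speed, flow) = {pearson(speed, flow):.3f}")
```

On real data the same two functions would be applied column by column; strongly correlated pairs are candidates for the redundancy described under multicollinearity.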

Class Imbalance

Normal operation is dominant (12,450 samples, 96.23%) versus rare attacks (488 samples, 3.77%), an imbalance ratio of roughly 25:1. A classifier that always predicts "normal" would therefore reach about 96% accuracy while detecting no attacks, so accuracy alone is uninformative here, and the subtlety of the embedded attacks makes the minority class harder still to learn.
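The class proportions follow directly from the reported counts. The short calculation below reproduces them and adds a trivial always-normal baseline; the baseline is an illustration introduced here, not a model from the benchmark.

```python
n_total = 12_938
n_attack = 488
n_normal = n_total - n_attack

attack_share = n_attack / n_total        # fraction of attack samples
imbalance_ratio = n_normal / n_attack    # normal samples per attack sample

# Illustrative baseline: label every sample as "normal".
baseline_accuracy = n_normal / n_total
baseline_recall = 0 / n_attack           # no attack is ever flagged

print(f"normal: {n_normal}, attack share: {attack_share:.2%}")
print(f"imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"baseline accuracy: {baseline_accuracy:.2%}, recall: {baseline_recall:.0%}")
```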

4. Preprocessing, Feature Engineering, and Resampling

No missing values were present, and thus no data imputation was required. Standardized procedures included:

Indexing: Parsing DATETIME as the time-series index for alignment.
Feature construction:
- Lag features: $x_{t-1}$, $x_{t-3}$
- Difference: $\Delta_{1t} = x_t - x_{t-1}$
- 5-hour rolling window statistics: $\mu_{5t} = \text{mean}(x_{t-4}, \ldots, x_t)$, $\sigma_{5t} = \text{std}(x_{t-4}, \ldots, x_t)$
- This expansion increased the feature set from 43 to 93 predictors.
Normalization: StandardScaler (zero mean, unit variance) for Random Forest and eXtreme Gradient Boosting (XGBoost); MinMaxScaler ([0,1] range) for Long Short-Term Memory networks (LSTM).
Dataset splitting: Stratified 80%/20% partition (10,350 train, 2,588 validation), preserving the Attack_Flag ratio.
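The feature construction and scaling steps above can be sketched in plain Python for a single sensor series. The function names are hypothetical, and the use of the sample standard deviation for the rolling window is an assumption, since the source does not state which convention was used.

```python
from statistics import mean, pstdev, stdev

WINDOW = 5  # 5-hour rolling window, as described above

def engineer(x):
    """Return one feature dict per time step that has a full 5-hour history."""
    rows = []
    for t in range(WINDOW - 1, len(x)):
        window = x[t - WINDOW + 1 : t + 1]   # x[t-4] ... x[t]
        rows.append({
            "x": x[t],
            "lag1": x[t - 1],
            "lag3": x[t - 3],
            "diff1": x[t] - x[t - 1],
            "roll_mean5": mean(window),
            "roll_std5": stdev(window),      # sample std (assumption)
        })
    return rows

def standardize(values):
    """Zero mean, unit variance, using the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def min_max(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

level = [2.0, 2.5, 3.0, 3.5, 4.0, 3.0, 2.0]  # toy tank-level readings
features = engineer(level)
print(features[0])
print(min_max([2.0, 4.0, 6.0]))
```

The first four time steps are dropped because they lack a full window. In a real pipeline the scaling parameters would normally be estimated on the training partition only and then reused on the validation partition.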

Handling Imbalance: SMOTE

The Synthetic Minority Oversampling Technique (SMOTE) was applied exclusively to the training set. For each minority instance $x_i$, a neighbor $x_{NN}$ is drawn from its $k$ nearest minority-class neighbors and a synthetic point is created on the segment between them: $x' = x_i + \lambda(x_{NN} - x_i)$, with $\lambda \sim \text{Uniform}(0,1)$. Post-SMOTE, the training set was balanced to ~9,956 samples per class; validation data remained unbalanced to reflect operational reality.
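The interpolation rule can be illustrated with a simplified, standard-library sketch. It is not the implementation used in the benchmark: the function name `smote`, the default `k`, the random choice of the base instance, and the toy points are all assumptions made for illustration.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    instance and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_i = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to x_i.
        others = [p for p in minority if p is not x_i]
        neighbors = sorted(others, key=lambda p: math.dist(x_i, p))[:k]
        x_nn = rng.choice(neighbors)
        lam = rng.random()                       # lambda ~ Uniform(0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_nn)])
    return synthetic

# Toy two-feature minority class (hypothetical values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
new_points = smote(minority, n_new=20, k=3)
print(len(new_points), new_points[0])
```

Every synthetic point lies on a segment between two existing minority points, so SMOTE fills in the region the minority class already occupies rather than extrapolating beyond it.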

5. Evaluation Protocols and Metrics

Detection models were evaluated using standard metrics for binary classification in the attack class:

Precision: $\text{precision} = TP / (TP + FP)$
Recall: $\text{recall} = TP / (TP + FN)$
F₁-score:

$F_1 = 2\, \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

ROC-AUC: $AUC = \int_0^1 TPR(FPR^{-1}(u))\, du$, quantifying the area under the curve of true positive rate versus false positive rate
Thresholding: Default “attack” prediction at $P(\mathrm{attack}) \geq 0.5$.
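The metrics above can be computed from predicted attack probabilities as in this minimal sketch. The labels and scores are toy values, the function names are hypothetical, and AUC is computed through its equivalent pairwise-ranking form (the fraction of attack/normal pairs in which the attack sample receives the higher score) rather than by integrating the curve.

```python
def confusion(y_true, scores, threshold=0.5):
    """Count TP, FP, FN, TN for the attack class at a probability threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); undefined ratios are reported as 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """AUC as the probability that an attack outranks a normal sample."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 0, 1, 1, 0, 1]                      # toy labels
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.05, 0.9]     # toy P(attack)

tp, fp, fn, tn = confusion(y_true, scores)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
auc = roc_auc(y_true, scores)
print(tp, fp, fn, tn, precision, recall, f1, auc)
```

The zero-division guards mean that a model which never predicts "attack" scores 0 on precision, recall, and F₁, while its AUC can still take any value because AUC depends only on the ranking of the scores, not on the threshold.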

6. Benchmark Findings and Comparative Analysis

Key results from validation (n=2,588) demonstrate the relative strengths of different learning architectures:

Model	F₁-score	AUC	Precision	Recall
Random Forest	0.6051	0.9548	0.6082	0.6020
XGBoost	0.7470	0.9684	0.9118	0.6327
LSTM	0.0000	0.4460	0.0000	0.0000
RF+XGB+LSTM	0.7205	0.9723	0.9206	0.5918

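The hybrid ensemble (stacking Random Forest, XGBoost, and LSTM using logistic regression) achieves an F₁-score of 0.7205 and an AUC of 0.9723. It has the highest AUC and precision (0.9206) of the four models, but XGBoost alone attains a higher F₁-score (0.7470) and recall (0.6327).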
RF and XGB cope well with imbalanced tabular features; LSTM alone fails on the minority attack class (zero precision and recall, AUC below 0.5), which the source attributes to class rarity, but it is reported to contribute positively within a heterogeneous stack.
The confusion matrix for the hybrid ensemble yielded 2,485 true negatives, 5 false positives, 40 false negatives, and 58 true positives.
SHAP-based feature analysis highlights time-aware engineered variables (lags, rolling means) as constituting ~40% of the top 20 predictors, underscoring the necessity of temporal statistics.
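The reported confusion matrix can be cross-checked against the hybrid ensemble's row in the table. The calculation below uses only the four counts given above; the variable names are illustrative.

```python
# Hybrid ensemble confusion matrix as reported for the validation set.
tn, fp, fn, tp = 2485, 5, 40, 58

n_validation = tn + fp + fn + tp
n_attacks = tp + fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)      # equivalent to the harmonic-mean form
false_positive_rate = fp / (fp + tn)

print(f"n = {n_validation}, attacks = {n_attacks}")
print(f"precision = {precision:.4f}, recall = {recall:.4f}, F1 = {f1:.4f}")
print(f"false positive rate = {false_positive_rate:.4f}")
```

The counts sum to the validation size of 2,588 and reproduce the tabulated precision, recall, and F₁-score, which also shows that the ensemble's errors are dominated by missed attacks rather than false alarms.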

A plausible implication is that heterogeneous ensembles leveraging both static and sequential models, when combined with temporal feature augmentation and targeted resampling strategies, provide superior robustness to stealthy, temporally correlated attacks in multivariate industrial process data (Ahmed, 16 Dec 2025).

7. Context, Limitations, and Research Applications

BATADAL is a standard resource for water system intrusion detection, notable for its realistic design: hourly sampling, a span of roughly 1.5 years, and the inclusion of sophisticated, hard-to-detect cyber-physical attacks. It is directly suitable for binary classification and sequential anomaly detection studies in industrial control domains.

Key limitations include the artificiality of the simulated environment, a potential lack of coverage for sophisticated or coordinated attack types beyond those implemented, and the class imbalance and sensor redundancy inherent in the data. The public availability and detailed labeling allow meaningful benchmarking of new time-series detection algorithms, preferably those that explicitly address class rarity, feature engineering, and temporal dependencies in critical infrastructure over extended horizons.


References

Ahmed (2025). Hybrid Ensemble Method for Detecting Cyber-Attacks in Water Distribution Systems Using the BATADAL Dataset.
