LEAD: Energy Anomaly Detection Dataset
- The Large-scale Energy Anomaly Detection (LEAD) dataset is a comprehensive, annotated benchmark for identifying anomalies in non-residential building energy consumption.
- It offers high-frequency time-series data with detailed ground-truth labels for both point and sequential anomalies using a rigorous manual annotation process.
- The dataset supports evaluation of various algorithms—including adversarial GAN-LSTM approaches—by providing extensive building metadata and standardized preprocessing workflows.
The Large-scale Energy Anomaly Detection (LEAD) dataset is a rigorously annotated benchmark for energy anomaly detection in commercial buildings, designed to accelerate research into advanced monitoring and management of building-level electricity consumption. Originating from the ASHRAE Great Energy Predictor III (“GEPIII”) corpus and publicly available on platforms such as Kaggle and GitHub, LEAD provides comprehensive, high-frequency, time-series data, including detailed ground-truth anomaly labels and extensive building-level metadata, suitable for evaluation and comparison of classical and deep learning anomaly detection methods (Gulati et al., 2022, Nia et al., 14 Jan 2026).
1. Dataset Structure and Statistical Overview
LEAD comprises 1,413 distinct smart electricity meter time series, each spanning the full leap year 2016, giving 8,784 hourly samples per meter and ∼12.41 million raw data points. After excluding missing values, 12,060,910 hourly readings are manually inspected and annotated. The dataset focuses exclusively on electricity meters, drawn from 1,636 non-residential buildings across 16 sites. A reduced benchmarking subset (“Kaggle LEAD”) includes 406 buildings (8,760 samples per building; 200 for training, 206 for testing) (Gulati et al., 2022, Nia et al., 14 Jan 2026). All meters exhibit significant periodicity: pronounced diurnal (workday) cycles and weekly seasonality, with the mean and variance of consumption varying widely across buildings (tens to thousands of kW). Data undergo a log transformation and per-feature z-score normalization prior to labeling and model training.
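The preprocessing step can be sketched in a few lines of numpy. This is a minimal sketch: the source states only "a log transformation," so the use of `log1p` is an assumption, and the meter values below are synthetic.

```python
import numpy as np

def preprocess(readings):
    """Log-transform and z-score normalize one meter's hourly readings.

    `readings`: 1-D array of hourly kW values (missing values already removed).
    The exact log variant is an assumption; log1p keeps zero readings finite.
    """
    x = np.log1p(np.asarray(readings, dtype=float))  # compress the heavy right tail
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma if sigma > 0 else x - mu

# Synthetic meter with a daily cycle over the 2016 leap year (366 * 24 hours)
hours = np.arange(8784)
demo = 100 + 50 * np.sin(2 * np.pi * hours / 24)
z = preprocess(demo)  # zero mean, unit variance
```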
2. Annotation Methodology and Label Taxonomy
Anomalies in LEAD are defined as either:
- Point anomaly: A single-hour reading that strongly deviates from its immediate neighbors or expected profile.
- Sequential (collective) anomaly: A contiguous block of hours (potentially spanning multiple days) that disrupts regular daily/weekly usage patterns.
Annotation utilizes a dedicated web-based interface. Three expert annotators spent approximately 100 person-hours manually reviewing all hourly points. For each 24-hour window, standard daily shapes are verified and anomalous spikes or suppressions are flagged; in the absence of clear periodicity, aggregate daily totals are compared with those of adjacent days. All anomaly labels are binary (anomalous/normal) and further classified as “point” or “sequential.” In total, 199,640 hourly points are labeled anomalous (∼1.65% of inspected readings); 45% are point anomalies and 55% sequential. Anomalies occur in 1,226 of the 1,413 meters (∼87%). Inter-annotator agreement is substantial (Fleiss’ κ > 0.75), and post-hoc random audits confirm >98% label consistency (Gulati et al., 2022).
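As a quick consistency check on the reported counts (all figures taken from the text above; the script itself is purely illustrative):

```python
total_inspected = 12_060_910   # annotated hourly readings
total_anomalies = 199_640      # hourly points labeled anomalous

rate = total_anomalies / total_inspected
print(f"anomaly rate: {rate:.2%}")  # ≈1.66%, consistent with the reported ∼1.65%
```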
3. Data Format, Metadata, and Accessibility
LEAD data is organized in the following schema:
| Field | Description | Format |
|---|---|---|
| meter_id | Unique integer for each meter | Integer |
| timestamp | Reading time (YYYY-MM-DD HH:MM:SS) | ISO Date/Time |
| meter_reading | Original hourly consumption (kW) | Float |
| anomaly_label | 0 = normal, 1 = anomaly | Integer |
| anomaly_type | point / sequential | Categorical |
Additional metadata includes building characteristics (building_id, site_id, primary_use, square_feet, year_built, floor_count), with weather and time features either present or derivable. All files are CSV, with a JSON manifest for navigation (Gulati et al., 2022). Access is provided under CC-BY-4.0 via GitHub (https://github.com/samy101/lead-dataset) and as a Kaggle competition for GAN-LSTM studies (Nia et al., 14 Jan 2026).
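A minimal loading sketch for the schema above, using only the standard library. The sample rows are illustrative, not real LEAD records, and real usage would read the CSV files from the repository instead of an inline string.

```python
import csv
import io

# Two illustrative rows in the documented schema (anomaly_type empty for normal rows).
sample = """meter_id,timestamp,meter_reading,anomaly_label,anomaly_type
7,2016-01-01 00:00:00,231.5,0,
7,2016-01-01 01:00:00,1893.2,1,point
"""

readings = []
for row in csv.DictReader(io.StringIO(sample)):
    readings.append({
        "meter_id": int(row["meter_id"]),
        "timestamp": row["timestamp"],                 # ISO date/time string
        "meter_reading": float(row["meter_reading"]),  # hourly consumption (kW)
        "anomaly_label": int(row["anomaly_label"]),    # 0 = normal, 1 = anomaly
        "anomaly_type": row["anomaly_type"] or None,   # "point" / "sequential" / None
    })
```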
4. Benchmark Algorithms and Evaluation Protocols
LEAD1.0 systematically evaluates eight classical anomaly detection algorithms (PyOD suite) on flattened 24-hour windows, incorporating time-of-day and calendar features:
- CBLOF: Scores samples via clustering, flagging small-cluster points.
- Feature Bagging: LOF models built on randomly selected feature subsets, scores aggregated.
- HBOS: Per-feature histogram density estimation; the anomaly score is the inverse of the product of the bin densities (equivalently, a sum of negative log densities).
- Isolation Forest: Random tree ensemble, anomaly score via isolation path length.
- KNN: Score as $d_k(x)$, the distance from a sample $x$ to its $k$-th nearest neighbor.
- LOF: Ratio of the average local reachability density of a sample’s neighbors to its own: $\mathrm{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\mathrm{lrd}_k(y)}{\mathrm{lrd}_k(x)}$.
- MCD: Robust Mahalanobis distance using minimal covariance subset.
- OC-SVM: Decision function in feature space, $f(x) = \operatorname{sign}(w^{\top}\phi(x) - \rho)$, with anomalies having $f(x) < 0$.
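As one concrete instance, the KNN score (distance to the k-th nearest neighbor) can be implemented from scratch in numpy. This is a sketch for illustration, not the PyOD implementation used in the benchmark.

```python
import numpy as np

def knn_scores(X, k=5):
    """Anomaly score per sample: Euclidean distance to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    # Full pairwise distance matrix -- O(n^2), fine for a sketch.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)  # column 0 is the zero self-distance
    return d_sorted[:, k]          # k-th nearest neighbor, excluding self

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 24))     # 100 flattened 24-hour windows
X[0] += 10.0                       # inject one obvious outlier
scores = knn_scores(X, k=5)        # the outlier receives the largest score
```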
All experiments treat a window as anomalous if any contained hour is labeled anomalous. Performance metrics are Precision ($P = \frac{TP}{TP+FP}$), Recall ($R = \frac{TP}{TP+FN}$), and the $F_1$ score ($F_1 = \frac{2PR}{P+R}$) (Gulati et al., 2022).
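The window-labeling rule and the three metrics can be sketched as follows (toy labels; a minimal numpy sketch, not the benchmark harness):

```python
import numpy as np

def window_labels(hourly_labels, window=24):
    """A window is anomalous if any contained hour is labeled anomalous."""
    h = np.asarray(hourly_labels)
    n = (len(h) // window) * window                  # drop any trailing partial window
    return h[:n].reshape(-1, window).any(axis=1).astype(int)

def prf1(y_true, y_pred):
    """Precision, Recall, and F1 from binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

hours = np.zeros(72, dtype=int)
hours[30] = 1                  # one anomalous hour on day 2
y = window_labels(hours)       # -> [0, 1, 0]
```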
Classical Algorithm Performance
| Model | Precision | Recall | F1 |
|---|---|---|---|
| CBLOF | 0.900 | 0.277 | 0.425 |
| Feature Bagging | 0.899 | 0.279 | 0.424 |
| KNN | 0.902 | 0.284 | 0.431 |
| HBOS | 0.896 | 0.258 | 0.397 |
| Isolation Forest | 0.895 | 0.270 | 0.413 |
| OC-SVM | 0.899 | 0.276 | 0.421 |
| LOF | 0.900 | 0.281 | 0.426 |
| MCD | 0.901 | 0.276 | 0.422 |
All models achieve high precision (∼0.90) but low recall (∼0.26–0.28), indicating conservative detection behavior. KNN yields the highest $F_1$ score (0.431) among classical methods (Gulati et al., 2022).
5. Advanced Deep Learning Benchmarks
Subsequent work introduces a deep adversarial anomaly detection pipeline—GAN-LSTM—evaluated on the Kaggle LEAD subset (Nia et al., 14 Jan 2026). The GAN-LSTM architecture employs:
- Generator: Three stacked LSTM layers (32→64→128 units), mapping 100-dimensional noise vectors $z \in \mathbb{R}^{100}$ to 60-step synthetic sequences.
- Discriminator: One LSTM layer (100 units) followed by a dense output with sigmoid activation, producing a realness score $D(x)$ and an intermediate feature representation $f(x)$.
- Adversarial losses: Binary cross-entropy for G/D.
- Test-time scoring: Latent-space inversion (TAnoGAN style), optimizing a latent code $z^{*}$ to minimize the reconstruction residual $R(x) = \|x - G(z^{*})\|$, yielding the anomaly score $A(x) = (1-\lambda)\,R(x) + \lambda\,L_D(x)$, where $L_D$ is a discriminator feature-matching loss.
A 60-hour windowing strategy is adopted to balance daily/multi-day pattern coverage against detection latency. Per-building standardization mitigates cross-building scale disparities. The anomaly threshold on the score is chosen via $F_1$ maximization on held-out validation data (Nia et al., 14 Jan 2026).
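The per-building standardization and 60-hour windowing can be sketched as below (synthetic series; a minimal numpy sketch under the stated protocol, not the pipeline's actual code):

```python
import numpy as np

def building_windows(series, window=60):
    """Per-building z-score standardization, then non-overlapping 60-hour windows."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)  # remove this building's scale
    n = (len(x) // window) * window        # drop any trailing partial window
    return x[:n].reshape(-1, window)

# One synthetic building-year at Kaggle-LEAD length (8,760 hourly samples)
series = np.random.default_rng(1).normal(500.0, 40.0, size=8760)
W = building_windows(series)  # 146 windows of 60 hours each
```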
Deep Model Performance
| Method | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Isolation Forest | 57.83% | 0.54 | 0.51 | 0.52 | 0.59 |
| One-Class SVM | 60.17% | 0.57 | 0.56 | 0.56 | 0.61 |
| LSTM Autoencoder | 65.42% | 0.61 | 0.59 | 0.60 | 0.70 |
| Attention LSTM Autoencoder | 68.93% | 0.65 | 0.63 | 0.64 | 0.74 |
| Variational Autoencoder | 67.58% | 0.63 | 0.61 | 0.62 | 0.72 |
| TAnoGAN | 71.86% | 0.69 | 0.66 | 0.67 | 0.77 |
| GAN-LSTM | 89.73% | 0.88 | 0.89 | 0.89 | 0.83 |
GAN-LSTM demonstrates substantial improvement over all baselines ($F_1 = 0.89$; ROC-AUC = 0.83; accuracy 89.73%), validating the potential of adversarial temporal modeling in energy anomaly detection (Nia et al., 14 Jan 2026).
6. Protocols, Practical Workflows, and Known Limitations
Recommended usage comprises:
- Ingesting raw data and relevant metadata (CSV/JSON).
- Computing features: hour, weekday, month.
- Applying a log transformation and per-feature z-score normalization.
- Windowing: 24-hour (classical) or 60-hour (GAN-LSTM) intervals, non-overlapping.
- Annotating windows: the presence of any anomalous hour marks window as anomalous.
- Data splits: non-anomalous windows (train/val/test, 80/10/10), anomalous windows (10/20/70).
- Model training/testing, threshold tuning on the validation subset, and reporting of standard metrics (Precision, Recall, $F_1$) (Gulati et al., 2022, Nia et al., 14 Jan 2026).
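The asymmetric split step above (non-anomalous windows 80/10/10, anomalous windows 10/20/70) can be sketched as follows (hypothetical window indices; numpy assumed):

```python
import numpy as np

def split(indices, fracs, rng):
    """Shuffle indices and split into len(fracs) parts by the given fractions."""
    idx = rng.permutation(indices)
    cuts = np.cumsum([int(round(f * len(idx))) for f in fracs[:-1]])
    return np.split(idx, cuts)

rng = np.random.default_rng(42)
normal_idx = np.arange(1000)            # hypothetical non-anomalous window indices
anom_idx = np.arange(1000, 1100)        # hypothetical anomalous window indices

n_tr, n_va, n_te = split(normal_idx, (0.8, 0.1, 0.1), rng)  # 800 / 100 / 100
a_tr, a_va, a_te = split(anom_idx, (0.1, 0.2, 0.7), rng)    # 10 / 20 / 70
```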
Known limitations include:
- Exclusivity to electricity meters—no annotation for water, steam, or chilled water.
- Manual labeling may introduce annotator bias, despite high inter-annotator agreement.
- Dataset covers a single leap year (seasonality beyond 12 months is absent).
- Non-residential buildings only, in a limited set of climates; generalization to residential stock or other regions is untested.
- Binary anomaly labels; detailed event-type categorization (“meter fault” vs. “equipment shutdown”) not distinguished (Gulati et al., 2022).
7. Research Context and Strategic Implications
The LEAD dataset furnishes an authoritative benchmark for energy anomaly detection, enabling rigorous evaluation of both classical statistical and state-of-the-art deep learning methods. Its scale, annotation quality, and public accessibility address longstanding challenges in commercial building monitoring, including asset health, non-technical loss detection, and optimization for global sustainability. The demonstrated efficacy of adversarial temporal models on LEAD suggests promising avenues for leveraging complex sequential architectures and adversarial loss formulations in future multivariate, multi-modal, or geographically transferable anomaly detection systems (Gulati et al., 2022, Nia et al., 14 Jan 2026).
A plausible implication is that further architectural refinements (e.g., incorporating exogenous variables or auxiliary encoders for latent inversion) and extension to broader classes of building types and utilities will be necessary for fully generalizable energy anomaly detection platforms. Industry practitioners and academic researchers are therefore encouraged to exploit LEAD’s open-source tools and data, applying robust temporal modeling strategies and acknowledging current structural and contextual limitations.