SIDED: Industrial Energy Disaggregation Dataset
- SIDED is a synthetic dataset generated via high-fidelity digital twin simulations to benchmark NILM algorithms in industrial environments.
- It offers detailed aggregate and appliance-level power traces from diverse facility types, geographies, and operational conditions.
- Its Appliance-Modulated Data Augmentation (AMDA) method rescales each appliance signal according to its share of total power, reducing disaggregation error and improving model generalization across configurations.
A Synthetic Industrial Dataset for Energy Disaggregation (SIDED) is an open-source, physically realistic dataset designed for benchmarking and developing Non-Intrusive Load Monitoring (NILM) algorithms in industrial environments. SIDED is generated using calibrated digital twin simulations to capture the diverse, complex, and often overlapping load profiles of industrial facilities and their constituent appliances. The dataset and its associated augmentation methods address data scarcity, privacy constraints, and generalization challenges inherent in industrial NILM research, providing both aggregate and appliance-level power traces across a range of facility types, geographies, and operational regimes.
1. Digital Twin-Based Dataset Generation
SIDED is produced by high-fidelity digital twin simulations of industrial energy systems. These models:
- Represent the physical behaviors of industrial appliances (e.g., combined heat and power (CHP), cooling systems, electric vehicle supply equipment (EVSE), photovoltaics (PV), background appliances) and their controllers, as well as environmental processes such as irradiation heating and energy dissipation.
- Are calibrated to real operational and meteorological data (e.g., from an office building in Offenbach, Germany) to achieve physical realism. The calibration achieves ≤3% annual discrepancy relative to historical measurements.
- Support data generalization by adapting simulation parameters (facility size, configuration, control setpoints, weather) according to real-world building data and location-specific profiles.
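The calibration criterion above can be illustrated with a small check. This is a minimal sketch that assumes "annual discrepancy" means the relative difference between simulated and measured annual energy totals; the source does not spell out the exact definition. The function name `annual_discrepancy` and the monthly values are hypothetical and are not SIDED data.

```python
def annual_discrepancy(simulated_kwh, measured_kwh):
    """Relative difference between simulated and measured annual energy totals."""
    sim_total = sum(simulated_kwh)
    meas_total = sum(measured_kwh)
    if meas_total == 0:
        raise ValueError("measured total must be non-zero")
    return abs(sim_total - meas_total) / abs(meas_total)


# Hypothetical monthly energy totals in kWh (illustrative only, not SIDED data).
measured = [120.0, 110.0, 105.0, 95.0, 90.0, 100.0,
            115.0, 118.0, 98.0, 102.0, 108.0, 119.0]
simulated = [118.0, 112.0, 103.0, 97.0, 92.0, 101.0,
             113.0, 120.0, 99.0, 100.0, 110.0, 121.0]

discrepancy = annual_discrepancy(simulated, measured)
# Assumed acceptance threshold: the 3% bound quoted in the text.
within_tolerance = discrepancy <= 0.03
print(f"annual discrepancy: {discrepancy:.2%}, within 3%: {within_tolerance}")
```

A check on annual totals can hide monthly errors that cancel out, so a stricter calibration check would also compare shorter windows.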
SIDED contains data for three canonical industrial facility types—dealer (retail), office (administrative), and logistics (warehouse/distribution)—each simulated at three geographically and climatically distinct locations: Offenbach (Germany), Los Angeles (USA), and Tokyo (Japan). Each facility-location pairing features one year of simulated data at 1-minute granularity (525,600 samples/configuration), accompanied by 5-minute downsampled versions for computational expedience in experiments.
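The configuration grid described here can be enumerated directly. The sketch below only restates the counts given in the text (three facility types, three locations, one non-leap year at 1-minute and 5-minute resolution). The lowercase identifiers are illustrative and are not taken from the repository's file naming.

```python
from itertools import product

# Facility types and locations as listed in the text; the identifiers are illustrative.
FACILITIES = ["dealer", "office", "logistics"]
LOCATIONS = ["Offenbach", "Los Angeles", "Tokyo"]

MINUTES_PER_YEAR = 365 * 24 * 60  # one non-leap year at 1-minute resolution

configurations = list(product(FACILITIES, LOCATIONS))
samples_1min = MINUTES_PER_YEAR
samples_5min = MINUTES_PER_YEAR // 5

for facility, location in configurations:
    print(f"{facility:<10} {location:<12} 1-min: {samples_1min}  5-min: {samples_5min}")
```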
For each configuration, the dataset includes:
- Aggregate active power load
- Appliance-level active power for EVSE, cooling systems (CS), PV (can be negative, denoting generation), CHP (generation), and background appliances (BA)
- Ambient temperature and solar radiation measurements
This synthetic design enables ground-truth access for both appliance-level signals and aggregate measurements under a wide array of realistic industrial operating envelopes.
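As a minimal sketch of how these signals relate, the snippet below builds an aggregate from appliance traces and derives a 5-minute version. It rests on two assumptions not confirmed by the source: that the aggregate equals the sample-wise sum of the appliance traces with generation entered as negative power (for CHP as well as PV), and that the 5-minute data are block averages of the 1-minute data. The trace values are made up.

```python
def aggregate_power(appliances):
    """Sum appliance traces sample by sample; generation enters as negative power."""
    lengths = {len(trace) for trace in appliances.values()}
    if len(lengths) != 1:
        raise ValueError("all appliance traces must have the same length")
    return [sum(samples) for samples in zip(*appliances.values())]


def downsample_mean(trace, factor=5):
    """Average consecutive blocks of `factor` samples, dropping a trailing partial block."""
    usable = len(trace) - len(trace) % factor
    return [sum(trace[i:i + factor]) / factor for i in range(0, usable, factor)]


# Ten hypothetical 1-minute samples in kW (illustrative values only).
appliances = {
    "EVSE": [0, 0, 7, 7, 7, 7, 7, 0, 0, 0],
    "CS": [3] * 10,
    "PV": [-2, -2, -2, -4, -4, -4, -4, -2, -2, -2],  # negative = generation
    "CHP": [-5] * 10,                                 # assumed negative as well
    "BA": [10] * 10,
}

aggregate = aggregate_power(appliances)
aggregate_5min = downsample_mean(aggregate, factor=5)
print(aggregate)
print(aggregate_5min)
```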
2. Dataset Characteristics and Diversity
The SIDED dataset is architected to reflect the operational heterogeneity typical of industrial energy systems:
- Appliance Diversity: Signals cover both consumers (EVSE, CS, BA) and producers (PV, CHP), with operating patterns spanning always-on operation (CHP), periodic operation (EVSE, PV), seasonal variation (PV, CS, CHP), and complex multi-pattern behavior (BA).
- Temporal and Climatic Variation: Each facility-location simulates distinct working schedules, holiday observance, and weather-influenced loads, producing broad coverage of scenarios relevant to global industrial contexts.
- Physical Consistency: Simulation outputs obey the same physics for both measured and hypothetical configurations; the calibrated model is transferred to new geographic contexts by swapping in the weather files and demand patterns of the target site.
- Open-Source Availability: All raw signals and metadata are accessible at https://github.com/ChristianInterno/SIDED.
The resulting dataset enables systematic study of appliance-level disaggregation across a range of seasonal, diurnal, and anomalous events.
3. Appliance-Modulated Data Augmentation (AMDA)
SIDED introduces Appliance-Modulated Data Augmentation (AMDA) as a principled, computationally efficient method for boosting the diversity and generalization of NILM model training data. The central premise is that standard augmentation techniques—such as random warping or naive scaling—are suboptimal for industrial contexts, where appliance powers differ by orders of magnitude and signals are often continuously overlapping.
AMDA addresses this by scaling each appliance's time series using its relative contribution to the total aggregate power, such that more prominent appliances are downscaled and underrepresented ones are upscaled. The method is formally defined as follows:
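- Let $P_i(t)$ denote the power signal of appliance $i$ at time $t$.
- Compute total appliance consumption: $P_{\text{total}} = \sum_i \sum_t P_i(t)$
- The scaling factor for appliance $i$ is $S_i = s \left(1 - \frac{\sum_t P_i(t)}{P_{\text{total}}}\right)$,
where $s$ is a tunable augmentation intensity.
- The augmented appliance signal: $P_i^{\text{aug}}(t) = S_i \, P_i(t)$
- The augmented aggregate: $P_{\text{agg}}^{\text{aug}}(t) = \sum_i P_i^{\text{aug}}(t)$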
This formulation gives the training data a more balanced mix of appliance combinations and power levels, narrowing the gap between training and test distributions, which is an especially acute issue for industrial applications.
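The AMDA steps can be sketched in a few lines. This is an illustration under stated assumptions, not the reference implementation. It assumes the scaling factor takes the form $S_i = s\,(1 - p_i)$, where $p_i$ is appliance $i$'s share of total consumption, which matches the description that dominant appliances are downscaled and minor ones upscaled. The function names, the value `s = 2.0`, and the traces are hypothetical.

```python
def amda_scaling_factors(appliances, s):
    """Assumed form S_i = s * (1 - p_i), with p_i the appliance's share of total consumption."""
    consumption = {name: sum(trace) for name, trace in appliances.items()}
    total = sum(consumption.values())
    if total == 0:
        raise ValueError("total consumption must be non-zero")
    return {name: s * (1.0 - c / total) for name, c in consumption.items()}


def amda_augment(appliances, s):
    """Scale each appliance trace by its factor and rebuild the aggregate as their sum."""
    factors = amda_scaling_factors(appliances, s)
    augmented = {
        name: [factors[name] * p for p in trace]
        for name, trace in appliances.items()
    }
    aggregate = [sum(samples) for samples in zip(*augmented.values())]
    return augmented, aggregate, factors


# Hypothetical traces in kW: BA dominates, EVSE is underrepresented.
appliances = {
    "BA": [8.0, 8.0, 8.0, 8.0],
    "EVSE": [0.0, 4.0, 4.0, 0.0],
}

augmented, augmented_aggregate, factors = amda_augment(appliances, s=2.0)
print(factors)              # BA is scaled down (about 0.4), EVSE is scaled up (about 1.6)
print(augmented_aggregate)
```

With generators such as PV entered as negative power, the share $p_i$ can become negative or exceed one, so a practical implementation would need to decide how to treat them (for example, by using absolute magnitudes). The sketch leaves that choice open.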
4. Experimental Evaluation and Model Generalization
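Experimental assessment of SIDED and AMDA centers on generalization to out-of-distribution appliance, facility, and location settings, using normalized disaggregation error (NDE) as the primary evaluation metric:
$\mathrm{NDE} = \frac{\sum_{i,t} \left(\hat{P}_i(t) - P_i(t)\right)^2}{\sum_{i,t} P_i(t)^2}$
where $\hat{P}_i(t)$ is the model's estimate of the power of appliance $i$ at time $t$ and $P_i(t)$ is the ground truth. Lower values are better.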
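As a minimal sketch, NDE can be computed as the summed squared estimation error divided by the summed squared ground-truth power. This is the commonly used definition; the exact normalization used in the paper may differ (some works take a square root or compute it per appliance). The function name and the traces are hypothetical.

```python
def nde(true_traces, predicted_traces):
    """Normalized disaggregation error: squared error over squared ground-truth power."""
    squared_error = 0.0
    squared_truth = 0.0
    for name, truth in true_traces.items():
        prediction = predicted_traces[name]
        if len(prediction) != len(truth):
            raise ValueError(f"length mismatch for appliance {name!r}")
        squared_error += sum((p - y) ** 2 for p, y in zip(prediction, truth))
        squared_truth += sum(y ** 2 for y in truth)
    if squared_truth == 0:
        raise ValueError("ground-truth power is zero everywhere")
    return squared_error / squared_truth


# Hypothetical ground truth and model output in kW (illustrative values only).
truth = {"BA": [10.0, 10.0, 10.0, 10.0], "EVSE": [0.0, 5.0, 5.0, 0.0]}
estimate = {"BA": [9.0, 10.0, 11.0, 10.0], "EVSE": [0.0, 4.0, 5.0, 1.0]}

score = nde(truth, estimate)
print(f"NDE = {score:.4f}")
```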
Key experimental scenarios include:
- Appliance Variation: Models are trained on one facility-configuration and tested on a variant with a scaled (up or down) background appliance (BA), simulating real-world retrofits or energy-saving interventions. AMDA-trained models achieve a reduction in NDE from 0.223 (no augmentation) to 0.040 (AMDA), an 81% improvement for the Attention-TCN disaggregation architecture.
- Facility Variation: When training on one facility-location configuration (e.g., office in Offenbach) and testing on a different facility or city (e.g., logistics in Los Angeles), models trained with AMDA reach NDE = 0.093, compared to 0.451 (no augmentation) and 0.290 (random data augmentation). This closes most of the gap to the oracle (the theoretical best case), which yields NDE = 0.009.
- Data Distribution Analysis: Statistical measures (Jensen–Shannon divergence, UMAP projection) confirm that AMDA brings the training and test data distributions into close alignment, explaining its superior generalization.
A plausible implication is that AMDA facilitates robust NILM model learning under data imbalances, and enables practical cross-facility transfer of disaggregation models without requiring site-specific retraining or prohibitively large datasets.
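The distribution comparison mentioned above can be illustrated with a small Jensen-Shannon divergence computation over histograms of power values. This is a sketch of the general technique, not the paper's analysis pipeline: the bin edges, sample values, and function names are hypothetical, and base-2 logarithms are used so the result lies between 0 (identical distributions) and 1 (disjoint distributions).

```python
import math


def histogram(values, edges):
    """Normalized histogram over half-open bins [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_k = 0 contribute nothing."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their mixture."""
    mixture = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical aggregate power samples in kW (illustrative values only).
edges = [0, 5, 10, 15]
train_hist = histogram([1, 2, 3, 6, 7, 12], edges)
test_hist = histogram([1, 6, 7, 8, 12, 13], edges)

divergence = js_divergence(train_hist, test_hist)
print(f"JS divergence = {divergence:.4f}")
```

A lower divergence between augmented training data and test data would be consistent with the claim that augmentation narrows the distribution gap, although a single histogram comparison is only a coarse check.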
5. Implications for Industrial Energy Disaggregation Research
SIDED constitutes, as of the referenced publication, the first physically realistic, granular, and reproducible synthetic dataset for industrial NILM. By leveraging parameterized digital twin simulations and AMDA, it addresses multiple central challenges:
- Data Scarcity and Diversity: Users can generate arbitrary durations and configurations of aggregate and appliance-level signals, including rare or hypothetical appliance mixes.
- Privacy Preservation: Synthetic data with physical realism eliminates the need to share sensitive operational records.
- Benchmarking and Method Development: SIDED supports systematic benchmarking of NILM algorithms under controlled, reproducible scenarios, and allows direct evaluation of generalization gaps.
- Model Robustness: AMDA-augmented datasets yield NILM models that are robust to shifts in facility configuration, appliance replacement, or environmental conditions—crucial for real-world deployment.
6. Summary Table: SIDED and AMDA at a Glance
Aspect | Feature/Result |
---|---|
Dataset Type | Physically realistic, multi-site, multi-appliance, open-source synthetic dataset |
Generation Method | Digital twin simulation calibrated to real data, parameterizable by site/weather |
Appliances Modeled | EVSE, cooling, PV, CHP, background (consumers and generators; various patterns) |
Temporal Granularity | 1-minute (525,600 samples per configuration/year), 5-minute for experimental runs |
Data Augmentation | Appliance-Modulated Data Augmentation (AMDA), scaling by relative appliance impact |
Best Experimental Result | AMDA achieves NDE = 0.093 vs. 0.451 (no aug.), 0.290 (random aug.), 0.009 (oracle) |
Data Distribution Shift | AMDA directly aligns training and test set distributions (confirmed by UMAP/JS) |
Open Data/Code | https://github.com/ChristianInterno/SIDED |
7. Future Directions
As established in the paper, possible extensions include integrating additional appliance types, supporting higher-frequency monitoring, simulating more complex industrial process correlations, and extending AMDA to utilize context-aware or scenario-specific scaling strategies. Further development toward benchmarking transfer to real industrial datasets and incorporating domain-specific priors into the simulation/augmentation pipeline represents an open area for research collaboration.
SIDED, combined with the AMDA approach, provides a foundational resource for empirical NILM studies in industrial contexts, enabling robust evaluation, fair benchmarking, and practical model deployment in environments where real-world data collection remains difficult, costly, or privacy-sensitive.