gridfm-datakit-v1: Power Flow Dataset Generator
- gridfm-datakit-v1 is an open-source Python library that synthesizes realistic power flow (PF) and optimal power flow (OPF) datasets using diverse perturbation techniques.
- The tool supports scalable generation for transmission grids up to 30,000 buses, integrating real load profiles, stochastic noise, and arbitrary N-k topology changes.
- It offers both command-line and Jupyter interfaces with built-in validation, DC baselines, and parallel processing for efficient and reproducible dataset creation.
gridfm-datakit-v1 is an open-source Python library for generating large-scale, realistic, and diverse Power Flow (PF) and Optimal Power Flow (OPF) datasets, targeting machine learning training and evaluation for transmission power grids. It addresses critical limitations of existing PF/OPF dataset generators, including the lack of realistic stochastic load and topology perturbations, the restriction of PF datasets to OPF-feasible points, and the use of fixed generator cost functions in OPF datasets. By combining global load scaling from real-world profiles, localized noise, arbitrary N-k topology perturbations, and diverse generator cost functions, gridfm-datakit-v1 enables efficient and scalable synthesis of both realistic and challenging PF/OPF scenarios, supporting up to 30,000-bus PF and 10,000-bus OPF instances. The tool provides parallel execution, data validation, built-in DC baselines, and both command-line and interactive interfaces. All algorithms and code are available under Apache 2.0 on GitHub and PyPI (Puech et al., 16 Dec 2025).
1. Functional Overview and Key Capabilities
gridfm-datakit-v1 is designed for realistic and large-scale PF/OPF dataset generation in transmission grids. Its core features include:
- Scalability: Capable of PF generation for grids up to 30,000 buses and OPF generation for up to 10,000 buses.
- Hybrid load perturbation: Integrates real aggregated time series (e.g., ERCOT hourly data) with independent, per-bus multiplicative noise, maintaining both temporal and spatial load correlations.
- Topology variation: Supports arbitrary N-k (with k specified by the user) component outages, including lines, transformers, and generators, as well as random admittance scaling.
- Data generation modes:
  - OPF mode: Generates strictly feasible OPF solutions, randomly permuting or rescaling generator cost functions across topologies.
  - PF mode: Produces PF samples that can contain natural operating constraint violations (including voltage, angle, and branch limits).
- Interfaces and outputs: Provides command-line and Jupyter-based interactive interfaces, multi-format outputs (bus, branch, and generator matrices), integrated DC solutions, and thorough statistics and validation utilities.
These capabilities enable the production of datasets that exhibit broad operational diversity, constrained and unconstrained behaviors, and a range of cost landscapes, addressing the generalization and robustness testing needs of data-driven solver research (Puech et al., 16 Dec 2025).
2. System Architecture and Workflow
The architecture is modular and scriptable:
- Core modules:
  - gridfm_datakit.generator: Implements scenario creation, load and topology perturbation, and PF/OPF solvers.
  - gridfm_datakit.interactive: Jupyter notebook GUI for configuration.
  - gridfm_datakit.scripts: Command-line utilities for generation, validation, and statistics.
- Principal classes:
  - ScenarioBuilder: Orchestrates all perturbations and scenario generation tasks.
  - DataWriter: Exports dataset matrices and associated DC solutions.
  - Validator: Verifies AC power balance and constraint compliance post-generation.
  - StatsReporter: Compiles residual, loading, and runtime distributions.
Installation and Usage Examples:
| Action | Command/Snippet | Location |
|---|---|---|
| Install (PyPI) | pip install gridfm-datakit | Command line |
| Install (source) | git clone https://github.com/gridfm/gridfm-datakit<br>cd gridfm-datakit<br>pip install . | Command line |
| Jupyter interface | from gridfm_datakit.interactive import interactive_interface<br>interactive_interface() | Python/Jupyter |
| Generate (CLI/YAML) | gridfm-datakit generate path/to/config.yaml | Command line |
| Python API usage | <pre>from gridfm_datakit.generator import ScenarioBuilder<br>builder = ScenarioBuilder(<br>grid_file="case118.m",<br>load_profile="ercot_load.csv",<br>n_k=2,<br>local_noise=0.2,<br>admittance_noise=0.2)<br>samples = builder.generate_pf_samples(n_loads=100, topologies_per_load=10)<br>builder.writer.write_csv("pf_data/")</pre> | Python script |
Parallel execution across CPU cores is built in: because scenarios are independent of one another, they can be sampled concurrently (Puech et al., 16 Dec 2025).
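The idea of independent scenarios running in parallel while staying reproducible can be illustrated with a minimal sketch. It does not use the gridfm-datakit API: `make_scenario` and `generate_all` are hypothetical names, the scenario itself is a placeholder random draw, and a thread pool is used only to keep the example short (a process pool would be the usual choice for CPU-bound solver calls). The point is that giving each scenario its own child seed makes the output independent of the worker count.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def make_scenario(seed_seq):
    """Hypothetical stand-in for one scenario: draw five per-bus load multipliers."""
    rng = np.random.default_rng(seed_seq)
    return rng.uniform(0.8, 1.2, size=5)


def generate_all(n_scenarios, base_seed=0, max_workers=4):
    # One child seed per scenario, so scenario i yields the same numbers
    # regardless of how many workers run or in which order they finish.
    children = np.random.SeedSequence(base_seed).spawn(n_scenarios)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order.
        return list(pool.map(make_scenario, children))


serial = generate_all(8, max_workers=1)
parallel = generate_all(8, max_workers=4)
```

Under these assumptions `serial` and `parallel` hold identical arrays, which is the property a reproducible parallel generator needs.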
3. Mathematical Models
Power Flow Formulation
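For bus $i$ (complex voltage $V_i = |V_i| e^{j\theta_i}$, net injection $S_i = P_i + jQ_i$), with bus admittance matrix entries $Y_{ij} = G_{ij} + jB_{ij}$ and angle differences $\theta_{ij} = \theta_i - \theta_j$, the standard polar form is:
- Active power equation: $P_i = |V_i| \sum_j |V_j| \left( G_{ij}\cos\theta_{ij} + B_{ij}\sin\theta_{ij} \right)$
- Reactive power equation: $Q_i = |V_i| \sum_j |V_j| \left( G_{ij}\sin\theta_{ij} - B_{ij}\cos\theta_{ij} \right)$
- Branch flow limits: $|S_{ij}| \le S_{ij}^{\max}$
AC OPF Formulation
For generator set $\mathcal{G}$, generator cost $c_g(P_g)$:
$$\min \sum_{g \in \mathcal{G}} c_g(P_g)$$
subject to the power flow equations and branch flow limits above, together with voltage-magnitude, angle-difference, and generator output limits.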
In PF mode, gridfm-datakit first solves an ACOPF on the unperturbed topology, applies further perturbations, and then solves a PF with the dispatch held fixed, potentially producing states that violate operating constraints. In OPF mode, it permutes the cost functions and then re-solves the ACOPF, so every stored sample is an ACOPF solution and therefore feasible (Puech et al., 16 Dec 2025).
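As a small numerical check of the power flow equations, the sketch below evaluates the bus injections of a hypothetical 2-bus system in two ways: from the complex form $S_i = V_i \, (\sum_j Y_{ij} V_j)^*$ and from the standard polar form of the active and reactive power equations. The network data and variable names are illustrative assumptions, not values from the library.

```python
import numpy as np

# Hypothetical 2-bus system: one line with series impedance 0.01 + 0.1j p.u.
y = 1.0 / (0.01 + 0.1j)
Y = np.array([[y, -y], [-y, y]])          # bus admittance matrix

vm = np.array([1.00, 0.98])               # voltage magnitudes (p.u.)
va = np.array([0.0, np.deg2rad(-3.0)])    # voltage angles (rad)
V = vm * np.exp(1j * va)

# Complex form: S_i = V_i * conj(sum_j Y_ij V_j)
S = V * np.conj(Y @ V)

# Polar form: P_i and Q_i from G, B and the angle differences theta_ij
G, B = Y.real, Y.imag
theta = va[:, None] - va[None, :]
P = vm * ((G * np.cos(theta) + B * np.sin(theta)) @ vm)
Q = vm * ((G * np.sin(theta) - B * np.cos(theta)) @ vm)

# The residual between the two forms should vanish up to round-off.
residual = np.max(np.abs(S - (P + 1j * Q)))
```

The same residual computation, applied to stored voltages and injections, is one way to think about the AC power balance check performed after generation. The sum of `P` over both buses equals the line's resistive losses and is therefore positive here.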
4. Data Generation and Perturbation Methodology
- Global load scaling: An aggregated load time series scales the nominal bus loads, and independent per-bus multiplicative noise is then applied, with the noise level defaulting to 0.2. This injects both correlated (temporal and spatial) and independent fluctuations.
- Topology perturbation (N-k outages): The outage cardinality $k$ is configurable, with exhaustive or sampled enumeration of outage combinations; combinations that would island part of the network are discarded.
- Admittance perturbations: The resistance and reactance of each branch are scaled by random factors drawn from a configurable range.
- PF mode: The base OPF is solved with all constraints enforced; topology and admittance perturbations are then applied with the dispatch held fixed. Violations of operational constraints (voltage, branch, angle, reactive/slack limits) are not artificially suppressed.
- OPF mode: Generator cost coefficients are randomly permuted or rescaled prior to each ACOPF, yielding constraint-satisfying but cost-diverse optimal points.
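The global-scaling bullet above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the library's implementation: the function name `perturb_loads` is hypothetical, the profile is normalised by its peak, and the per-bus noise is assumed to be uniform on [1 - local_noise, 1 + local_noise]. The source only states that the noise is multiplicative with a default level of 0.2.

```python
import numpy as np


def perturb_loads(nominal_mw, profile, local_noise=0.2, rng=None):
    """Return perturbed loads with shape (n_steps, n_buses).

    nominal_mw  : nominal active load per bus
    profile     : aggregated system load over time (any positive units)
    local_noise : half-width of the assumed uniform per-bus multiplier
    """
    rng = np.random.default_rng(rng)
    nominal_mw = np.asarray(nominal_mw, dtype=float)
    profile = np.asarray(profile, dtype=float)

    # Global factor: the profile normalised by its peak, shared by all buses
    # (assumption: peak normalisation).
    scale = profile / profile.max()

    # Local factor: an independent multiplier for every (time step, bus) pair
    # (assumption: uniform distribution).
    noise = rng.uniform(1 - local_noise, 1 + local_noise,
                        size=(profile.size, nominal_mw.size))

    return scale[:, None] * nominal_mw[None, :] * noise


nominal = np.array([50.0, 120.0, 80.0])
loads = perturb_loads(nominal, [30e3, 45e3, 60e3], rng=0)
```

Each row of `loads` is one time step: the shared factor carries the temporal pattern of the profile to every bus, while the noise term decorrelates individual buses.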
This approach generates datasets reflecting the operational variety and constraint boundary-crossing behaviors observed in real large-scale grids (Puech et al., 16 Dec 2025).
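Sampled N-k outages with rejection of islanding cases can be sketched with a plain connectivity check. The code below is a simplified illustration: it handles branch outages only (the tool also covers transformers and generators), and `is_connected`, `sample_outages`, and the 4-bus test network are hypothetical names and data introduced for this example.

```python
import random


def is_connected(n_buses, branches):
    """Depth-first search over an undirected branch list, starting at bus 0."""
    adj = {b: [] for b in range(n_buses)}
    for u, v in branches:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {0}, [0]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == n_buses


def sample_outages(n_buses, branches, k, n_samples, seed=0, max_tries=1000):
    """Draw up to n_samples sets of k branch indices whose removal keeps the grid connected."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_samples:
            break
        out = set(rng.sample(range(len(branches)), k))
        remaining = [br for i, br in enumerate(branches) if i not in out]
        if is_connected(n_buses, remaining):   # reject islanding combinations
            accepted.append(sorted(out))
    return accepted


# Hypothetical 4-bus ring (0-1-2-3-0) with one chord (0-2).
branches = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
outages = sample_outages(4, branches, k=2, n_samples=5)
```

In this network, removing branches 0 and 1 isolates bus 1 and removing branches 2 and 3 isolates bus 3, so those two combinations are never returned. Sampling is with replacement, so the same outage set may appear more than once.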
5. Scalability, Validation, and Performance
gridfm-datakit-v1 has been benchmarked on systems ranging from 24 to 10,000 buses:
- Benchmark runtimes:
  - PF: 200,000 samples: IEEE-24 in 2.7 CPU h, IEEE-118 in 6.4 CPU h, GOC-2,000 in 248 CPU h, GOC-10,000 in 1384 CPU h (99% convergence).
  - OPF: 200,000 samples: IEEE-24 in 21 CPU h, IEEE-118 in 46 CPU h, GOC-2,000 in 1104 CPU h, GOC-10,000 in 3628 CPU h (98% convergence).
- Execution environment: 20–100 CPU cores, 32–256 GB RAM. Core solvers are PowerModels.jl/Ipopt, with further gains possible from fast linear solvers (2–6× speedup).
- Parallelization: All scenarios are processed concurrently, with a single OPF per PF load scenario and rapid evaluation of multiple post-OPF topology cases.
- Validation/statistics: Data integrity (AC balance, constraint satisfaction) is verified post-generation. Dedicated tooling computes residuals, loadings, and runtime histograms.
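To put the benchmark figures in perspective, the short script below converts the reported CPU hours into CPU seconds per sample and into an idealised wall-clock time. The conversion to wall-clock time assumes perfect parallel scaling on 100 cores, which is an assumption made for this example and not a measured result.

```python
N_SAMPLES = 200_000

# CPU hours reported in the benchmark list above.
cpu_hours = {
    "PF":  {"IEEE-24": 2.7, "IEEE-118": 6.4, "GOC-2,000": 248, "GOC-10,000": 1384},
    "OPF": {"IEEE-24": 21, "IEEE-118": 46, "GOC-2,000": 1104, "GOC-10,000": 3628},
}


def cpu_seconds_per_sample(hours, n_samples=N_SAMPLES):
    """Average CPU time spent on one sample, in seconds."""
    return hours * 3600 / n_samples


def ideal_wall_hours(hours, n_cores):
    """Wall-clock hours under the assumption of perfect parallel scaling."""
    return hours / n_cores


for mode, cases in cpu_hours.items():
    for case, hours in cases.items():
        print(f"{mode:3s} {case:10s} "
              f"{cpu_seconds_per_sample(hours):7.3f} CPU s/sample  "
              f"{ideal_wall_hours(hours, 100):6.2f} h on 100 cores")
```

For example, the IEEE-24 PF run averages about 0.049 CPU seconds per sample, while the GOC-10,000 OPF run averages about 65.3 CPU seconds per sample.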
6. Comparison with Established PF/OPF Dataset Tools
gridfm-datakit-v1 exhibits several technical distinctions compared to OPFData, OPF-Learn, PGLearn, and PFΔ:
| Feature | gridfm-datakit-v1 | Other Libraries |
|---|---|---|
| Load modeling | Real profile + noise (hybrid) | Uniform/convex sampling |
| Topology perturbations | Arbitrary N-k | N-1 only or none |
| Generator cost functions | Permuted / rescaled | Fixed |
| PF constraint violations | Natural from physics | Suppressed or sampled within limits |
| Scalability (bus count) | 30,000 PF / 10,000 OPF | 2,000–6,000 |
| Realism | Temporal coherence, constraint violations | Often lacks real profiles/boundary cases |
| Licensing and openness | Full Apache 2.0, DC baseline, validation | Varied |
gridfm-datakit-v1’s emphasis on realistic load traces, diverse topology/cost perturbations, scalability, and both realistic and infeasible operational states allows it to support advanced ML and optimization methods—including GNNs and foundation models—for power flow, contingency analysis, and market forecasting (Puech et al., 16 Dec 2025).
7. Extensibility and Customization for Research Workflows
- Load profiles: Users can inject custom CSV-based time series at the system or bus level.
- Cost functions: Generator coefficients can be user-defined, non-convex, or sampled from custom distributions.
- Topology procedures: Alternative or user-defined outage samplers/subgraph selectors can be incorporated.
- Physics modeling: Bus-level renewable injections, dynamic tap changers, or FACTS device behaviors may be added.
- Downstream utility: Seamless integration with graph-based ML pipelines (e.g., via gridfm-graphkit) is supported.
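For the graph-based ML use case, a branch matrix has to be turned into an edge list before a GNN can consume it. The sketch below shows one common way to do that with NumPy. It does not use the gridfm-graphkit API, and the column layout of the branch table (from_bus, to_bus, r, x) is a hypothetical choice for this example.

```python
import numpy as np

# Hypothetical branch table: columns are from_bus, to_bus, r, x (p.u.).
branch = np.array([
    [0, 1, 0.01, 0.10],
    [1, 2, 0.02, 0.20],
    [0, 2, 0.01, 0.15],
])

src = branch[:, 0].astype(int)
dst = branch[:, 1].astype(int)

# Store each branch in both directions so message passing is symmetric.
edge_index = np.stack([np.concatenate([src, dst]),
                       np.concatenate([dst, src])])

# Edge features (r, x) are duplicated to match the doubled edge list.
edge_attr = np.concatenate([branch[:, 2:], branch[:, 2:]])
```

The resulting `edge_index` has shape (2, 2 x n_branches), the layout expected by most graph learning libraries, with `edge_attr` aligned row by row to its columns.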
A plausible implication is that gridfm-datakit-v1 forms a foundation both for dataset creation and for benchmarking emerging ML-based OPF solvers and grid modeling tools at previously unattainable scale, diversity, and physical realism (Puech et al., 16 Dec 2025).