gridfm-datakit-v1: Power Flow Dataset Generator

Updated 23 December 2025

gridfm-datakit-v1 is an open-source Python library that synthesizes realistic power flow (PF) and optimal power flow (OPF) datasets using diverse perturbation techniques.
The tool supports scalable generation for transmission grids up to 30,000 buses, integrating real load profiles, stochastic noise, and arbitrary N-k topology changes.
It offers both command-line and Jupyter interfaces with built-in validation, DC baselines, and parallel processing for efficient and reproducible dataset creation.

gridfm-datakit-v1 is an open-source Python library for generating large-scale, realistic, and diverse Power Flow (PF) and Optimal Power Flow (OPF) datasets, targeting machine learning training and evaluation for transmission power grids. It addresses critical limitations of existing PF/OPF dataset generators, including the lack of realistic stochastic load and topology perturbations, the restriction of PF datasets to OPF-feasible points, and the use of fixed generator cost functions in OPF datasets. By integrating global load scaling from real-world profiles, localized noise, arbitrary N-k topology perturbations, and diverse generator cost functions, gridfm-datakit-v1 enables the efficient and scalable synthesis of both realistic and challenging PF/OPF scenarios, supporting up to 30,000-bus PF and 10,000-bus OPF instances. The tool incorporates parallelism, data validation, and built-in DC baselines, and supports both command-line and interactive usage. All algorithms and code are available under Apache 2.0 on GitHub and PyPI (Puech et al., 16 Dec 2025).

1. Functional Overview and Key Capabilities

gridfm-datakit-v1 is designed for realistic and large-scale PF/OPF dataset generation in transmission grids. Its core features include:

- Scalability: Capable of PF generation for grids up to 30,000 buses and OPF generation for up to 10,000 buses.
- Hybrid load perturbation: Integrates real aggregated time series (e.g., ERCOT hourly data) with independent, per-bus multiplicative noise, maintaining both temporal and spatial load correlations.
- Topology variation: Supports arbitrary N-k (with k specified by the user) component outages, including lines, transformers, and generators, as well as random admittance scaling.
- Data generation modes:
  - OPF mode: Generates strictly feasible OPF solutions, randomly permuting or rescaling generator cost functions across topologies.
  - PF mode: Produces PF samples that can contain natural operating constraint violations (including voltage, angle, and branch limits).
- Interfaces and outputs: Provides command-line and Jupyter-based interactive interfaces, multi-format outputs (bus, branch, and generator matrices), integrated DC solutions, and thorough statistics and validation utilities.

These capabilities enable the production of datasets that exhibit broad operational diversity, constrained and unconstrained behaviors, and a range of cost landscapes, addressing the generalization and robustness testing needs of data-driven solver research (Puech et al., 16 Dec 2025).

2. System Architecture and Workflow

The architecture is modular and scriptable:

- Core modules:
  - `gridfm_datakit.generator`: Implements scenario creation, load and topology perturbation, and PF/OPF solvers.
  - `gridfm_datakit.interactive`: Jupyter notebook GUI for configuration.
  - `gridfm_datakit.scripts`: Command-line utilities for generation, validation, and statistics.
- Principal classes:
  - `ScenarioBuilder`: Orchestrates all perturbations and scenario generation tasks.
  - `DataWriter`: Exports dataset matrices and associated DC solutions.
  - `Validator`: Verifies AC power balance and constraint compliance post-generation.
  - `StatsReporter`: Compiles residual, loading, and runtime distributions.

Installation and Usage Examples:

| Action | Command/Snippet | Location |
|---|---|---|
| Install (PyPI) | `pip install gridfm-datakit` | CLI |
| Install (source) | `git clone https://github.com/gridfm/gridfm-datakit`<br>`cd gridfm-datakit`<br>`pip install .` | CLI |
| Jupyter interface | `from gridfm_datakit.interactive import interactive_interface`<br>`interactive_interface()` | Python/Jupyter |
| Generate (CLI/YAML) | `gridfm-datakit generate path/to/config.yaml` | CLI |
| Python API usage | <pre>from gridfm_datakit.generator import ScenarioBuilder<br>builder = ScenarioBuilder(<br>grid_file="case118.m",<br>load_profile="ercot_load.csv",<br>n_k=2,<br>local_noise=0.2,<br>admittance_noise=0.2)<br>samples = builder.generate_pf_samples(n_loads=100, topologies_per_load=10)<br>builder.writer.write_csv("pf_data/")</pre> | Python script |

Parallel execution across CPU cores is standard, with scenario independence leveraged for efficient sampling (Puech et al., 16 Dec 2025).
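
Because scenarios are independent, parallel generation can stay reproducible if each scenario gets its own random stream. The following is a minimal generic sketch of that idea, not the library's implementation: `run_scenario` and `generate` are hypothetical names, and the worker only draws a noise vector where a real pipeline would run a solver. A thread pool is used so the snippet runs anywhere; `concurrent.futures.ProcessPoolExecutor` exposes the same `map` interface and is the usual choice for CPU-bound solves.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def run_scenario(args):
    """Hypothetical worker: draw one per-bus noise vector from its own stream."""
    idx, seed_seq = args
    rng = np.random.default_rng(seed_seq)
    return idx, rng.uniform(0.8, 1.2, size=3)


def generate(n_scenarios, base_seed=0, workers=4):
    """Run independent scenarios in a pool, one child seed per scenario."""
    # spawn() yields statistically independent child seeds from one base seed
    children = np.random.SeedSequence(base_seed).spawn(n_scenarios)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, whatever the finish order
        return list(pool.map(run_scenario, enumerate(children)))


results_a = generate(8, base_seed=42, workers=1)
results_b = generate(8, base_seed=42, workers=4)
```

Seeding per scenario rather than per worker means the output does not depend on how many workers are used or on how the scheduler assigns tasks.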

3. Mathematical Models

Power Flow Formulation

For bus $i$ with complex voltage $V_i = V_{m,i}e^{jV_{a,i}}$ and net injection $S_i = P_i + j Q_i$, where $G_{ij} + jB_{ij}$ is the $(i,j)$ entry of the bus admittance matrix:

Active power equation:

$P_i = \sum_{j \in \mathcal{N}(i)} V_{m,i} V_{m,j} \bigl(G_{ij} \cos(V_{a,i} - V_{a,j}) + B_{ij} \sin(V_{a,i} - V_{a,j})\bigr)$

Reactive power equation:

$Q_i = \sum_{j \in \mathcal{N}(i)} V_{m,i} V_{m,j} \bigl(G_{ij} \sin(V_{a,i} - V_{a,j}) - B_{ij} \cos(V_{a,i} - V_{a,j})\bigr)$

Branch flow limits:

$|S_{ij}| = \sqrt{P_{ij}^2 + Q_{ij}^2} \leq S_{ij}^{\max}.$
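
As a numerical illustration of the equations above, the sketch below evaluates $P_i$ and $Q_i$ in trigonometric form and cross-checks them against the equivalent complex expression $S = V \odot \overline{YV}$. The 3-bus network, its impedances, and the voltage values are made up for the example, and the sums run over all $j$ including $j = i$, as is standard when working with the full admittance matrix.

```python
import numpy as np


def power_injections(Y, vm, va):
    """Evaluate P_i and Q_i from the trigonometric power flow equations."""
    G, B = Y.real, Y.imag
    dtheta = va[:, None] - va[None, :]   # V_{a,i} - V_{a,j}
    vv = vm[:, None] * vm[None, :]       # V_{m,i} * V_{m,j}
    P = np.sum(vv * (G * np.cos(dtheta) + B * np.sin(dtheta)), axis=1)
    Q = np.sum(vv * (G * np.sin(dtheta) - B * np.cos(dtheta)), axis=1)
    return P, Q


# Hypothetical 3-bus network in per unit: series admittances only, no shunts
branches = {
    (0, 1): 1 / (0.01 + 0.05j),
    (1, 2): 1 / (0.02 + 0.08j),
    (0, 2): 1 / (0.015 + 0.06j),
}
Y = np.zeros((3, 3), dtype=complex)
for (i, j), y in branches.items():
    Y[i, i] += y
    Y[j, j] += y
    Y[i, j] -= y
    Y[j, i] -= y

vm = np.array([1.02, 0.99, 1.00])       # voltage magnitudes (p.u.)
va = np.deg2rad([0.0, -3.0, -1.5])      # voltage angles (rad)

P, Q = power_injections(Y, vm, va)

# Cross-check with the complex form S_i = V_i * conj(sum_j Y_ij V_j)
V = vm * np.exp(1j * va)
S = V * np.conj(Y @ V)

# Apparent power on branch (0, 1), measured at bus 0, against an assumed limit
S_01 = V[0] * np.conj(branches[(0, 1)] * (V[0] - V[1]))
s_max = 2.0                              # hypothetical rating (p.u.)
within_limit = abs(S_01) <= s_max
```

With no shunt elements, the injections sum to the total series losses, so `P.sum()` is positive for any non-flat voltage profile.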

AC OPF Formulation

For generator set $\mathcal{G}$ with generator cost $c_k(P_{g,k})$, where $\mathcal{G}_i \subseteq \mathcal{G}$ denotes the generators connected to bus $i$:

$\begin{aligned} \min_{V,\,P_g,\,Q_g} \quad & \sum_{k \in \mathcal{G}} c_k(P_{g,k}) \\ \text{subject to} \quad & \sum_{k \in \mathcal{G}_i} P_{g,k} - P_{d,i} = P_i(V), \quad \forall i \\ & \sum_{k \in \mathcal{G}_i} Q_{g,k} - Q_{d,i} = Q_i(V), \quad \forall i \\ & P_{g,k}^{\min} \leq P_{g,k} \leq P_{g,k}^{\max}, \quad \forall k \in \mathcal{G} \\ & Q_{g,k}^{\min} \leq Q_{g,k} \leq Q_{g,k}^{\max}, \quad \forall k \in \mathcal{G} \\ & V_{m,i}^{\min} \leq V_{m,i} \leq V_{m,i}^{\max}, \quad \forall i \\ & |S_{ij}| \leq S_{ij}^{\max}, \quad \forall\, (i,j) \end{aligned}$

In PF mode, gridfm-datakit first solves an ACOPF for the unperturbed topology, applies further perturbations, and then solves a PF with the dispatch held fixed, potentially producing states that violate operating constraints. In OPF mode, it permutes or rescales the generator cost functions and then re-solves the ACOPF, so each solution satisfies the constraints above by construction (Puech et al., 16 Dec 2025).

4. Data Generation and Perturbation Methodology

- Global load scaling: An aggregated time series $\{\mathrm{ref}_t\}$ scales the nominal bus loads, and independent per-bus multiplicative noise $\epsilon^{p,q}_{i,t} \sim \mathcal{U}(1-\sigma,1+\sigma)$ is then applied, with $\sigma$ defaulting to 0.2. The global factor injects temporally and spatially correlated variation, while the noise adds independent per-bus fluctuations.
- Topology perturbation (N-k outages): The outage cardinality $k$ is configurable, with exhaustive or sampled combinations; outages that would island part of the network are discarded.
- Admittance perturbations: Each branch's resistance and reactance are scaled randomly within $[\max(0, 1-\sigma), 1+\sigma]$.
- PF mode: The base ACOPF is solved with all constraints enforced; the topology and admittances are then perturbed, and a PF is solved with the dispatch held fixed. Violations of operational constraints (voltage, branch, angle, reactive/slack limits) are not artificially suppressed.
- OPF mode: Generator cost coefficients are randomly permuted or rescaled prior to each ACOPF, yielding constraint-satisfying but cost-diverse optimal points.
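
The load, admittance, and cost perturbations listed above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the library's code: all arrays are made-up inputs, the profile is assumed to be already normalized into scaling factors, the admittance factors are assumed uniform and drawn independently for resistance and reactance, and cost permutation is shown as a row shuffle of a hypothetical coefficient table.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2                                   # default noise level from the text

# Hypothetical inputs
pd_nom = np.array([50.0, 120.0, 80.0])        # nominal active loads (MW)
qd_nom = np.array([10.0, 30.0, 20.0])         # nominal reactive loads (MVAr)
ref = np.array([0.70, 0.85, 1.00, 0.90])      # assumed normalized profile ref_t

# Global scaling by ref_t, then independent per-bus noise for P and Q
eps_p = rng.uniform(1 - sigma, 1 + sigma, size=(ref.size, pd_nom.size))
eps_q = rng.uniform(1 - sigma, 1 + sigma, size=(ref.size, qd_nom.size))
pd = ref[:, None] * pd_nom[None, :] * eps_p   # shape (time steps, buses)
qd = ref[:, None] * qd_nom[None, :] * eps_q

# Admittance perturbation: scale r and x within [max(0, 1 - sigma), 1 + sigma]
r = np.array([0.010, 0.020, 0.015])           # branch resistances (p.u.)
x = np.array([0.050, 0.080, 0.060])           # branch reactances (p.u.)
lo, hi = max(0.0, 1 - sigma), 1 + sigma
r_pert = r * rng.uniform(lo, hi, size=r.shape)
x_pert = x * rng.uniform(lo, hi, size=x.shape)

# Cost permutation: reassign whole coefficient rows among generators
# (columns: quadratic, linear, constant; values are illustrative)
cost = np.array([[0.010, 20.0, 100.0],
                 [0.020, 15.0, 150.0],
                 [0.005, 30.0, 80.0]])
cost_perm = cost[rng.permutation(len(cost))]
```

The `max(0, 1 - sigma)` lower bound only matters when `sigma` exceeds 1, where it prevents negative scaling factors.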

This approach generates datasets reflecting the operational variety and constraint boundary-crossing behaviors observed in real large-scale grids (Puech et al., 16 Dec 2025).
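
The islanding filter in the N-k step can be illustrated with a small connectivity check. The sketch below enumerates every k-branch outage of a made-up 4-bus network and keeps only those that leave the graph connected; the function names are hypothetical, only branch outages are modeled (generator outages do not affect connectivity), and for large grids one would sample combinations instead of enumerating them.

```python
import random
from itertools import combinations


def is_connected(n_bus, branches):
    """Depth-first search from bus 0; True if every bus is reachable."""
    adj = {b: [] for b in range(n_bus)}
    for u, v in branches:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n_bus


def non_islanding_outages(n_bus, branches, k):
    """Enumerate all k-branch outages that keep the network connected."""
    keep = []
    for removed in combinations(range(len(branches)), k):
        remaining = [br for idx, br in enumerate(branches) if idx not in removed]
        if is_connected(n_bus, remaining):
            keep.append(removed)
    return keep


# Hypothetical 4-bus ring (0-1-2-3-0) with one chord (0-2)
branches = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n1 = non_islanding_outages(4, branches, 1)   # all 5 single outages survive
n2 = non_islanding_outages(4, branches, 2)   # 8 of the 10 pairs survive

# Sampled rather than exhaustive selection among the valid N-2 cases
sampled = random.Random(0).sample(n2, 3)
```

In this example the two rejected N-2 cases are the pairs that isolate bus 1 or bus 3, each of which has only two incident branches.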

5. Scalability, Validation, and Performance

gridfm-datakit-v1 demonstrates high-throughput and cross-architecture scalability:

- Benchmark runtimes:
  - PF: 200,000 samples—IEEE-24 in 2.7 CPU h, IEEE-118 in 6.4 CPU h, GOC-2,000 in 248 CPU h, GOC-10,000 in 1384 CPU h (99% convergence).
  - OPF: 200,000 samples—IEEE-24 in 21 CPU h, IEEE-118 in 46 CPU h, GOC-2,000 in 1104 CPU h, GOC-10,000 in 3628 CPU h (98% convergence).
- Execution environment: 20–100 CPU cores, 32–256 GB RAM. Core solvers are PowerModels.jl/Ipopt, with further gains possible from fast linear solvers (2–6× speedup).
- Parallelization: Scenarios are processed concurrently. In PF mode, a single OPF is solved per load scenario and then reused across multiple perturbed topologies, each of which requires only a PF solve.
- Validation/statistics: Data integrity (AC balance, constraint satisfaction) is verified post-generation. Dedicated tooling computes residuals, loadings, and runtime histograms.
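
To put the benchmark figures on a per-sample basis, the short calculation below converts the reported CPU hours into CPU-seconds per sample. The wall-clock helper assumes perfect scaling across cores, which is an idealization for illustration and not a reported result.

```python
N_SAMPLES = 200_000

# CPU hours for 200,000 samples, as reported above
cpu_hours = {
    "PF": {"IEEE-24": 2.7, "IEEE-118": 6.4, "GOC-2,000": 248, "GOC-10,000": 1384},
    "OPF": {"IEEE-24": 21, "IEEE-118": 46, "GOC-2,000": 1104, "GOC-10,000": 3628},
}

# CPU-seconds per generated sample
sec_per_sample = {
    mode: {case: hours * 3600 / N_SAMPLES for case, hours in cases.items()}
    for mode, cases in cpu_hours.items()
}


def ideal_wall_hours(hours, n_cores):
    """Wall-clock estimate assuming perfect parallel scaling (idealized)."""
    return hours / n_cores


for mode, cases in sec_per_sample.items():
    for case, seconds in cases.items():
        print(f"{mode:>3} {case:<11} {seconds:8.3f} CPU-s per sample")
```

On these figures a PF sample costs about 0.05 CPU-seconds on IEEE-24 and about 25 CPU-seconds on GOC-10,000, while an OPF sample on GOC-10,000 costs about 65 CPU-seconds.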

6. Comparison with Established PF/OPF Dataset Tools

gridfm-datakit-v1 exhibits several technical distinctions compared to OPFData, OPF-Learn, PGLearn, and PFΔ:

| Feature | gridfm-datakit-v1 | Other Libraries |
|---|---|---|
| Load modeling | Real profile + noise (hybrid) | Uniform/convex sampling |
| Topology perturbations | Arbitrary N-k | N-1 only or none |
| Generator cost functions | Permuted / rescaled | Fixed |
| PF constraint violations | Natural from physics | Suppressed or sampled within limits |
| Scalability (bus count) | 30,000 PF / 10,000 OPF | 2,000–6,000 |
| Realism | Temporal coherence, constraint violations | Often lacks real profiles/boundary cases |
| Licensing and openness | Full Apache 2.0, DC baseline, validation | Varied |

gridfm-datakit-v1's emphasis on realistic load traces, diverse topology and cost perturbations, scalability, and coverage of both constraint-satisfying and constraint-violating operating states allows it to support advanced ML and optimization methods, including GNNs and foundation models, for power flow, contingency analysis, and market forecasting (Puech et al., 16 Dec 2025).

7. Extensibility and Customization for Research Workflows

- Load profiles: Users can inject custom CSV-based time series at the system or bus level.
- Cost functions: Generator coefficients can be user-defined, non-convex, or sampled from custom distributions.
- Topology procedures: Alternative or user-defined outage samplers/subgraph selectors can be incorporated.
- Physics modeling: Bus-level renewable injections, dynamic tap changers, or FACTS device behaviors may be added.
- Downstream utility: Seamless integration with graph-based ML pipelines (e.g., via gridfm-graphkit) is supported.

A plausible implication is that gridfm-datakit-v1 forms a foundation both for dataset creation and for benchmarking emergent ML-based OPF solvers and grid modeling tools at previously unattainable scale, diversity, and physical realism (Puech et al., 16 Dec 2025).

References (1)

Puech et al. (2025). gridfm-datakit-v1: A Python Library for Scalable and Realistic Power Flow and Optimal Power Flow Data Generation.
