Data-Driven Reachability Analysis

Updated 28 September 2025

A data-driven reachability analysis framework is a methodology that uses data, rather than complete analytic models, to estimate all possible future states of an uncertain dynamical system.
It leverages simulations, measurements, and machine learning to identify sensitivity bounds and parameter uncertainties for reliable system monitoring.
The framework integrates offline and online data with formal guarantees, enabling real-time safety verification and robust control in cyber-physical applications.

A data-driven reachability analysis framework is a methodology for systematically over-approximating the set of all future states a (possibly uncertain or only partially known) dynamical system can attain, leveraging experimental, simulation, or online measurement data rather than requiring complete analytic models. This paradigm is vital for formal safety verification, control synthesis, and risk assessment in modern cyber-physical and hybrid systems in the face of modeling uncertainty, noise, and structural complexity. Data-driven reachability frameworks generalize classical model-based set propagation by identifying reachable sets directly from data, using simulation, measurement traces, machine learning algorithms, and probabilistic tools, and they provide a spectrum of guarantees ranging from formal “hard” safety to robust probabilistic bounds.

1. Key Principles and Scope

The central principle of a data-driven reachability analysis framework is replacing explicit knowledge of system dynamics with a structured use of data to (a) estimate dynamical sensitivity, (b) parameterize model uncertainty, or (c) directly compute the forward image of initial sets under admissible inputs and disturbances. This involves:

Learning sensitivity/discrepancy bounds from simulation or experimental traces to “bloat” simulated trajectories and form an outer approximation of the state evolution (Fan et al., 2017).
Estimating parameter sets (e.g., possible state-space model matrices) consistent with measurement data, often represented with zonotopes, matrix zonotopes, or other convex set classes (Alanwar et al., 2020, Alanwar et al., 2021, Li et al., 6 Feb 2024, Akhormeh et al., 21 Sep 2025).
Leveraging operator-theoretic lifting (Koopman or Perron–Frobenius) and data-driven finite-dimensional approximations to propagate distributions or moments for nonlinear systems (Matavalam et al., 2020, Li et al., 23 Feb 2025).
Scenario-based and support-set estimation approaches, including conformal inference, Christoffel function sublevel sets, and holdout-based sharp probabilistic error bounds, for systems lacking parametric structure or with high complexity (Devonport et al., 2021, Devonport et al., 2021, Dietrich et al., 9 Apr 2025, Hashemi et al., 20 May 2025).
Modular components that combine offline data (historical, simulated) with online adaptation (recursive estimation) and formal reasoning about safety properties without requiring model closure or completeness.

Frameworks are tailored for uncertain LTI or LTV systems, Lipschitz nonlinear systems, hybrid and piecewise affine systems, software/data-flow graphs, and stochastic dynamics. Applications span automotive safety-relevant control (e.g., powertrains, AEB, CAV platoons), pedestrian prediction, stochastic process control, mixed traffic environments, and safety filtering in LLM-controlled robots.

2. Methodological Approaches

Data-driven reachability frameworks combine several mathematical and computational methods:

Model Set Construction and Learning

Parameter-Set Identification: Constructs the set of model parameters consistent with measurement, input, and output data while accounting for noise. Common representations include zonotopes and matrix zonotopes:

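$\mathcal{M}_{AB} = (X_+ - \mathcal{M}_w)\begin{bmatrix} X_- \\ U_- \end{bmatrix}^{\dagger},$

where $X_-$ and $X_+$ are the state data matrices before and after the one-step time shift, $U_-$ is the corresponding input data matrix, $\mathcal{M}_w$ is a noise matrix zonotope, and $\dagger$ denotes the right pseudo-inverse of the vertically stacked data matrix (Alanwar et al., 2020).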
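As a minimal sketch of the parameter-set identification step, the NumPy code below builds the center and generators of a matrix zonotope of models $[A\ B]$ from data, stacking the state data over the input data before taking the pseudo-inverse. It assumes that every entry of the unknown noise matrix lies in $[-w, w]$ (one generator per entry). The function name `identify_model_set`, the two-state test system, and all numerical values are hypothetical and chosen only for illustration; this is not the cited authors' implementation.

```python
import numpy as np

def identify_model_set(X_minus, U_minus, X_plus, w_bound):
    """Matrix zonotope (center, generators) containing every [A B] consistent with the data.

    Assumes each entry of the unknown noise matrix W lies in [-w_bound, w_bound].
    """
    n, T = X_plus.shape
    H = np.vstack([X_minus, U_minus])   # stacked data, shape (n + m, T)
    H_pinv = np.linalg.pinv(H)          # right inverse when H has full row rank
    center = X_plus @ H_pinv            # the noise set is centered at zero
    generators = []
    for i in range(n):
        for j in range(T):
            E = np.zeros((n, T))
            E[i, j] = w_bound           # one generator per noise entry
            generators.append(E @ H_pinv)
    return center, generators

# Hypothetical two-state, one-input system used only to produce data.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])
T, w_bound = 20, 0.01
U = rng.uniform(-1.0, 1.0, size=(1, T))
W = rng.uniform(-w_bound, w_bound, size=(2, T))
X = np.zeros((2, T + 1))
for k in range(T):
    X[:, k + 1] = A_true @ X[:, k] + B_true @ U[:, k] + W[:, k]

center, generators = identify_model_set(X[:, :-1], U, X[:, 1:], w_bound)
```

The containment argument relies on the stacked data matrix having full row rank, so that multiplying by its pseudo-inverse from the right recovers $[A\ B]$ exactly for the true (unknown) noise realization.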

Recursive Estimation and Adaptation: Real-time model set estimation uses recursive least squares with zonotopic parameter sets and exponential forgetting. Recursive update formulas are:

$C_{k+1} = (I - K_k \Phi_k) C_k + K_k y_k,\ G_{k+1}^{(i)} = \lambda^{-1/2}(I - K_k \Phi_k) G_k^{(i)} \;\; \forall i,$

where $\lambda$ is the forgetting factor (Akhormeh et al., 21 Sep 2025).
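The two update formulas above can be sketched in a few lines of NumPy. The text does not define the gain $K_k$, so the sketch assumes the textbook recursive-least-squares gain and covariance recursion; the function name `zrls_step`, the scalar-output regression, and the initial values are hypothetical. Only the two formulas shown above are reproduced, and any further noise-related generator terms used in the cited work are not included.

```python
import numpy as np

def zrls_step(C, G, P, phi, y, lam=0.98):
    """One recursive update of a parameter zonotope with center C and generator matrix G.

    C: (p,) center; G: (p, q) generators as columns; P: (p, p) RLS matrix;
    phi: (p,) regressor; y: scalar measurement; lam: forgetting factor in (0, 1].
    """
    phi_row = phi.reshape(1, -1)
    # Textbook RLS gain (an assumption; the text does not define K_k).
    K = P @ phi_row.T / (lam + (phi_row @ P @ phi_row.T).item())
    M = np.eye(C.size) - K @ phi_row    # I - K_k Phi_k
    C_next = M @ C + K.ravel() * y      # center update
    G_next = M @ G / np.sqrt(lam)       # generator update with forgetting
    P_next = M @ P / lam                # textbook RLS recursion (assumption)
    return C_next, G_next, P_next

# Hypothetical noise-free regression y = phi . theta_true.
rng = np.random.default_rng(1)
theta_true = np.array([0.5, -0.3])
C, G, P = np.zeros(2), np.eye(2), 100.0 * np.eye(2)
for _ in range(200):
    phi = rng.uniform(-1.0, 1.0, size=2)
    C, G, P = zrls_step(C, G, P, phi, float(phi @ theta_true))
```

In this noise-free setting the center converges to the true parameter vector, while a forgetting factor below one keeps the estimator responsive to later parameter changes.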

Discrepancy Learning: Sensitivity functions of the form $\beta(x_1, x_2, t) = |x_1 - x_2| \cdot K e^{\gamma t}$ are estimated by drawing pairs of simulation traces and recasting the problem as learning a linear separator in the $(\ln(\text{error ratio}), t)$ space, with PAC-learning guarantees (Fan et al., 2017).
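To make the discrepancy form concrete, the sketch below converts a pair of traces into points in the $(t, \ln(\text{error ratio}))$ plane and fits a line that lies above all of them. It uses a simplified stand-in for the linear-separator learning step, namely a least-squares slope followed by lifting the intercept to the largest residual, and it carries no PAC guarantee by itself. The function names, the scalar traces, and the assumption that the first sample is taken at $t = 0$ are all hypothetical.

```python
import math

def log_ratio_points(trace_a, trace_b, times):
    """Turn two scalar traces sampled at `times` into (t, ln(error ratio)) points."""
    d0 = abs(trace_a[0] - trace_b[0])   # initial separation |x_1 - x_2|
    return [(t, math.log(abs(a - b) / d0))
            for t, a, b in zip(times, trace_a, trace_b) if a != b]

def learn_discrepancy(points):
    """Return (K, gamma) with ln(ratio) <= ln(K) + gamma * t on every sample point."""
    n = len(points)
    t_mean = sum(t for t, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in points)
    var = sum((t - t_mean) ** 2 for t, _ in points)
    gamma = cov / var                               # least-squares slope
    ln_K = max(y - gamma * t for t, y in points)    # lift the line above all points
    return math.exp(ln_K), gamma

def discrepancy(x1, x2, t, K, gamma):
    """beta(x1, x2, t) = |x1 - x2| * K * exp(gamma * t)."""
    return abs(x1 - x2) * K * math.exp(gamma * t)

# Hypothetical traces of x' = -0.5 x from two nearby initial states.
times = [0.1 * i for i in range(50)]
trace_a = [1.0 * math.exp(-0.5 * t) for t in times]
trace_b = [1.2 * math.exp(-0.5 * t) for t in times]
K, gamma = learn_discrepancy(log_ratio_points(trace_a, trace_b, times))
```

Bloating a simulated trajectory by `discrepancy(x1, x2, t, K, gamma)` then gives a tube that covers the sampled neighbor trajectory at every sample time.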

Set Propagation and Reach Tube Computation

Zonotope Arithmetic: Propagation is achieved by set-valued recursion:

$\hat{\mathcal{R}}_{k+1} = \mathcal{M}_\Sigma(\hat{\mathcal{R}}_k \times \mathcal{U}_k) + \mathcal{Z}_w,$

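where $\hat{\mathcal{R}}_k$ is the reachable-set estimate at step $k$, $\mathcal{U}_k$ is the input zonotope, $\mathcal{Z}_w$ is the process-noise zonotope, and $\mathcal{M}_\Sigma$ is the matrix zonotope of models consistent with the data (written $\mathcal{M}_{AB}$ above) (Alanwar et al., 2020, Alanwar et al., 2021).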
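One step of this recursion can be sketched with zonotopes stored as a center vector and a generator matrix. The product of a matrix zonotope with a zonotope is over-approximated by collecting all center-generator and generator-generator products as new generators, and the Minkowski sum is a concatenation of generators. The helper names and all numerical sets below are hypothetical, and no order reduction of the growing generator matrix is performed.

```python
import numpy as np

def matzono_times_zono(Mc, Mgens, zc, zG):
    """Over-approximate {M z : M in matrix zonotope, z in zonotope} by a zonotope."""
    center = Mc @ zc
    cols = [Mc @ zG]
    for Gi in Mgens:
        cols.append((Gi @ zc).reshape(-1, 1))   # model generator times set center
        cols.append(Gi @ zG)                    # model generator times set generators
    return center, np.hstack(cols)

def cartesian(c1, G1, c2, G2):
    """Cartesian product of two zonotopes (block-diagonal generators)."""
    G = np.block([[G1, np.zeros((G1.shape[0], G2.shape[1]))],
                  [np.zeros((G2.shape[0], G1.shape[1])), G2]])
    return np.concatenate([c1, c2]), G

def reach_step(Mc, Mgens, rc, rG, uc, uG, wc, wG):
    """R_{k+1} = M (R_k x U_k) + Z_w, with the Minkowski sum done by concatenation."""
    xc, xG = cartesian(rc, rG, uc, uG)
    c, G = matzono_times_zono(Mc, Mgens, xc, xG)
    return c + wc, np.hstack([G, wG])

# Hypothetical model set, initial set, input set and noise set.
Mc = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 1.0]])
Mgens = [0.01 * np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])]
rc, rG = np.array([1.0, 0.0]), 0.1 * np.eye(2)
uc, uG = np.array([0.0]), np.array([[0.2]])
wc, wG = np.zeros(2), 0.01 * np.eye(2)
c1, G1 = reach_step(Mc, Mgens, rc, rG, uc, uG, wc, wG)
```

Because the number of generators grows at every step, practical implementations usually apply a zonotope order-reduction step between iterations; that step is omitted here.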

Piecewise Linearization and Taylor Models: For nonlinear systems, local linearization plus overapproximation of the remainder (using bounds based on data density and Lipschitz constants) yields reachability updates involving zonotopes for the model, nonlinearity error, and noise (Alanwar et al., 2020, Farjadnia et al., 2022).
Support Set Estimation (Christoffel/Conformal Functions): For general systems, the reachable set is given by a sublevel set of the empirical inverse Christoffel function:

$C(x) = z_k(x)^\top \hat{M}^{-1}z_k(x), \quad \hat{\mathcal{R}} = \{x : C(x) \leq \alpha \},$

where $z_k$ is a polynomial feature vector, $\hat{M}$ is the empirical moment matrix, and $\alpha$ is the maximum value of $C$ over the data points (Devonport et al., 2021, Devonport et al., 2021).
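The sublevel-set construction can be sketched as follows: build monomial features, form the empirical moment matrix, evaluate $C$ on the samples, and take the largest value as the threshold. The function names, the choice of degree 2, and the uniformly sampled "reachable states" are hypothetical, and the sketch does not compute the sample size needed for any particular probabilistic guarantee.

```python
import itertools
import numpy as np

def monomials(X, degree):
    """Feature map z_k: all monomials of total degree <= `degree` for each row of X."""
    X = np.atleast_2d(X)
    cols = [np.ones(X.shape[0])]
    for d in range(1, degree + 1):
        for idx in itertools.combinations_with_replacement(range(X.shape[1]), d):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def fit_christoffel(samples, degree):
    """Return (C, alpha): the empirical inverse Christoffel function and its max on the data."""
    Z = monomials(samples, degree)
    M_hat = Z.T @ Z / Z.shape[0]        # empirical moment matrix
    M_inv = np.linalg.inv(M_hat)

    def C(X):
        Zx = monomials(X, degree)
        return np.sum((Zx @ M_inv) * Zx, axis=1)   # z(x)^T M^{-1} z(x) per row

    return C, float(np.max(C(samples)))

# Hypothetical reachable-state samples drawn uniformly from the square [-1, 1]^2.
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(200, 2))
C, alpha = fit_christoffel(pts, degree=2)

def inside(x):
    """Membership test for the estimated reachable set {x : C(x) <= alpha}."""
    return bool(C(x)[0] <= alpha)
```

A query such as `inside(np.array([0.0, 0.0]))` then tests whether a state lies in the estimated set; raising the polynomial degree lets the sublevel set follow more complex geometry at the cost of a larger moment matrix.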

Scenario Optimization and Holdout Methods: Reach tubes and sets are constructed by solving volume-minimizing optimization subject to scenario constraints, evaluated a posteriori on holdout samples to get a binomial tail-based error bound (Dietrich et al., 9 Apr 2025).
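The a posteriori holdout evaluation can be illustrated with a binomial tail computation. Given the number of holdout samples that fall outside a candidate reachable set, the sketch below finds the largest violation probability still compatible with that count at confidence $1-\delta$ (a Clopper-Pearson-style upper bound). This is a generic construction assumed here for illustration, not necessarily the exact bound of the cited work, and the function names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))

def holdout_violation_bound(n_holdout, n_violations, delta, tol=1e-10):
    """Largest eps with P(Bin(n, eps) <= k) >= delta, found by bisection.

    With confidence 1 - delta, the true violation probability is at most eps.
    """
    if n_violations >= n_holdout:
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binom_cdf(n_violations, n_holdout, mid) >= delta:
            lo = mid        # mid is still plausible, so the bound is larger
        else:
            hi = mid
    return hi

# Hypothetical usage: 100 holdout trajectories, none leaving the computed set.
eps_bound = holdout_violation_bound(100, 0, delta=0.05)
```

With zero observed violations the bound reduces to the closed form $1 - \delta^{1/N}$, which shows directly how more holdout samples tighten the guarantee.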

Extensions for Complex Systems

Hybrid Zonotopes: For systems with both discrete (binary) and continuous modes, hybrid zonotopes encode both uncertainties, and set operations account for mode-specific model changes (Xie et al., 6 Apr 2025).
Operator-Theoretic Lifting: Koopman operator-based predictors learned from time series allow moment-propagation or lifted-space linear prediction under uncertainty, with subsequent reachability via set-based operations in the lifted space (Matavalam et al., 2020, Li et al., 23 Feb 2025).
Kernel Embeddings: In stochastic reachability, empirical estimation of transition kernels via RKHS embeddings reduces the computation of safety probabilities to a sequence of inner products, admitting finite-sample bounds and scalability via random Fourier features (Thorpe et al., 2020).
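As an illustration of the operator-theoretic lifting item above, the sketch below fits a finite-dimensional Koopman matrix by least squares on lifted snapshot pairs (the extended dynamic mode decomposition idea) and uses it for linear prediction in the lifted space. The dictionary $[x_1, x_2, x_1^2]$ and the toy nonlinear map are hypothetical and were picked so that the lifted dynamics are exactly linear; for general systems the lifted model is only an approximation and its error would have to be accounted for in the reachable set.

```python
import numpy as np

def lift(X):
    """Hypothetical dictionary: psi(x) = [x1, x2, x1^2] for each column of X."""
    return np.vstack([X[0], X[1], X[0] ** 2])

def edmd(X, Y):
    """Least-squares Koopman matrix K with lift(Y) ~= K @ lift(X)."""
    return lift(Y) @ np.linalg.pinv(lift(X))

# Hypothetical nonlinear map whose dynamics are exactly linear in the lifted space.
a, b, c = 0.9, 0.5, 0.2

def step(X):
    return np.vstack([a * X[0], b * X[1] + c * X[0] ** 2])

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2, 100))
K = edmd(X, step(X))

# Predict three steps ahead in the lifted space; the state is the first two lifted coordinates.
x0 = np.array([[0.5], [-0.2]])
pred = (np.linalg.matrix_power(K, 3) @ lift(x0))[:2]
```

Since the lifted model is linear, a zonotope expressed in lifted coordinates can be propagated by multiplying its center and generators by `K`, which is how set-based operations carry over to the lifted space.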

3. Formal Guarantees and Theoretical Underpinnings

Data-driven reachability frameworks are undergirded by:

PAC (Probably Approximately Correct) Guarantees: Discrepancy learning, Christoffel function-based set estimation, scenario optimization, and conformal inference all provide high-confidence, sample-efficient probabilistic guarantees of the form “with probability $1-\delta$, at least $1-\epsilon$ of future states are contained in the computed reachable set” (Fan et al., 2017, Devonport et al., 2021, Mejia et al., 2021, Dietrich et al., 9 Apr 2025, Hashemi et al., 20 May 2025).
Deterministic Overapproximations: When model sets are constructed as zonotopes or ellipsoids encompassing all consistent parameters under bounded noise, the propagated reachable set is always guaranteed to contain all possible states produced by any admissible sequence of models, inputs, and process noises in the uncertainty sets (Alanwar et al., 2020).
Compositional and Simulation-Based Reasoning: DRYVR utilizes sequential composition and forward simulation on transition graphs to extend finite-horizon or simplified system verification to arbitrarily long runs or more detailed systems, enabling verification of complex hybrid automata from building blocks (Fan et al., 2017).
Handling of Distribution Shift: Incorporation of robust conformal inference quantifies the impact of non-identically distributed deployment data, controlling the degradation of probabilistic guarantees under bounded total variation distances (Hashemi et al., 20 May 2025).
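Among the guarantees listed above, the conformal-inference mechanism is the simplest to sketch. A standard split-conformal step takes prediction residuals from calibration trajectories and returns the $\lceil (n+1)(1-\epsilon) \rceil$-th smallest one as an inflation radius; for exchangeable data, a new residual falls below that radius with probability at least $1-\epsilon$. This is the marginal form of the guarantee rather than the two-parameter PAC statement, it does not include the robust correction for distribution shift, and the function names and residual values are hypothetical.

```python
import math

def conformal_radius(calibration_scores, eps):
    """Split-conformal quantile: the ceil((n + 1)(1 - eps))-th smallest score.

    Returns math.inf when n is too small for the requested miscoverage level eps.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - eps))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def in_inflated_set(prediction, state, radius):
    """Membership in the Euclidean ball of the given radius around a nominal prediction."""
    return math.dist(prediction, state) <= radius

# Hypothetical calibration residuals ||x_true - x_pred|| from held-out trajectories.
scores = [0.01 * i for i in range(1, 20)]      # 19 residuals: 0.01, ..., 0.19
radius = conformal_radius(scores, eps=0.25)
```

The infinite radius returned for very small `eps` reflects the fact that a finite calibration set cannot certify arbitrarily high coverage.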

4. Algorithmic and Implementation Considerations

Sample Complexity: Achieving tight probabilistic bounds typically requires a number of simulation or experimental samples that scales inversely with the accuracy parameter $\epsilon$, logarithmically with $1/\delta$ (where $1-\delta$ is the confidence level), and with the complexity of the function class (e.g., polynomial degree or number of modes) (Devonport et al., 2021, Devonport et al., 2021).
Recursive and Online Algorithms: Zonotopic recursive least squares approaches update parameter zonotopes in real time, offering robustness to time-varying system matrices and providing less conservative reach sets compared to traditional batch identification (Akhormeh et al., 21 Sep 2025).
Computational Scalability: Matrix- and hybrid-zonotope propagations, as well as scenario optimization-based approaches, are compatible with high-dimensional systems, although computational effort grows with set complexity and time horizon (Alanwar et al., 2020, Alanwar et al., 2021, Xie et al., 6 Apr 2025). Hypercube inflation in the PCA-DDReach method is mitigated by error-space rotation via PCA, striking a balance between scalability and conservatism (Hashemi et al., 20 May 2025).
Integration with Decision-Making Systems: Data-driven reachable sets serve as constraints in data-enabled predictive control (including tube-based and robust MPC) and as formal safety layers for neural or language-model control policies (Hafez et al., 5 Mar 2025), and they can be extended modularly to incorporate temporal logic side information (Alanwar et al., 2021).

5. Benchmarks and Applications

Numerous frameworks have demonstrated effectiveness across diverse application domains:

Framework / Class	Application Domains	Notable Features/Benchmarks
DRYVR (Fan et al., 2017)	Automotive (powertrain, AEB, lane merge, ADAS)	Discrepancy PAC-learning; compositional/simulation reasoning
Zonotopic/Matrix Zonotope (Alanwar et al., 2020, Alanwar et al., 2021, Akhormeh et al., 21 Sep 2025)	Data-driven robust MPC, JetRacer, CSTR processes	Online recursive estimation; robust reach sets
Christoffel/Support Set (Devonport et al., 2021, Devonport et al., 2021)	Duffing oscillator, Quadrotor, Traffic networks	PAC and PAC-Bayes guarantees; complex geometry set encapsulation
Koopman+Reachability (Matavalam et al., 2020, Li et al., 23 Feb 2025)	Mixed vehicle platoons, nonlinear moment propagation	Operator-theoretic lifting, secondary zonotopic modeling
Scenario/Conformal (Dietrich et al., 9 Apr 2025, Hashemi et al., 20 May 2025)	Quadrotor, Powertrain, Nonlinear tubes	Binomial tail holdout bounds; PCA-based error space rotation
Hybrid Zonotopes (Xie et al., 6 Apr 2025)	Switching/hybrid systems (e.g., vehicle regions)	Piecewise affine overapproximation at region boundaries
Pedestrian Prediction (Söderlund et al., 2023, Fragkedaki et al., 9 Aug 2024)	Urban/crowd robot navigation	Behavior mode clustering; transformer-based embedding; HDBSCAN
SReachTools (Thorpe et al., 2020)	Stochastic, high-dimensional systems	RKHS kernel embedding, RFF scalability, neural net black-box control
LLM Safety (Hafez et al., 5 Mar 2025)	LLM-controlled robots (navigation, JetRacer)	Data-driven safety filter, zonotope model, formal safety layer

Experiments typically report improved tightness (less conservatism) and computational efficiency relative to naive all-sample or purely model-based approaches, with probabilistic guarantees validated against analytically known reachable sets or by extensive Monte Carlo validation.

6. Practical Implications and Limitations

Data-driven reachability frameworks broaden the scope of formal verification and constraint satisfaction to settings where first-principles models are poorly known, nonstationary, or intractable. This is critical in safety-critical systems (autonomous vehicles, robotic platforms, and complex industrial and power networks) as well as in software and program analysis (e.g., FlowCFL for mutable heap data (Milanova, 2020)). Notable practical implications include:

Agility in deployment: Real-time, recursive estimation adapts rapidly to changing environments or system drift (Akhormeh et al., 21 Sep 2025).
Scalability: Frameworks leveraging operator-theoretic lifting, support set estimation, or PCA-based error inflation are tractable in high-dimensional or long-horizon scenarios (Matavalam et al., 2020, Hashemi et al., 20 May 2025).
Safety under uncertainty: The interplay of scenario-based, PAC, and robust conformal inference delivers explicit, quantifiable safety margins—even under distribution shift (Dietrich et al., 9 Apr 2025, Hashemi et al., 20 May 2025).
Modularity: Integration with high-level reasoning (temporal logic constraints (Alanwar et al., 2021)), behavior clustering (Söderlund et al., 2023, Fragkedaki et al., 9 Aug 2024), and formal control policies is made possible by the modular decomposition of the frameworks.

Limitations include an inherent trade-off between conservatism and sample complexity for deterministic “hard” bounds: removing all statistical uncertainty from the guarantees entails an exponential increase in the required number of samples with respect to the state dimension (Dietrich et al., 9 Apr 2025). Computational costs can also grow with set complexity (particularly for hybrid or high-dimensional systems), and the reduction in error inflation depends on the error geometry (which determines how well PCA works), the richness of the basis or dictionary functions, and the adequacy of the sample coverage.

7. Future Directions

Research in data-driven reachability analysis is rapidly advancing, with active themes including:

Enhanced set representations for hybrid, stochastic, and nonlinear dynamics, with emphasis on reducing conservatism without loss of guarantee.
Robust methods to handle larger and more abrupt distribution shifts, possibly combining data-driven and physics-based knowledge (hybrid approaches).
Improved algorithms for recursive, real-time update and validation of reachable sets in nonstationary and learning-enabled systems.
Integration of data-driven reachability with reinforcement learning, neural network-controlled systems, LLM planning (e.g., LLM-driven robot safety), and compositional formal verification across system scales.
Adaptation of the framework to distributed/multimodal systems (e.g., networked CAVs or multi-robot architectures), and to complex cyber-physical networks with switching dynamics or adversarial agents.

These directions are driven both by technological imperatives in automation and autonomy and by the mathematical challenge of formalizing safety and control under structural, stochastic, and epistemic uncertainty, areas in which data-driven reachability frameworks now play a foundational role.