Dynamical Systems Formulation of Deep Networks

Updated 6 May 2026

The dynamical systems formulation is a mathematical framework that views deep networks as discrete or continuous evolutions governed by difference or differential equations.
It unifies layerwise propagation, training dynamics, and control principles to provide rigorous insights into approximation, stability, and generalization.
This approach leads to practical design metrics, linking network depth, activation choices, and integration schemes to robust, interpretable, and efficient learning.

The dynamical systems formulation of deep networks formalizes deep neural architectures as either discrete- or continuous-time dynamical systems, providing a mathematical framework for analyzing properties such as approximation, generalization, stability, memory, and optimization from the perspective of dynamical systems and control theory. This approach captures both the layerwise propagation of activations and the training of parameters as evolutions in high-dimensional state spaces, governed by difference or differential equations. Such a viewpoint unites disparate strands of deep learning theory—approximation, complexity, optimization, and implicit regularization—under a set of common dynamical and geometric principles.

1. Dynamical Systems Representations of Deep Networks

Deep networks, both vanilla (feedforward) and residual architectures, can be cast as discrete-time non-autonomous dynamical systems. In a feedforward network of depth $L$, the layerwise activations $x_k \in \mathbb{R}^d$ evolve according to

$x_{k+1} = f_k(x_k; \theta_k),$

where $f_k$ is the layer function (typically an affine transformation followed by a nonlinearity), and $\theta_k$ are learnable parameters. When the same map and parameters are shared across layers ($f_k = f$, $\theta_k = \theta$), this relation defines an autonomous discrete map; depth-dependent parameters result in a non-autonomous system (Chemnitz et al., 7 Jul 2025, Duan et al., 2022).

Residual networks (ResNets) implement the forward Euler discretization of a continuous-time ODE, with step size $h$ over a time horizon $T$:

$x_{k+1} = x_k + h f(x_k; \theta_k), \quad h = T/L.$

Passing to the continuous limit ($h \to 0$, $L \to \infty$) yields

$\frac{dx(t)}{dt} = f(x(t), \theta(t)), \quad x(0) = x_0,$

where $\theta(t)$ is a time-continuous control, and the resulting flow map $\varphi_T : x_0 \mapsto x(T)$ defines the architecture as a controlled dynamical system (Li et al., 2019, Chemnitz et al., 7 Jul 2025, Liu et al., 9 Oct 2025).
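The correspondence between residual layers and Euler steps can be illustrated with a minimal sketch. The scalar layer function `f`, the control $\theta(t) = \cos t$, and the helper name `residual_forward` are assumptions chosen only because the resulting ODE has a closed-form solution; none of them come from the cited works.

```python
import math

def residual_forward(x0, thetas, h, f):
    """Apply the residual update x_{k+1} = x_k + h * f(x_k, theta_k) layer by layer."""
    x = x0
    for theta in thetas:
        x = x + h * f(x, theta)
    return x

def f(x, theta):
    # Hypothetical scalar layer function: linear in the state, scaled by the control.
    return theta * x

T, L = 1.0, 1000
h = T / L

# Depth-dependent parameters theta_k = theta(k * h) sample the control theta(t) = cos(t),
# so the discrete system is non-autonomous.
thetas = [math.cos(k * h) for k in range(L)]

x_L = residual_forward(1.0, thetas, h, f)   # output of the L-layer residual network
x_T = math.exp(math.sin(T))                 # exact solution of dx/dt = cos(t) * x, x(0) = 1

print(f"ResNet output x_L = {x_L:.6f}")
print(f"ODE flow map x(T) = {x_T:.6f}")
```

With many thin layers the network output approaches the value of the continuous flow map at time $T$; the remaining gap shrinks as $L$ grows.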

Even vanilla feedforward networks (with depth $L$ and width equal to the data dimension) can in principle arise as discretizations of continuous flows, through careful application of splitting methods and composition of leaky-ReLU layers. This restores a homeomorphic correspondence between the network function and the (possibly nonlinear) flow map of an ODE (Duan et al., 2022).

2. Approximation Theory of Deep Dynamical Networks

The dynamical systems paradigm yields rigorous universal approximation theorems for deep (continuous- or discrete-time) networks viewed as flow maps generated by parameterized vector fields. For continuous-time ResNets, sufficient conditions for $L^p$-universal approximation include: a Lipschitz control family $\mathcal{F}$ with restricted affine invariance, and the closure of $\mathrm{CH}(\mathcal{F})$, the convex hull of $\mathcal{F}$, containing a "well-function" supporting flexible rearrangement. If these hold, the set of flow maps $\varphi_T : x_0 \mapsto x(T)$ (with a terminal affine layer $g$) can approximate any continuous function $F$ on a compact set $K$ to arbitrary accuracy $\varepsilon > 0$ (Li et al., 2019):

$\| F - g \circ \varphi_T \|_{L^p(K)} \le \varepsilon,$

with $p \in [1, \infty)$ and suitable $T > 0$.

Practical architectures satisfying these criteria include:

Fully connected/conv ResNets with ReLU, sigmoid, or tanh activation (affine invariance), where well-functions can be constructed explicitly (e.g., with combinations of ReLU or sigmoid).
Residual blocks of arbitrary depth and standard nonlinearities (universality is robust under block design).
1D monotone increasing target functions can be captured exactly under an additional monotonicity constraint.

Approximation rates in the scalar strictly-increasing case scale with the time horizon $T$. For $T \ge \| \ln F' \|_{\mathrm{TV}}$ (total variation of the log-derivative of the target function $F$), exact realization by flow maps is possible; otherwise, sup-norm error decays exponentially in $T$ (Li et al., 2019).

Finite-depth (discrete) ResNets necessarily incur discretization error controlled by the integration mesh:

$\| x_L - x(T) \| \le C h = C \, T / L,$

where the constant $C$ depends on the vector field and on $T$ but not on $L$, so matching a fixed approximation accuracy $\varepsilon$ requires $L = O(1/\varepsilon)$ layers.
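The first-order dependence of the error on the step size $h = T/L$ can be checked numerically. The sketch below assumes the toy vector field $f(x) = -x$ (chosen here only because its flow map $x_0 e^{-T}$ is known exactly) and a hypothetical helper `euler_resnet_output`; doubling the depth should roughly halve the error.

```python
import math

def euler_resnet_output(L, T=1.0, x0=1.0):
    """Output x_L of an L-layer residual net x_{k+1} = x_k + h * f(x_k) with f(x) = -x."""
    h = T / L
    x = x0
    for _ in range(L):
        x = x + h * (-x)
    return x

exact = math.exp(-1.0)  # flow map of dx/dt = -x evaluated at T = 1, x0 = 1

# Discretization error for a sequence of doubling depths.
errors = {L: abs(euler_resnet_output(L) - exact) for L in (10, 20, 40, 80)}

for L, err in errors.items():
    print(f"L = {L:3d}   error = {err:.6f}")

# For a first-order scheme this ratio should be close to 2.
ratio = errors[10] / errors[20]
print(f"error(L=10) / error(L=20) = {ratio:.3f}")
```

Under this first-order behavior, reducing the error by a factor of ten costs roughly ten times as many layers, which is the practical meaning of the depth requirement stated above.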

3. Generalization, Depth Behavior, and Sample Complexity

From the dynamical systems perspective, the generalization error of deep (residual) networks admits a sharp, uniform-in-depth analysis via flow maps and Rademacher complexity. In both discrete and continuous-time regimes, with bounded inputs and parameters and a locally Lipschitz loss, the generalization gap scales as $O(1/\sqrt{n})$ with the number of samples $n$, and is independent of depth in the deep-layer limit (Huang et al., 24 Feb 2026). Schematically, with a constant $C$ that does not depend on the depth $L$:

$\text{generalization gap} \;\le\; \frac{C}{\sqrt{n}} - (\text{structure-dependent term}).$

An additional negative structure-dependent term provides further contraction, reflecting additional regularization beyond standard Lipschitz control.

The discrete-to-continuous limit can be controlled precisely; the discrete forward map $x_0 \mapsto x_L$ converges to the continuous flow map $x_0 \mapsto x(T)$ as $L \to \infty$. Hence, generalization remains asymptotically stable ("depth-stable"), and this theory closes the gap between discrete and continuous sample complexities (Huang et al., 24 Feb 2026).

Empirical and spectral arguments suggest that concentrating the spectrum of local operators (forward or backward Jacobians) near unit modulus (the "edge of chaos") favors generalization and expressivity while avoiding collapse (Zhang, 2023, Drgona et al., 2020).
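One way to probe whether layer Jacobians contract or expand perturbations is to estimate a finite-time Lyapunov exponent along the depth direction. The following is a rough sketch under assumptions that are not taken from the cited works: untrained tanh layers with i.i.d. Gaussian weights of standard deviation `gain / sqrt(width)`, no biases, and a hypothetical helper `ftle_tanh_network`. A negative estimate indicates that perturbations shrink with depth, a positive one that they grow.

```python
import math
import random

def ftle_tanh_network(gain, width=16, depth=200, seed=0):
    """Estimate the largest finite-time Lyapunov exponent of x_{k+1} = tanh(W_k x_k).

    A unit tangent vector v is pushed through each layer Jacobian and renormalized;
    the mean log growth per layer is returned.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    v = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    scale = gain / math.sqrt(width)
    log_growth = 0.0
    for _ in range(depth):
        W = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        pre = [sum(w * xj for w, xj in zip(row, x)) for row in W]   # W x
        Wv = [sum(w * vj for w, vj in zip(row, v)) for row in W]    # W v
        x = [math.tanh(p) for p in pre]
        # The layer Jacobian is diag(1 - tanh(pre)^2) @ W, applied here to v.
        v = [(1.0 - xi * xi) * c for xi, c in zip(x, Wv)]
        norm = math.sqrt(sum(c * c for c in v))
        log_growth += math.log(norm)
        v = [c / norm for c in v]
    return log_growth / depth

exponents = {gain: ftle_tanh_network(gain) for gain in (0.5, 1.0, 3.0)}
for gain, lam in exponents.items():
    print(f"gain = {gain:.1f}   estimated exponent per layer = {lam:+.3f}")
```

In this toy setting the weight gain acts as the tuning knob: small gains give clearly negative exponents, and the estimate increases as the gain grows. The regime where it crosses zero is the one the text refers to as the edge of chaos.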

4. Training Dynamics and Optimal Control Formulations

The evolution of network parameters during training can itself be cast as a discrete- or continuous-time dynamical system. Stochastic gradient descent (SGD) corresponds to a discretized (possibly stochastic) gradient flow:

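$\theta^{(j+1)} = \theta^{(j)} - \eta \, \nabla_\theta \mathcal{L}(\theta^{(j)}),$

where $\mathcal{L}$ is the empirical loss, $\eta$ is the step size, and $j$ indexes training iterations. In the continuous-time (small-step) limit, with $s$ denoting training time, this approaches

$\frac{d\theta(s)}{ds} = -\nabla_\theta \mathcal{L}(\theta(s)).$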

Stochasticity introduces Langevin noise, yielding SDEs that drive parameter distributions toward minimizers of a free-energy functional (Liu et al., 2019).
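The deterministic part of this picture, gradient descent as a forward Euler discretization of gradient flow, can be checked on a toy problem. The sketch below assumes a quadratic loss $\tfrac{1}{2} a \theta^2$ with curvature `a = 2.0`, for which the gradient flow has the closed form $\theta(s) = \theta(0) e^{-a s}$; the loss, the constants, and the helper name are illustrative choices, and no noise term is simulated.

```python
import math

def gradient_descent(theta0, grad, eta, steps):
    """Plain gradient descent: theta <- theta - eta * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

a = 2.0                                 # curvature of the toy loss 0.5 * a * theta^2

def grad(theta):
    return a * theta                    # gradient of the toy loss

theta0, s_final = 1.0, 1.0
flow_value = theta0 * math.exp(-a * s_final)   # exact gradient-flow solution at s_final

gaps = {}
for eta in (0.1, 0.01, 0.001):
    steps = round(s_final / eta)        # same total training time for every step size
    gaps[eta] = abs(gradient_descent(theta0, grad, eta, steps) - flow_value)
    print(f"eta = {eta:<6} steps = {steps:<5} |GD - flow| = {gaps[eta]:.6f}")
```

Holding the total training time `eta * steps` fixed, the gap between the discrete iterate and the continuous trajectory shrinks roughly in proportion to the step size.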

Training deep networks can be posed as an optimal control problem by viewing layer parameters as "controls" that steer state trajectories over a finite time horizon to minimize a terminal cost (supervised loss plus regularization). Pontryagin's minimum principle then yields forward (state) ODEs and backward (adjoint) ODEs. The shooting method, which recasts learning as the choice of initial conditions for this forward-backward system, enables particle-ensemble control parameterizations that substantially reduce the number of free parameters for continuous-depth networks (Vialard et al., 2020, Furuhata et al., 2020).
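The forward-state and backward-adjoint structure can be illustrated on a discretized toy problem. The sketch below is a discrete analogue rather than the method of the cited works: it assumes a scalar residual system $x_{k+1} = x_k + h \tanh(\theta_k x_k)$, a terminal cost $\tfrac{1}{2}(x_L - y)^2$ without regularization, and hypothetical helper names. It computes the gradient with respect to every control via a backward adjoint recursion and compares it with central finite differences.

```python
import math

def f(x, theta):
    return math.tanh(theta * x)          # hypothetical scalar layer function

def forward(x0, thetas, h):
    """Forward (state) pass: returns the whole trajectory x_0, ..., x_L."""
    xs = [x0]
    for theta in thetas:
        xs.append(xs[-1] + h * f(xs[-1], theta))
    return xs

def terminal_cost(x_final, y):
    return 0.5 * (x_final - y) ** 2

def adjoint_gradient(x0, thetas, h, y):
    """Backward (adjoint) pass: p_k = dJ/dx_k, propagated from the terminal condition."""
    xs = forward(x0, thetas, h)
    p = xs[-1] - y                        # terminal condition p_L = dJ/dx_L
    grads = [0.0] * len(thetas)
    for k in reversed(range(len(thetas))):
        s = 1.0 - math.tanh(thetas[k] * xs[k]) ** 2
        grads[k] = p * h * xs[k] * s      # dJ/dtheta_k = p_{k+1} * h * df/dtheta
        p = p * (1.0 + h * thetas[k] * s) # p_k = p_{k+1} * (1 + h * df/dx)
    return grads

def finite_difference_gradient(x0, thetas, h, y, eps=1e-6):
    grads = []
    for k in range(len(thetas)):
        up = thetas[:k] + [thetas[k] + eps] + thetas[k + 1:]
        down = thetas[:k] + [thetas[k] - eps] + thetas[k + 1:]
        j_up = terminal_cost(forward(x0, up, h)[-1], y)
        j_down = terminal_cost(forward(x0, down, h)[-1], y)
        grads.append((j_up - j_down) / (2 * eps))
    return grads

L, T = 8, 1.0
h = T / L
thetas = [0.5 + 0.1 * k for k in range(L)]
g_adj = adjoint_gradient(0.7, thetas, h, y=1.0)
g_fd = finite_difference_gradient(0.7, thetas, h, y=1.0)
max_diff = max(abs(a - b) for a, b in zip(g_adj, g_fd))
print(f"max |adjoint - finite difference| = {max_diff:.2e}")
```

The adjoint pass needs one forward and one backward sweep regardless of the number of controls, whereas finite differences need two forward sweeps per control. This is the same economy that backpropagation exploits.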

Geometric and hydrodynamics approaches equate the necessary conditions of stochastic optimal control to mean-field PDEs (Fokker-Planck, Hamilton-Jacobi-Bellman) or even Lie–Poisson quantum Euler systems, yielding a rich set of structure-preserving numerical schemes (Ganaba, 2021, Ganaba, 2022).

5. Geometric, Thermodynamic, and Operator-Theoretic Extensions

The dynamical-systems view extends to the geometry of parameter space, notably for deep linear networks, where the overparameterized structure defines a fiber bundle with Riemannian metric and associated entropy. The training flow then projects to a Riemannian gradient flow for the end-to-end map, and stochastic gradient descent yields an SDE with implicit free energy regularization, biasing solutions toward high-entropy fibers in parameter space (Menon, 2024).

Operator perspectives, such as the Mori-Zwanzig formalism, recast deep networks as propagators on observables via Koopman or Frobenius-Perron operators. Memory effects of deep architectures can be captured by Generalized Langevin equations, and sufficient contraction yields exponential decay of memory, making rigorous the conversion of deep to shallow architectures or wide to thin ones by projection (Venturi et al., 2022).

6. Stability, Dissipativity, and Robust Design

Characterizing stability for networks seen as dynamical systems exploits classical dissipativity and contraction arguments. For discrete-time architectures, the pointwise affine form allows relating operator norms and, via the Banach fixed-point theorem, guarantees local asymptotic stability when the spectral radius of the linearized operator is less than one. Activations with bounded Lipschitz constant (e.g., $\mathrm{Lip}(\sigma) \le 1$) and appropriately normalized weights ensure global dissipation and preclude exploding or vanishing states (Drgona et al., 2020).
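A contraction argument of this kind can be sketched in a few lines. The example assumes a weight-tied layer $x \mapsto \tanh(Wx + b)$ with a hand-picked $2 \times 2$ matrix whose induced sup-norm (maximum absolute row sum) is $0.8$. This is a stronger, global condition than the spectral-radius criterion for local stability; it is used here because it gives an explicit per-layer bound, since tanh is 1-Lipschitz. The matrix, bias, and starting points are illustrative values only.

```python
import math

W = [[0.5, -0.3],
     [0.2,  0.6]]            # each row has absolute sum 0.8
b = [0.1, -0.2]

def layer(x):
    """One weight-tied layer x -> tanh(W x + b)."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def sup_distance(x, y):
    return max(abs(a - c) for a, c in zip(x, y))

# Induced sup-norm of W; it bounds the per-layer contraction factor.
contraction_rate = max(sum(abs(w) for w in row) for row in W)

x, y = [2.0, -1.0], [-1.5, 3.0]          # two different inputs
distances = [sup_distance(x, y)]
for _ in range(30):
    x, y = layer(x), layer(y)
    distances.append(sup_distance(x, y))

print(f"contraction rate bound   = {contraction_rate:.2f}")
print(f"initial distance         = {distances[0]:.4f}")
print(f"distance after 30 layers = {distances[-1]:.6f}")
```

Because every layer shrinks sup-norm distances by at least the factor `contraction_rate`, all trajectories converge to the same fixed point, as guaranteed by the Banach fixed-point theorem. The price is that input information is progressively forgotten, which is why strongly contractive designs trade expressivity for stability.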

In the stochastic/diffusive regime, multisymplectic and energy–Casimir techniques yield nonlinear stability conditions, which impose explicit constraints on permissible network depth, width, and node density in order to avoid discretization or model error blowup during training (Ganaba, 2022). Control-theoretic online adaptation similarly utilizes Lyapunov and super-twisting theory to design last-layer update laws with explicit boundedness and convergence guarantees, and spectral normalization of weights translates to uniform bounds on adaptation error (Elkins et al., 2024).

7. Implications for Architecture, Expressivity, and Future Directions

The dynamical systems framework informs the systematic design of deep architectures:

Layerwise propagation corresponds to the numerical integration of a differential (or integral-delay, or stochastic) system, suggesting distinct benefits of ODE, SDE, or delay-embedded architectures.
Expressivity and universal approximation are linked to phase-space dimension, memory capacity (via delay or augmentation), and the choice of vector field parameterizations and terminal maps (Liu et al., 9 Oct 2025, Chemnitz et al., 7 Jul 2025).
Depth, width, activation, and normalization critically control ergodicity and the finite-time Lyapunov exponent (FTLE) spectrum, offering quantitative metrics for tuning networks at the edge of chaos for optimal learning (Zhang, 2023).
Stochastic and geometric frameworks open new routes to network reduction, uncertainty quantification, and structure-preserving learning, relevant for both interpretability and safety-critical domains.

Open directions include systematic exploitation of geometric invariants, thermodynamic entropy, operator-theoretic modeling of memory and irreversibility, and integration with control-theoretic online methods for robust adaptation under domain shift.

References:

(Li et al., 2019, Huang et al., 24 Feb 2026, Liu et al., 2019, Chemnitz et al., 7 Jul 2025, Liu et al., 9 Oct 2025, Duan et al., 2022, Drgona et al., 2020, Menon, 2024, Furuhata et al., 2020, Ganaba, 2021, Ganaba, 2022, Zhang, 2023, Venturi et al., 2022, Hauser et al., 2018, Elkins et al., 2024, Vialard et al., 2020, Spivak et al., 2021)