Neural Delay Differential Equations
- Neural Delay Differential Equations are data-driven models that extend neural ODEs by integrating current and delayed states to capture non-Markovian dynamics.
- The framework jointly learns the underlying dynamics and delay parameters via gradient-based optimization, enabling flexible system identification in complex time-delay systems.
- NDDEs have demonstrated strong predictive performance and universal approximation properties, making them a powerful tool in forecasting and inverse problem settings.
Neural Delay Differential Equations (NDDEs) are a class of data-driven models that extend neural ordinary differential equations (NODEs) by incorporating explicit dependence on delayed states in the governing dynamical system. Unlike NODEs, which model trajectories as solutions to ODEs dependent on the current state and possibly time, NDDEs introduce a vector field parameterized by a neural network that receives both present and past (delayed) states, enabling representation of non-Markovian dynamics intrinsic to delay differential equations. The NDDE framework is designed to simultaneously learn both the unknown dynamics and the delay parameters directly from data via gradient-based optimization, offering a differentiable, mesh-free, and flexible approach to system identification, forecasting, and inverse problems for a broad class of time-delay systems (Breda et al., 4 Dec 2025).
1. Mathematical Formulation
An NDDE approximates the right-hand side of a delay differential equation (DDE) with a neural network, and the delay parameters themselves are treated as learnable variables. The general autonomous form for a system with $k$ discrete delays reads:

$$x'(t) = f\bigl(x(t),\, x(t-\tau_1),\, \dots,\, x(t-\tau_k)\bigr), \qquad t \ge 0,$$

where $x(t) \in \mathbb{R}^d$ and $f : \mathbb{R}^{d(k+1)} \to \mathbb{R}^d$.
In the NDDE, this is modeled as:

$$x'(t) = \mathcal{N}_{\Theta}\bigl(x(t),\, x(t-\tau_1),\, \dots,\, x(t-\tau_k)\bigr),$$

with trainable delay scalars $\tau_i > 0$ ($i = 1, \dots, k$) and $\Theta$ collecting all neural network weights and biases. The initial condition is specified by a history function $\phi$ on $[-\max_i \tau_i,\, 0]$. All parameters, including the delays, are optimized jointly via backpropagation (Breda et al., 4 Dec 2025).
The NDDE can be extended to time-dependent, state-dependent, and multiple delays by making the delay functions themselves neural networks:

$$x'(t) = \mathcal{N}_{\Theta}\bigl(x(t),\, x(t-\tau_1(t, x(t))),\, \dots,\, x(t-\tau_k(t, x(t)))\bigr),$$

where each $\tau_i(\cdot)$ is parameterized by a neural network (Monsel et al., 2023).
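As a concrete instance of this general form, the delay logistic (Hutchinson) system revisited in Section 5 involves a single discrete delay and a scalar state; its standard form, written in the notation above for orientation (not quoted from the cited paper), is

$$x'(t) = r\,x(t)\Bigl(1 - \frac{x(t-\tau)}{K}\Bigr), \qquad f(u, v) = r\,u\Bigl(1 - \frac{v}{K}\Bigr),$$

so that $f$ receives the present state $u = x(t)$ and the delayed state $v = x(t-\tau)$.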
2. Neural Network Architectures
The typical NDDE architecture constructs a multilayer perceptron accepting the concatenated present and delayed states as input:

$$z_0 = \bigl[x(t);\, x(t-\tau_1);\, \dots;\, x(t-\tau_k)\bigr] \in \mathbb{R}^{d(k+1)}.$$

The layers are structured as

$$z_{\ell} = \sigma\bigl(W_{\ell} z_{\ell-1} + b_{\ell}\bigr), \quad \ell = 1, \dots, L-1, \qquad \mathcal{N}_{\Theta}(z_0) = W_L z_{L-1} + b_L,$$

with nonlinear activations $\sigma$ (e.g., tanh, ReLU), and the last layer is linear.
The delays are real parameters, generally initialized within $(0, \tau_{\max})$ and updated by gradient-based optimization (e.g., ADAM). Small networks (2–3 layers of 5–20 neurons each) suffice in many applications due to the strong inductive bias imposed by the delay structure. In state- or time-dependent NDDEs, separate MLPs parameterize each delay as a function of the current state and time (Breda et al., 4 Dec 2025, Monsel et al., 2023).
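A minimal sketch of such an architecture in PyTorch-style code (class name, argument names, and default sizes are illustrative assumptions, not taken from the cited works): a small tanh MLP over the concatenated states, with the delays stored as trainable parameters that are projected back into the admissible range after each update.

```python
import torch
import torch.nn as nn

class NDDEField(nn.Module):
    """Sketch of an NDDE vector field with k discrete, trainable delays."""

    def __init__(self, dim, k_delays=1, hidden=10, tau_max=2.0):
        super().__init__()
        # MLP over the concatenated present and delayed states; last layer is linear
        self.net = nn.Sequential(
            nn.Linear(dim * (k_delays + 1), hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )
        # delays initialized in (0, tau_max) and learned jointly with the weights
        self.tau = nn.Parameter(torch.rand(k_delays) * tau_max)
        self.tau_max = tau_max

    def forward(self, x_t, x_delayed):
        # x_t: (batch, dim); x_delayed: list of k tensors of shape (batch, dim)
        return self.net(torch.cat([x_t, *x_delayed], dim=-1))

    @torch.no_grad()
    def project_delays(self):
        # keep delays strictly positive and below tau_max after each optimizer step
        self.tau.clamp_(min=1e-3, max=self.tau_max)
```

Storing the delays as `nn.Parameter`s lets a single optimizer update them alongside the network weights, mirroring the joint optimization described above.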
For high-dimensional problems (e.g., images), the NDDE is integrated with convolutional blocks:

$$x'(t) = \mathrm{CNN}_{\Theta}\bigl(\,[\,x(t) \,\|\, x(t-\tau)\,]\,\bigr),$$

where $[\,\cdot\,\|\,\cdot\,]$ denotes channel-wise concatenation of the present and delayed feature maps (Zhu et al., 2023).
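A sketch of a convolutional variant under the same assumptions (illustrative channel counts, not the architecture of the cited work): the present and delayed feature maps are concatenated along the channel axis before entering the convolutional stack.

```python
import torch
import torch.nn as nn

class ConvNDDEField(nn.Module):
    """Sketch of a convolutional NDDE vector field acting on image-shaped states."""

    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, hidden, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, x_t, x_delayed):
        # channel-wise concatenation of the present and delayed feature maps
        return self.net(torch.cat([x_t, x_delayed], dim=1))
```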
3. Training Paradigms and Loss Functions
Training NDDEs requires defining loss functions that capture either the derivative field or the system’s simulated trajectory:
- Derivative-Matching Loss (no integration): Given measurements $\{(t_m, x(t_m), x'(t_m))\}_{m=1}^{M}$,

  $$\mathcal{L}(\Theta, \tau) = \frac{1}{M} \sum_{m=1}^{M} \bigl\|\, \mathcal{N}_{\Theta}\bigl(x(t_m),\, x(t_m - \tau_1),\, \dots,\, x(t_m - \tau_k)\bigr) - x'(t_m) \,\bigr\|^2,$$

  where observed values are substituted wherever possible (Breda et al., 4 Dec 2025); a code sketch of this loss follows this list.
- Simulation Loss (integration required): For trajectories computed over $M$ time steps,

  $$\mathcal{L}(\Theta, \tau) = \frac{1}{M} \sum_{m=1}^{M} \bigl\|\, \hat{x}(t_m) - x(t_m) \,\bigr\|^2,$$

  with $\hat{x}(t_m)$ obtained by integrating the learned NDDE from the observed initial history.
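A compact sketch of the derivative-matching loss, assuming the hypothetical `NDDEField` above and piecewise-linear interpolation of delayed states from the data grid (all names are illustrative):

```python
import torch

def interp_linear(ts, xs, t_query):
    """Piecewise-linear, differentiable interpolation of a trajectory xs sampled at times ts."""
    idx = torch.clamp(torch.searchsorted(ts, t_query) - 1, 0, len(ts) - 2)
    t0, t1 = ts[idx], ts[idx + 1]
    w = ((t_query - t0) / (t1 - t0)).unsqueeze(-1)
    return (1 - w) * xs[idx] + w * xs[idx + 1]

def derivative_matching_loss(field, ts, xs, t_batch, x_batch, dx_batch):
    """Mean-squared error between the network output and measured derivatives."""
    # the delays enter through the interpolation times, so gradients with respect
    # to field.tau flow through the interpolation
    x_delayed = [interp_linear(ts, xs, t_batch - tau) for tau in field.tau]
    pred = field(x_batch, x_delayed)
    return ((pred - dx_batch) ** 2).mean()
```

The simulation loss replaces the network output by the integrated trajectory $\hat{x}(t_m)$ and therefore requires backpropagating through a DDE solver, as discussed in Section 4.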
For the simulation loss, gradients are computed via delayed adjoint equations (Section 4). For the derivative-matching loss, the gradient with respect to a delay $\tau_i$ takes the form

$$\frac{\partial \mathcal{L}}{\partial \tau_i} = -\frac{2}{M} \sum_{m=1}^{M} \bigl(\mathcal{N}_{\Theta}(z_m) - x'(t_m)\bigr)^{\!\top}\, \frac{\partial \mathcal{N}_{\Theta}}{\partial x(t-\tau_i)}(z_m)\; x'(t_m - \tau_i),$$

where $z_m = \bigl(x(t_m), x(t_m - \tau_1), \dots, x(t_m - \tau_k)\bigr)$ and $x'(t_m - \tau_i)$ is obtained by linear interpolation on the history grid (Breda et al., 4 Dec 2025). Optimization proceeds via ADAM, and delays are clipped to enforce positivity and possible upper bounds.
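The minus sign and the appearance of the delayed derivative follow from the chain rule applied to the delayed argument (a standard calculus step, stated here for completeness):

$$\frac{\partial}{\partial \tau_i}\, x(t_m - \tau_i) \;=\; -\,x'(t_m - \tau_i).$$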
4. Integration, Adjoint Sensitivity, and Implementation
For simulation-based training, a standard DDE solver (e.g., MATLAB's dde23) or a specialized continuous-time integrator is employed, which interpolates historical states as needed. The adjoint sensitivity method extends the classic ODE approach to delayed arguments; for a single delay $\tau$, the adjoint $\lambda$ satisfies a DDE of advanced type,

$$\lambda'(t) = -\,\lambda(t)^{\top}\, \frac{\partial \mathcal{N}_{\Theta}}{\partial x(t)}\bigl(x(t), x(t-\tau)\bigr) \;-\; \lambda(t+\tau)^{\top}\, \frac{\partial \mathcal{N}_{\Theta}}{\partial x(t-\tau)}\bigl(x(t+\tau), x(t)\bigr)\,\mathbf{1}_{\{t \le T-\tau\}},$$

with appropriate terminal conditions on $\lambda$ at the final time $T$. Gradients with respect to both network weights and delays are computed by backpropagating through the DDE and the interpolated delayed arguments (Oprea et al., 2023, Monsel et al., 2023).
A practical high-level pseudocode for derivative-matching loss is:
```
Initialize Θ, delays {τ_i} ∈ (0, τ_max)
for q in 1...q_max:
    Sample minibatch of times {t_m}
    Interpolate x(t_m - τ_i) from data
    Predict x'(t_m) = net(x(t_m), x(t_m - τ_1), ..., x(t_m - τ_k))
    Compute loss
    Backprop to get gradients for Θ and τ_i
    Update Θ and τ_i with ADAM; project τ_i to [0, τ_max]
Return best Θ, τ
```
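A minimal runnable counterpart of this loop in PyTorch-style code, reusing the hypothetical `NDDEField`, `interp_linear`, and `derivative_matching_loss` sketches above (an illustration of the training scheme, not the authors' implementation); the default minibatch size and iteration count mirror the example in Section 5.

```python
import torch

def train_ndde(field, ts, xs, dxs, n_iters=8000, batch_size=64, lr=1e-3):
    """Fit an NDDE vector field to (state, derivative) data by derivative matching."""
    opt = torch.optim.Adam(field.parameters(), lr=lr)   # updates weights and delays jointly
    for _ in range(n_iters):
        idx = torch.randint(0, len(ts), (batch_size,))  # sample a minibatch of times
        loss = derivative_matching_loss(field, ts, xs, ts[idx], xs[idx], dxs[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        field.project_delays()                          # clip delays to the admissible range
    return field
```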
5. Illustrative Example: Delay Logistic System
The delay logistic equation (Hutchinson system) serves as a prototypical test case:

$$x'(t) = r\,x(t)\Bigl(1 - \frac{x(t-\tau)}{K}\Bigr).$$

For NDDE learning (a data-generation sketch for this system follows the outcome summary below):
- MLP with 2 hidden layers (10 units each, tanh activation)
- Derivative-matching loss, minibatch size 64, ADAM optimizer, 8000 iterations
- Training and test data taken from separate time windows of the trajectory
Outcomes:
- Learned delay within 1% of the true value
- Low test RMSE, including over a 5 s prediction horizon
- Fast convergence (<5s on a desktop) (Breda et al., 4 Dec 2025)
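For experiments of this kind, training data can be generated with a simple fixed-step Euler integrator for the delay logistic equation; the sketch below uses illustrative placeholder parameters (not those of the cited experiments) and returns states and numerically differentiated derivatives in the shapes expected by the hypothetical `train_ndde` loop above.

```python
import torch

def simulate_hutchinson(r=1.8, K=1.0, tau=1.0, x0=0.5, t_end=20.0, dt=0.01):
    """Generate a trajectory of x'(t) = r x(t) (1 - x(t - tau)/K) with constant history x0."""
    n = int(t_end / dt)
    lag = int(round(tau / dt))                       # delay expressed in grid steps
    ts = torch.arange(n + 1) * dt
    xs = torch.empty(n + 1)
    xs[0] = x0
    for i in range(n):
        x_delayed = x0 if i < lag else xs[i - lag]   # constant history before t = 0
        xs[i + 1] = xs[i] + dt * r * xs[i] * (1 - x_delayed / K)
    dxs = torch.gradient(xs, spacing=(ts,))[0]       # numerical derivatives for the loss
    return ts, xs.unsqueeze(-1), dxs.unsqueeze(-1)
```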
6. Theoretical Properties: Expressivity and Universal Approximation
Multiple works rigorously establish the universal approximation property for NDDEs: for any continuous map $F : \mathbb{R}^d \to \mathbb{R}^d$, there exists a neural vector field $\mathcal{N}_{\Theta}$ and a delay $\tau > 0$ such that integrating the NDDE from the constant initial history $\phi \equiv x$ over the appropriate interval yields (an approximation of) $F(x)$. The constructive proof exploits the constant-history regime and neural network approximation of $F$ (Zhu et al., 2021).
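A sketch of the constant-history construction behind this result (a paraphrase of the standard argument, not a verbatim statement from the cited work): with history $\phi \equiv x$ on $[-\tau, 0]$, the delayed argument remains frozen at $x$ for $t \in [0, \tau]$, so choosing the vector field to depend only on the delayed state gives

$$x'(t) = \frac{F\bigl(x(t-\tau)\bigr) - x(t-\tau)}{\tau} = \frac{F(x) - x}{\tau}, \quad t \in [0, \tau],
\qquad\Longrightarrow\qquad
x(\tau) = x + \tau\,\frac{F(x) - x}{\tau} = F(x),$$

and replacing the map $u \mapsto (F(u)-u)/\tau$ by a neural network approximation yields the statement up to arbitrary accuracy.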
Extension to multiple discrete or even piecewise-constant delays (as in Neural PCDDEs (Zhu et al., 2022)) preserves the universal approximation property and enables representing flow maps inaccessible to finite-dimensional ODEs; examples include systems with intersecting trajectories, chaotic attractors, and classification tasks where NODEs fail (Zhu et al., 2023, Breda et al., 4 Dec 2025).
7. Connections, Advantages, Limitations, and Future Directions
| Aspect | NDDE | SINDy with Delays |
|---|---|---|
| Dynamics learning | Jointly learns dynamics and delays via backprop | Requires basis library, external optimizers |
| Interpretability | Flexible but less interpretable | Sparse, interpretable algebraic form |
| Scaling | Gradient-based, many parameters | Manual basis design, may be costly |
| Data requirement | Needs sufficient training data | Basis richness and identifiability dependent |
NDDEs offer simultaneous identification of delays and nonlinear dynamics via gradient-based optimization, which can scale and adapt to complex, arbitrary nonlinearities without requiring manual library selection. Key advantages include flexibility, the ability to incorporate physically informed structures (e.g., network channels corresponding to known variables), and extensibility to time- and state-dependent delays (Breda et al., 4 Dec 2025, Monsel et al., 2023).
Limitations include potential trapping in local minima (e.g., learning numerical differentiation rather than true delayed dynamics if data are not sufficiently rich), reduced transparency relative to sparse regression methods, and sensitivity to the weighting between derivative-matching and simulation loss. Interpretability can be improved by imposing sparsity penalties or employing hybrid NDDE–SINDy architectures (Breda et al., 4 Dec 2025).
Future directions highlighted include:
- Hybridization with sparse regression for interpretability
- Systematic analysis of data richness versus identifiability
- Extension to distributed, variable, or state-dependent delays
- Refinement of training heuristics for robust delay recovery (Breda et al., 4 Dec 2025)
NDDEs constitute a central approach within the modern landscape of data-driven methods for time-delay systems, offering an end-to-end differentiable, highly expressive, and practically effective framework for modeling complex non-Markovian dynamical phenomena from data.