- The paper establishes a universal approximation theorem for DQN architectures by demonstrating that a residual network can approximate the optimal Q-function through iterative Bellman updates.
- It leverages the fixed-point property of the Bellman operator and BSDE analysis to establish uniform Lipschitz continuity and boundedness across the Bellman iterates and the optimal Q-function.
- Practical insights include designing neural operator blocks within residual networks and addressing high-dimensional challenges like the curse of dimensionality in MDP discretization.
This paper, "Universal Approximation Theorem for Deep Q-Learning via FBSDE System" (2505.06023), addresses a fundamental question in Deep Reinforcement Learning (DRL): why can Deep Q-Networks (DQNs) effectively approximate the optimal action-value function Q∗? Unlike general Universal Approximation Theorems (UATs) that state neural networks can approximate any continuous function, this work provides a UAT specifically tailored to DQNs by leveraging the inherent structure of Q∗ as the fixed point of the Bellman optimality operator.
The core idea is to design and analyze a DQN architecture that mimics the value iteration process Q^(k+1) = B Q^(k), where B is the Bellman operator. The paper considers a continuous-time Markov Decision Process (MDP) formulation discretized over small time intervals δ. The optimal Q-function Q∗(t,s,a) for this δ-discretized problem is the unique fixed point of the Bellman operator defined on the compact domain K_Q = [0,T] × S × A.
The proposed DQN architecture takes a residual network form operating on a representation of the Q-function. If Q̂^(l) is the network's estimate of the Q-function after layer l, the next estimate is Q̂^(l+1) = Q̂^(l) + Δ^(l)(Q̂^(l)), where Δ^(l) is the function computed by layer l. The goal is for Δ^(l)(Q̂^(l)) to approximate J(Q̂^(l)) = B Q̂^(l) − Q̂^(l). Thus, each layer is designed to approximate the residual difference between the current Q-function estimate and the function resulting from one Bellman update. This means Q̂^(l+1) aims to approximate B Q̂^(l).
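To make the residual-layer view concrete, here is a minimal sketch in which each layer applies the exact residual J(Q) = BQ − Q to a tabular Q-function. The toy finite MDP, discount factor, and names (P, R, gamma) are illustrative assumptions only; the paper works with a δ-discretized continuous-time MDP and learned neural operator blocks rather than the exact residual.

```python
# Minimal sketch of the residual view of value iteration (illustrative only).
# A toy finite MDP stands in for the paper's delta-discretized continuous-time MDP;
# P, R, gamma and the tabular representation are assumptions for illustration.
import numpy as np

n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=(n_states, n_actions))                       # R[s, a]

def bellman_residual(Q):
    """J(Q) = BQ - Q, the target each residual layer Delta^(l) is meant to approximate."""
    BQ = R + gamma * np.einsum("saz,z->sa", P, Q.max(axis=1))
    return BQ - Q

# An L-layer "network" whose l-th layer adds an (here exact) residual block:
# Q_hat^(l+1) = Q_hat^(l) + Delta^(l)(Q_hat^(l)), i.e. one Bellman update per layer.
Q_hat = np.zeros((n_states, n_actions))
for layer in range(60):
    Q_hat = Q_hat + bellman_residual(Q_hat)  # a learned neural operator block in the paper's setting

print("max Bellman residual after 60 layers:", np.abs(bellman_residual(Q_hat)).max())
```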
A central technical contribution is the proof that, under standard Lipschitz continuity and boundedness assumptions on the MDP's dynamics (h, σ), rewards (r), and terminal conditions (g) (Assumption 2.1), the exact Bellman iterates Q^(k) and the optimal function Q∗ are uniformly Lipschitz continuous and uniformly bounded on the compact domain K_Q. This property is crucial because it guarantees that the sequence of functions {Q^(k)}_{k≥0} ∪ {Q∗} lies within a compact subset of the space of continuous functions C(K_Q). The analysis of the regularity of a single Bellman step is informed by connections to Backward Stochastic Differential Equations (BSDEs), which provide tools to establish Lipschitz continuity of the value function with respect to its arguments under appropriate conditions.
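The uniform regularity statement can be paraphrased as follows (a paraphrase; the constant names M_Q and L_Q are generic placeholders, not necessarily the paper's notation):

```latex
\exists\, M_Q,\, L_Q < \infty:\quad
\sup_{k \ge 0}\ \|Q^{(k)}\|_{\infty,\,K_Q} \le M_Q,
\qquad
|Q^{(k)}(t,s,a) - Q^{(k)}(t',s',a')|
\;\le\; L_Q\,\big(|t-t'| + \|s-s'\| + \|a-a'\|\big),
```

with the same bounds holding for Q∗. Uniform boundedness plus a uniform Lipschitz constant is exactly what (via the Arzelà–Ascoli theorem) places the whole family in a compact subset of C(K_Q).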
The paper then introduces the concept of a "neural operator" as a function G̃(Q) = D_M(N(E_M(Q))) that maps an input function Q (represented by its values E_M(Q) on a finite grid D_M) to an output function (reconstructed by D_M from the output of a neural network N). The UAT established relies on the assumption (Assumption 4.1) that for any compact set of functions K consisting of uniformly Lipschitz functions, and any continuous operator G mapping K to C(K_Q) such that G(K) is also uniformly Lipschitz, there exists such a neural operator G̃ that can approximate G arbitrarily well in the supremum norm on K. Crucially, this assumption also requires that the approximating neural operator G̃(Q) itself, as a function on K_Q, is Lipschitz continuous with a controllable uniform Lipschitz constant L_F∗.
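A minimal sketch of the encode-process-decode composition G̃ = D_M ∘ N ∘ E_M, here on a 1-D domain with an untrained MLP standing in for N and piecewise-linear interpolation standing in for D_M; all of these choices are illustrative assumptions, not the paper's construction.

```python
# Sketch of the encode-process-decode structure of a neural operator G~ (illustration only).
import numpy as np

M = 32
grid = np.linspace(0.0, 1.0, M)        # finite discretization of the input domain

def encode(Q):                         # E_M: point-sample the input function on the grid
    return Q(grid)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, M)) / np.sqrt(M), np.zeros(64)
W2, b2 = rng.normal(size=(M, 64)) / np.sqrt(64), np.zeros(M)

def N(v):                              # a small (untrained) MLP acting on the vector of grid values
    return W2 @ np.tanh(W1 @ v + b1) + b2

def decode(values):                    # D_M: rebuild a function by piecewise-linear interpolation
    return lambda x: np.interp(x, grid, values)

def G_tilde(Q):                        # G~(Q) = D_M(N(E_M(Q))): a function in, a function out
    return decode(N(encode(Q)))

out = G_tilde(np.sin)                  # apply the operator to the input function Q(x) = sin(x)
print(out(0.25), out(0.5))
```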
The main theorem (Theorem 4.3) states that for any desired accuracy ϵ > 0, there exists a number of layers L and a suitable neural operator architecture (defined by a discretization D_M and neural networks N_l for each layer l) such that the final output of the L-layer DQN, Q̂^(L), approximates Q∗ with ‖Q̂^(L) − Q∗‖∞ < ϵ.
The proof combines the convergence of value iteration (Q^(L) → Q∗) with the approximation power and stability of the neural operators. The total error is decomposed into the value iteration truncation error ‖Q^(L) − Q∗‖∞ and the neural network approximation error ‖Q̂^(L) − Q^(L)‖∞.
- The value iteration error decreases exponentially with the number of layers L due to the Bellman operator's contractivity: ‖Q^(L) − Q∗‖∞ ≤ (e^(−λδ))^L ‖Q^(0) − Q∗‖∞. L is chosen large enough to make this error small.
- The network approximation error is the accumulated error from each layer. The error propagation is analyzed via e_{l+1} = ‖Q̂^(l+1) − Q^(l+1)‖∞ ≤ e^(−λδ) e_l + δ_l, where δ_l is the per-layer operator approximation error. If each layer's operator block approximates the target residual operator J with accuracy ϵ_1 on its input function Q̂^(l), the total error e_L is bounded by approximately ϵ_1/(1 − e^(−λδ)).
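Unrolling the per-layer recursion makes the accumulation bound explicit; a sketch of the standard geometric-series argument, writing γ := e^(−λδ) < 1 and assuming each per-layer error satisfies δ_l ≤ ϵ_1:

```latex
e_L \;\le\; \gamma\, e_{L-1} + \epsilon_1
    \;\le\; \gamma^{L} e_0 + \epsilon_1 \sum_{j=0}^{L-1} \gamma^{j}
    \;\le\; \gamma^{L} e_0 + \frac{\epsilon_1}{1-\gamma},
\qquad
\|\hat{Q}^{(L)} - Q^{*}\|_\infty
    \;\le\; \underbrace{\gamma^{L}\,\|Q^{(0)} - Q^{*}\|_\infty}_{\text{value iteration error}}
      \;+\; \underbrace{e_L}_{\text{network approximation error}}.
```

Choosing L so the first term is below ϵ/2 and taking ϵ_1 ≤ (1 − γ)ϵ/2 (with e_0 = 0 if the network's initial estimate matches Q^(0)) then yields the overall ϵ accuracy.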
The ability to bound e_L by controlling ϵ_1 requires applying the neural operator UAT (Lemma 4.2) at each layer. This is possible because the set of input functions to each layer, {Q̂^(l)}_{l=0}^{L−1}, is shown to reside within a common compact set of uniformly bounded and uniformly Lipschitz functions (the set K_target). The uniform Lipschitz property of Q̂^(l) is maintained because the function implemented by each neural operator block has a controlled Lipschitz constant L_F∗, preventing the Lipschitz constant from growing unboundedly across layers, despite the (2K_B + 1) factor in the recurrence for Lipschitz constants L_{Q̂^(l+1)} ≤ (2K_A + L_F∗) + (2K_B + 1) L_{Q̂^(l)}.
Practical Implications and Implementation:
- Architecture Design: The paper suggests implementing a DQN as a deep residual network where each block is a neural operator aiming to learn the mapping Q↦BQ−Q. This is a specific structural inductive bias derived from the problem's Bellman structure.
- Function Representation: The Q-function Q(t,s,a) is defined on a domain of dimension d_Q = 1 + n + m (time plus state and action dimensions). The network operates on a discretized representation, so an implementation requires choosing (a sketch follows this list):
- A discretization grid D_M for K_Q. This could be a uniform grid, but d_Q can be large, leading to the curse of dimensionality (CoD) for the grid size M ∼ (1/mesh size)^(d_Q).
- An encoding E_M (e.g., simple point sampling).
- A decoding D_M (e.g., multilinear interpolation or a learned decoder).
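As referenced above, here is a minimal sketch of a uniform-grid representation of Q(t,s,a) with point-sampling encoding and multilinear-interpolation decoding; the grid sizes, one-dimensional state and action spaces, and use of scipy are assumptions made purely for illustration.

```python
# Sketch of a uniform-grid representation of Q(t, s, a) on K_Q = [0,T] x S x A (illustrative only).
import numpy as np
from scipy.interpolate import RegularGridInterpolator

T = 1.0
t_grid = np.linspace(0.0, T, 9)       # time axis
s_grid = np.linspace(-1.0, 1.0, 17)   # one state dimension (n = 1)
a_grid = np.linspace(-1.0, 1.0, 5)    # one action dimension (m = 1)
# The grid size M grows as (points per axis)^(d_Q) with d_Q = 1 + n + m: the curse of dimensionality.

def encode(Q):
    """E_M: point-sample Q on the tensor-product grid, returning a finite array of values."""
    tt, ss, aa = np.meshgrid(t_grid, s_grid, a_grid, indexing="ij")
    return Q(tt, ss, aa)

def decode(values):
    """D_M: reconstruct a function on K_Q by multilinear interpolation of the grid values."""
    interp = RegularGridInterpolator((t_grid, s_grid, a_grid), values)
    return lambda t, s, a: interp((t, s, a))

Q_values = encode(lambda t, s, a: np.cos(t) * s - 0.5 * a**2)  # a stand-in Q-function
Q_fun = decode(Q_values)
print(Q_values.shape, Q_fun(0.3, 0.2, -0.4))
```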
- Neural Operator Implementation: Each block Δ^(l) corresponds to a neural network N_l operating on the finite-dimensional vector E_M(Q̂^(l)), followed by a decoder D_M. The design of N_l should consider architectures known to work well for learning mappings between finite-dimensional representations of functions (e.g., MLPs, or potentially more specialized networks inspired by convolutional or graph operators if applicable to the grid structure). Crucially, the composition D_M ∘ N_l ∘ E_M must produce a function with a controlled Lipschitz constant to ensure the stability of Q̂^(l)'s regularity. This might require constraints on N_l and D_M (e.g., spectral normalization on N_l, bounded output of N_l, Lipschitz basis functions for D_M).
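A hedged sketch of one residual operator block with a Lipschitz-control mechanism; spectral normalization is one plausible way to realize a controlled constant in the spirit of L_F∗, not the paper's prescription, and all module names and sizes are illustrative assumptions.

```python
# Minimal sketch of a residual neural-operator block acting on grid values (illustrative only).
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ResidualOperatorBlock(nn.Module):
    """Layer l: takes grid values E_M(Q_hat^(l)) and returns values of Q_hat^(l+1) = Q_hat^(l) + Delta^(l)."""
    def __init__(self, m_grid: int, hidden: int = 256):
        super().__init__()
        # Spectral norm bounds each linear map's operator norm by ~1, so the block's overall
        # Lipschitz constant stays controlled (the role played by L_F* in the analysis).
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(m_grid, hidden)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden, m_grid)),
        )

    def forward(self, q_values: torch.Tensor) -> torch.Tensor:
        # q_values: (batch, M) samples of the current Q-function estimate on the grid D_M
        return q_values + self.net(q_values)  # residual update Q_hat^(l+1) = Q_hat^(l) + Delta^(l)

blocks = nn.Sequential(*[ResidualOperatorBlock(m_grid=512) for _ in range(8)])  # an 8-layer stack
print(blocks(torch.zeros(4, 512)).shape)
```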
- Training Strategy: The paper's UAT does not specify a learning algorithm. Standard DQN training techniques (experience replay, target networks) would likely be needed to train this architecture in an RL setting, with a loss based on minimizing the Bellman error, either per layer or end-to-end.
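A hedged sketch of a standard Bellman-error loss with a target network, as used in ordinary DQN training; this is not part of the paper's theory, and the batch layout and function signatures are assumptions.

```python
# Sketch of a standard DQN-style Bellman-error loss (illustrative, not the paper's algorithm).
import torch
import torch.nn.functional as F

def bellman_error_loss(q_net, target_net, batch, gamma: float):
    """batch: (states, actions, rewards, next_states, dones) tensors sampled from a replay buffer;
    q_net and target_net map states to per-action Q-value vectors."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q_hat(s, a) for taken actions
    with torch.no_grad():  # frozen target network provides the bootstrapped Bellman target
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    return F.smooth_l1_loss(q_sa, target)  # minimize the empirical Bellman residual
```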
- Addressing CoD: The quantitative rates discussion highlights that standard grid methods lead to an exponential dependence of M on d_Q. Mitigating CoD in practice would require exploring more advanced discretization techniques (e.g., sparse grids if sufficient mixed smoothness is present) or representation schemes (e.g., low-rank tensors, non-grid methods like spectral methods or random features) if the problem structure allows. Proving the required higher-order regularity for Q∗ under realistic MDP assumptions is an open research direction.
- Limitations: The Lipschitz assumptions on the MDP coefficients are strong for some real-world scenarios. The theorem proves the existence of an approximating network; it does not address practical learnability via optimization algorithms or provide convergence guarantees for the RL training process itself.
In summary, the paper provides a structure-aware UAT for DQNs by viewing the network as an iterative function refiner based on the Bellman principle. It theoretically justifies why deep architectures, specifically residual-like ones using neural operators, are suitable for approximating the structured function Q∗, provided the underlying functions maintain sufficient regularity during the iterative process. This regularity, guaranteed by the control problem's properties and preserved by the proposed neural operator architecture, is key to the theorem.