- The paper establishes a universal approximation theorem for DQN architectures by demonstrating that a residual network can approximate the optimal Q-function through iterative Bellman updates.
- It leverages the fixed-point property of the Bellman operator and BSDE analysis to establish uniform Lipschitz continuity and boundedness across the Bellman iterates and the optimal Q-function.
- Practical insights include designing neural operator blocks within residual networks and addressing high-dimensional challenges like the curse of dimensionality in MDP discretization.
This paper, "Universal Approximation Theorem for Deep Q-Learning via FBSDE System" (2505.06023), addresses a fundamental question in Deep Reinforcement Learning (DRL): why can Deep Q-Networks (DQNs) effectively approximate the optimal action-value function Q∗? Unlike general Universal Approximation Theorems (UATs) that state neural networks can approximate any continuous function, this work provides a UAT specifically tailored to DQNs by leveraging the inherent structure of Q∗ as the fixed point of the Bellman optimality operator.
The core idea is to design and analyze a DQN architecture that mimics the value iteration process Q^(k+1) = B Q^(k), where B is the Bellman operator. The paper considers a continuous-time Markov Decision Process (MDP) formulation discretized over small time intervals δ. The optimal Q-function Q∗(t,s,a) for this δ-discretized problem is the unique fixed point of the Bellman operator defined on the compact domain K_Q = [0,T] × S × A.
The proposed DQN architecture takes a residual network form operating on a representation of the Q-function. If Q̂^(l) is the network's estimate of the Q-function after layer l, the next estimate is Q̂^(l+1) = Q̂^(l) + Δ^(l)(Q̂^(l)), where Δ^(l) is the function computed by layer l. The goal is for Δ^(l)(Q̂^(l)) to approximate J(Q̂^(l)) = B Q̂^(l) − Q̂^(l). Thus, each layer is designed to approximate the residual difference between the current Q-function estimate and the function resulting from one Bellman update. This means Q̂^(l+1) aims to approximate B Q̂^(l).
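To make the residual-layer view concrete, here is a minimal sketch in which each layer applies the exact residual J(Q) = BQ − Q to a tabular Q-function. The toy finite MDP, discount factor, and names (P, R, gamma) are illustrative assumptions only; the paper works with a δ-discretized continuous-time MDP and learned neural operator blocks rather than the exact residual.

```python
# Minimal sketch of the residual view of value iteration (illustrative only).
# A toy finite MDP stands in for the paper's delta-discretized continuous-time MDP;
# P, R, gamma and the tabular representation are assumptions for illustration.
import numpy as np

n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=(n_states, n_actions))                       # R[s, a]

def bellman_residual(Q):
    """J(Q) = BQ - Q, the target each residual layer Delta^(l) is meant to approximate."""
    BQ = R + gamma * np.einsum("saz,z->sa", P, Q.max(axis=1))
    return BQ - Q

# An L-layer "network" whose l-th layer adds an (here exact) residual block:
# Q_hat^(l+1) = Q_hat^(l) + Delta^(l)(Q_hat^(l)), i.e. one Bellman update per layer.
Q_hat = np.zeros((n_states, n_actions))
for layer in range(60):
    Q_hat = Q_hat + bellman_residual(Q_hat)  # a learned neural operator block in the paper's setting

print("max Bellman residual after 60 layers:", np.abs(bellman_residual(Q_hat)).max())
```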
A central technical contribution is the proof that, under standard Lipschitz continuity and boundedness assumptions on the MDP's dynamics (h, σ), rewards (r), and terminal conditions (g) (Assumption 2.1), the exact Bellman iterates Q^(k) and the optimal function Q∗ are uniformly Lipschitz continuous and uniformly bounded on the compact domain K_Q. This property is crucial because it guarantees that the sequence of functions {Q^(k)}_{k≥0} ∪ {Q∗} lies within a compact subset of the space of continuous functions C(K_Q). The analysis of the regularity of a single Bellman step is informed by connections to Backward Stochastic Differential Equations (BSDEs), which provide tools to establish Lipschitz continuity of the value function with respect to its arguments under appropriate conditions.
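The uniform regularity statement can be paraphrased as follows (a paraphrase; the constant names M_Q and L_Q are generic placeholders, not necessarily the paper's notation):

```latex
\exists\, M_Q,\, L_Q < \infty:\quad
\sup_{k \ge 0}\ \|Q^{(k)}\|_{\infty,\,K_Q} \le M_Q,
\qquad
|Q^{(k)}(t,s,a) - Q^{(k)}(t',s',a')|
\;\le\; L_Q\,\big(|t-t'| + \|s-s'\| + \|a-a'\|\big),
```

with the same bounds holding for Q∗. Uniform boundedness plus a uniform Lipschitz constant is exactly what (via the Arzelà–Ascoli theorem) places the whole family in a compact subset of C(K_Q).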
The paper then introduces the concept of a "neural operator" as a function G̃(Q) = D_M(N(E_M(Q))) that maps an input function Q (represented by its values E_M(Q) on a finite grid D_M) to an output function (reconstructed by D_M from the output of a neural network N). The UAT established relies on the assumption (Assumption 4.1) that for any compact set of functions K consisting of uniformly Lipschitz functions, and any continuous operator G mapping K to C(K_Q) such that G(K) is also uniformly Lipschitz, there exists such a neural operator G̃ that can approximate G arbitrarily well in the supremum norm on K. Crucially, this assumption also requires that the approximating neural operator G̃(Q) itself, as a function on K_Q, is Lipschitz continuous with a controllable uniform Lipschitz constant L_F∗.
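A minimal sketch of the encode-process-decode composition G̃ = D_M ∘ N ∘ E_M, here on a 1-D domain with an untrained MLP standing in for N and piecewise-linear interpolation standing in for D_M; all of these choices are illustrative assumptions, not the paper's construction.

```python
# Sketch of the encode-process-decode structure of a neural operator G~ (illustration only).
import numpy as np

M = 32
grid = np.linspace(0.0, 1.0, M)        # finite discretization of the input domain

def encode(Q):                         # E_M: point-sample the input function on the grid
    return Q(grid)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, M)) / np.sqrt(M), np.zeros(64)
W2, b2 = rng.normal(size=(M, 64)) / np.sqrt(64), np.zeros(M)

def N(v):                              # a small (untrained) MLP acting on the vector of grid values
    return W2 @ np.tanh(W1 @ v + b1) + b2

def decode(values):                    # D_M: rebuild a function by piecewise-linear interpolation
    return lambda x: np.interp(x, grid, values)

def G_tilde(Q):                        # G~(Q) = D_M(N(E_M(Q))): a function in, a function out
    return decode(N(encode(Q)))

out = G_tilde(np.sin)                  # apply the operator to the input function Q(x) = sin(x)
print(out(0.25), out(0.5))
```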
The main theorem (Theorem 4.3) states that for any desired accuracy ϵ > 0, there exists a number of layers L and a suitable neural operator architecture (defined by a discretization D_M and neural networks N_l for each layer l) such that the final output of the L-layer DQN, Q̂^(L), approximates Q∗ with ‖Q̂^(L) − Q∗‖∞ < ϵ.
The proof combines the convergence of value iteration (Q^(L) → Q∗) with the approximation power and stability of the neural operators. The total error is decomposed into the value iteration truncation error ‖Q^(L) − Q∗‖∞ and the neural network approximation error ‖Q̂^(L) − Q^(L)‖∞.
- The value iteration error decreases exponentially with the number of layers L due to the Bellman operator's contractivity: ‖Q^(L) − Q∗‖∞ ≤ (e^(−λδ))^L ‖Q^(0) − Q∗‖∞. L is chosen large enough to make this error small.
- The network approximation error is the accumulated error from each layer. The error propagation is analyzed via e_{l+1} = ‖Q̂^(l+1) − Q^(l+1)‖∞ ≤ e^(−λδ) e_l + δ_l, where δ_l is the per-layer operator approximation error. If each layer's operator block approximates the target residual operator J with accuracy ϵ_1 on its input function Q̂^(l), the total error e_L is bounded by approximately ϵ_1/(1 − e^(−λδ)).
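Unrolling the per-layer recursion makes the accumulation bound explicit; a sketch of the standard geometric-series argument, writing γ := e^(−λδ) < 1 and assuming each per-layer error satisfies δ_l ≤ ϵ_1:

```latex
e_L \;\le\; \gamma\, e_{L-1} + \epsilon_1
    \;\le\; \gamma^{L} e_0 + \epsilon_1 \sum_{j=0}^{L-1} \gamma^{j}
    \;\le\; \gamma^{L} e_0 + \frac{\epsilon_1}{1-\gamma},
\qquad
\|\hat{Q}^{(L)} - Q^{*}\|_\infty
    \;\le\; \underbrace{\gamma^{L}\,\|Q^{(0)} - Q^{*}\|_\infty}_{\text{value iteration error}}
      \;+\; \underbrace{e_L}_{\text{network approximation error}}.
```

Choosing L so the first term is below ϵ/2 and taking ϵ_1 ≤ (1 − γ)ϵ/2 (with e_0 = 0 if the network's initial estimate matches Q^(0)) then yields the overall ϵ accuracy.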
The ability to bound e_L by controlling ϵ_1 requires applying the neural operator UAT (Lemma 4.2) at each layer. This is possible because the set of input functions to each layer, {Q̂^(l)}_{l=0}^{L−1}, is shown to reside within a common compact set of uniformly bounded and uniformly Lipschitz functions (the set K_target). The uniform Lipschitz property of Q̂^(l) is maintained because the function implemented by each neural operator block has a controlled Lipschitz constant L_F∗, preventing the Lipschitz constant from growing unboundedly across layers, despite the (2K_B + 1) factor in the recurrence for Lipschitz constants L_{Q̂^(l+1)} ≤ (2K_A + L_F∗) + (2K_B + 1) L_{Q̂^(l)}.
Practical Implications and Implementation:
- Architecture Design: The paper suggests implementing a DQN as a deep residual network where each block is a neural operator aiming to learn the mapping Q↦BQ−Q. This is a specific structural inductive bias derived from the problem's Bellman structure.
- Function Representation: The Q-function Q(t,s,a) is defined on a domain of dimension d_Q = 1 + n + m (time plus state and action dimensions). The network operates on a discretized representation, so an implementation requires choosing (a sketch follows this list):
- A discretization grid D_M for K_Q. This could be a uniform grid, but d_Q can be large, leading to the curse of dimensionality (CoD) for the grid size M ∼ (1/mesh size)^(d_Q).
- An encoding E_M (e.g., simple point sampling).
- A decoding D_M (e.g., multilinear interpolation or a learned decoder).
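As referenced above, here is a minimal sketch of a uniform-grid representation of Q(t,s,a) with point-sampling encoding and multilinear-interpolation decoding; the grid sizes, one-dimensional state and action spaces, and use of scipy are assumptions made purely for illustration.

```python
# Sketch of a uniform-grid representation of Q(t, s, a) on K_Q = [0,T] x S x A (illustrative only).
import numpy as np
from scipy.interpolate import RegularGridInterpolator

T = 1.0
t_grid = np.linspace(0.0, T, 9)       # time axis
s_grid = np.linspace(-1.0, 1.0, 17)   # one state dimension (n = 1)
a_grid = np.linspace(-1.0, 1.0, 5)    # one action dimension (m = 1)
# The grid size M grows as (points per axis)^(d_Q) with d_Q = 1 + n + m: the curse of dimensionality.

def encode(Q):
    """E_M: point-sample Q on the tensor-product grid, returning a finite array of values."""
    tt, ss, aa = np.meshgrid(t_grid, s_grid, a_grid, indexing="ij")
    return Q(tt, ss, aa)

def decode(values):
    """D_M: reconstruct a function on K_Q by multilinear interpolation of the grid values."""
    interp = RegularGridInterpolator((t_grid, s_grid, a_grid), values)
    return lambda t, s, a: interp((t, s, a))

Q_values = encode(lambda t, s, a: np.cos(t) * s - 0.5 * a**2)  # a stand-in Q-function
Q_fun = decode(Q_values)
print(Q_values.shape, Q_fun(0.3, 0.2, -0.4))
```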
- Neural Operator Implementation: Each block Δ^(l) corresponds to a neural network N_l operating on the finite-dimensional vector E_M(Q̂^(l)), followed by a decoder D_M. The design of N_l should consider architectures known to work well for learning mappings between finite-dimensional representations of functions (e.g., MLPs, or potentially more specialized networks inspired by convolutional or graph operators if applicable to the grid structure). Crucially, the composition D_M ∘ N_l ∘ E_M must produce a function with a controlled Lipschitz constant to ensure the stability of Q̂^(l)'s regularity. This might require constraints on N_l and D_M (e.g., spectral normalization on N_l, bounded output of N_l, Lipschitz basis functions for D_M).
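A hedged sketch of one residual operator block with a Lipschitz-control mechanism; spectral normalization is one plausible way to realize a controlled constant in the spirit of L_F∗, not the paper's prescription, and all module names and sizes are illustrative assumptions.

```python
# Minimal sketch of a residual neural-operator block acting on grid values (illustrative only).
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ResidualOperatorBlock(nn.Module):
    """Layer l: takes grid values E_M(Q_hat^(l)) and returns values of Q_hat^(l+1) = Q_hat^(l) + Delta^(l)."""
    def __init__(self, m_grid: int, hidden: int = 256):
        super().__init__()
        # Spectral norm bounds each linear map's operator norm by ~1, so the block's overall
        # Lipschitz constant stays controlled (the role played by L_F* in the analysis).
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(m_grid, hidden)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden, m_grid)),
        )

    def forward(self, q_values: torch.Tensor) -> torch.Tensor:
        # q_values: (batch, M) samples of the current Q-function estimate on the grid D_M
        return q_values + self.net(q_values)  # residual update Q_hat^(l+1) = Q_hat^(l) + Delta^(l)

blocks = nn.Sequential(*[ResidualOperatorBlock(m_grid=512) for _ in range(8)])  # an 8-layer stack
print(blocks(torch.zeros(4, 512)).shape)
```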
- Training Strategy: The paper's UAT does not specify a learning algorithm. Standard DQN training techniques (experience replay, target networks) would likely be needed to train this architecture in an RL setting, with a loss based on minimizing the Bellman error, either per layer or end-to-end.
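A hedged sketch of a standard Bellman-error loss with a target network, as used in ordinary DQN training; this is not part of the paper's theory, and the batch layout and function signatures are assumptions.

```python
# Sketch of a standard DQN-style Bellman-error loss (illustrative, not the paper's algorithm).
import torch
import torch.nn.functional as F

def bellman_error_loss(q_net, target_net, batch, gamma: float):
    """batch: (states, actions, rewards, next_states, dones) tensors sampled from a replay buffer;
    q_net and target_net map states to per-action Q-value vectors."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q_hat(s, a) for taken actions
    with torch.no_grad():  # frozen target network provides the bootstrapped Bellman target
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    return F.smooth_l1_loss(q_sa, target)  # minimize the empirical Bellman residual
```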
- Addressing CoD: The quantitative rates discussion highlights that standard grid methods lead to an exponential dependence of M on d_Q. Mitigating CoD in practice would require exploring more advanced discretization techniques (e.g., sparse grids if sufficient mixed smoothness is present) or representation schemes (e.g., low-rank tensors, non-grid methods like spectral methods or random features) if the problem structure allows. Proving the required higher-order regularity for Q∗ under realistic MDP assumptions is an open research direction.
- Limitations: The Lipschitz assumptions on the MDP coefficients are strong for some real-world scenarios. The theorem proves the existence of an approximating network; it does not address practical learnability via optimization algorithms or provide convergence guarantees for the RL training process itself.
In summary, the paper provides a structure-aware UAT for DQNs by viewing the network as an iterative function refiner based on the Bellman principle. It theoretically justifies why deep architectures, specifically residual-like ones using neural operators, are suitable for approximating the structured function Q∗, provided the underlying functions maintain sufficient regularity during the iterative process. This regularity, guaranteed by the control problem's properties and preserved by the proposed neural operator architecture, is key to the theorem.