
Universal Approximation Theorem for Deep Q-Learning via FBSDE System (2505.06023v1)

Published 9 May 2025 in cs.LG, cs.AI, and math.OC

Abstract: The approximation capabilities of Deep Q-Networks (DQNs) are commonly justified by general Universal Approximation Theorems (UATs) that do not leverage the intrinsic structural properties of the optimal Q-function, the solution to a Bellman equation. This paper establishes a UAT for a class of DQNs whose architecture is designed to emulate the iterative refinement process inherent in Bellman updates. A central element of our analysis is the propagation of regularity: while the transformation induced by a single Bellman operator application exhibits regularity, for which Backward Stochastic Differential Equations (BSDEs) theory provides analytical tools, the uniform regularity of the entire sequence of value iteration iterates--specifically, their uniform Lipschitz continuity on compact domains under standard Lipschitz assumptions on the problem data--is derived from finite-horizon dynamic programming principles. We demonstrate that layers of a deep residual network, conceived as neural operators acting on function spaces, can approximate the action of the Bellman operator. The resulting approximation theorem is thus intrinsically linked to the control problem's structure, offering a proof technique wherein network depth directly corresponds to iterations of value function refinement, accompanied by controlled error propagation. This perspective reveals a dynamic systems view of the network's operation on a space of value functions.

Summary

  • The paper establishes a universal approximation theorem for DQN architectures by demonstrating that a residual network can approximate the optimal Q-function through iterative Bellman updates.
  • It leverages the fixed-point property of the Bellman operator and BSDE analysis to ensure uniform Lipschitz continuity and boundedness across Q-functions.
  • Practical insights include designing neural operator blocks within residual networks and addressing high-dimensional challenges like the curse of dimensionality in MDP discretization.

This paper, "Universal Approximation Theorem for Deep Q-Learning via FBSDE System" (2505.06023), addresses a fundamental question in Deep Reinforcement Learning (DRL): why can Deep Q-Networks (DQNs) effectively approximate the optimal action-value function $Q^*$? Unlike general Universal Approximation Theorems (UATs) that state neural networks can approximate any continuous function, this work provides a UAT specifically tailored to DQNs by leveraging the inherent structure of $Q^*$ as the fixed point of the Bellman optimality operator.

The core idea is to design and analyze a DQN architecture that mimics the value iteration process $Q^{(k+1)} = B Q^{(k)}$, where $B$ is the Bellman operator. The paper considers a continuous-time Markov Decision Process (MDP) formulation discretized over small time intervals $\delta$. The optimal Q-function $Q^*(t,s,a)$ for this $\delta$-discretized problem is the unique fixed point of the Bellman operator defined on the compact domain $K_Q = [0,T] \times S \times A$.
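The summary does not write out the discretized Bellman operator explicitly. As a rough illustration only, a single backup could be sketched as below, assuming an Euler-Maruyama step for the controlled dynamics $(h, \sigma)$, a Monte Carlo estimate of the expectation, a finite candidate-action set for the maximization, and the discount form $e^{-\lambda\delta}$ that appears in the error analysis later; all names are illustrative, not the paper's construction.

```python
import numpy as np

def bellman_backup(Q, t, s, a, h, sigma, r, lam, delta, actions, n_mc=256, rng=None):
    """One (approximate) application of the delta-discretized Bellman operator B at (t, s, a).

    Q       : callable (t, s, a) -> float, current Q-function estimate
    h, sigma: drift and diffusion coefficients of the controlled dynamics
    r       : running reward r(s, a)
    lam     : discount rate, giving the contraction factor exp(-lam * delta)
    actions : finite set of candidate actions used to evaluate the max
    """
    rng = np.random.default_rng() if rng is None else rng
    # Euler-Maruyama sample of the next state (an assumed discretization scheme).
    noise = rng.standard_normal((n_mc, np.size(s)))
    s_next = s + h(s, a) * delta + sigma(s, a) * np.sqrt(delta) * noise
    # Greedy continuation value, averaged over the Monte Carlo noise samples.
    v_next = np.mean([max(Q(t + delta, sn, ap) for ap in actions) for sn in s_next])
    return r(s, a) * delta + np.exp(-lam * delta) * v_next
```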

The proposed DQN architecture takes a residual network form operating on a representation of the Q-function. If $\hat{Q}^{(l)}$ is the network's estimate of the Q-function after layer $l$, the next estimate is $\hat{Q}^{(l+1)} = \hat{Q}^{(l)} + \Delta^{(l)}(\hat{Q}^{(l)})$, where $\Delta^{(l)}$ is the function computed by layer $l$. The goal is for $\Delta^{(l)}(\hat{Q}^{(l)})$ to approximate $J(\hat{Q}^{(l)}) = B \hat{Q}^{(l)} - \hat{Q}^{(l)}$. Thus, each layer is designed to approximate the residual difference between the current Q-function estimate and the function resulting from one Bellman update, so that $\hat{Q}^{(l+1)}$ approximates $B \hat{Q}^{(l)}$.
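Viewed as pseudocode, the forward pass is an unrolled fixed-point iteration. A minimal sketch, assuming the Q-function is carried through the network as its vector of values on a finite grid (the encoded representation discussed below) and that each block is a learned approximation of the residual operator $J$:

```python
import numpy as np

def residual_dqn_forward(q0_grid, delta_blocks):
    """Unroll the residual architecture on a grid representation of the Q-function.

    q0_grid      : np.ndarray of Q-values sampled on the grid D_M (the encoded function)
    delta_blocks : list of callables; block l takes the current grid of values and returns
                   an approximation of the residual J(Q) = B Q - Q on the same grid
    """
    q = q0_grid
    for block in delta_blocks:
        q = q + block(q)   # hat Q^{(l+1)} = hat Q^{(l)} + Delta^{(l)}(hat Q^{(l)}) ~ B hat Q^{(l)}
    return q
```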

A central technical contribution is the proof that, under standard Lipschitz continuity and boundedness assumptions on the MDP's dynamics ($h, \sigma$), rewards ($r$), and terminal conditions ($g$) (Assumption 2.1), the exact Bellman iterates $Q^{(k)}$ and the optimal function $Q^*$ are uniformly Lipschitz continuous and uniformly bounded on the compact domain $K_Q$. This property is crucial because it guarantees that the sequence of functions $\{Q^{(k)}\}_{k \ge 0} \cup \{Q^*\}$ lies within a compact subset of the space of continuous functions $C(K_Q)$. The analysis of the regularity of a single Bellman step is informed by connections to Backward Stochastic Differential Equations (BSDEs), which provide tools to establish Lipschitz continuity of the value function with respect to its arguments under appropriate conditions.

The paper then introduces the concept of a "neural operator" as a map $\tilde{\mathcal{G}}(Q) = \mathcal{D}_M(N(\mathcal{E}_M(Q)))$ that takes an input function $Q$ (represented by its values $\mathcal{E}_M(Q)$ on a finite grid $D_M$) to an output function (reconstructed by $\mathcal{D}_M$ from the output of a neural network $N$). The UAT established relies on the assumption (Assumption 4.1) that for any compact set of functions $\mathcal{K}$ consisting of uniformly Lipschitz functions, and any continuous operator $\mathcal{G}$ mapping $\mathcal{K}$ to $C(K_Q)$ such that $\mathcal{G}(\mathcal{K})$ is also uniformly Lipschitz, there exists such a neural operator $\tilde{\mathcal{G}}$ that can approximate $\mathcal{G}$ arbitrarily well in the supremum norm on $\mathcal{K}$. Crucially, this assumption also requires that the approximating neural operator $\tilde{\mathcal{G}}(Q)$ itself, as a function on $K_Q$, is Lipschitz continuous with a controllable uniform Lipschitz constant $L_F^*$.
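A bare-bones sketch of such an encode/apply/decode pipeline, with point sampling for $\mathcal{E}_M$ and a deliberately crude nearest-neighbor decoder standing in for the Lipschitz multilinear interpolation the regularity argument would actually call for (all helper names are illustrative):

```python
import numpy as np

def encode(Q, grid_points):
    """E_M: represent the function Q by its values on the finite grid D_M."""
    return np.array([Q(x) for x in grid_points])

def decode(values, grid_points):
    """D_M: rebuild a function from grid values.

    Nearest-neighbor reconstruction keeps the sketch short; a Lipschitz decoder
    such as multilinear interpolation is what the theory requires.
    """
    grid = np.asarray(grid_points)
    def Q_tilde(x):
        i = np.argmin(np.linalg.norm(grid - np.asarray(x), axis=1))
        return values[i]
    return Q_tilde

def neural_operator(Q, grid_points, network):
    """G_tilde(Q) = D_M( N( E_M(Q) ) ), with N a finite-dimensional network."""
    return decode(network(encode(Q, grid_points)), grid_points)
```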

The main theorem (Theorem 4.3) states that for any desired accuracy $\epsilon > 0$, there exists a number of layers $L$ and a suitable neural operator architecture (defined by a discretization $D_M$ and neural networks $N_l$ for each layer $l$) such that the final output of the $L$-layer DQN, $\hat{Q}^{(L)}$, approximates $Q^*$ with $\|\hat{Q}^{(L)} - Q^*\| < \epsilon$.

The proof combines the convergence of value iteration ($Q^{(L)} \to Q^*$) with the approximation power and stability of the neural operators. The total error is decomposed into the value iteration truncation error $\|Q^{(L)} - Q^*\|$ and the neural network approximation error $\|\hat{Q}^{(L)} - Q^{(L)}\|$.

  • The value iteration error decreases exponentially with the number of layers $L$ due to the Bellman operator's contractivity, $\|Q^{(L)} - Q^*\| \le (e^{-\lambda\delta})^L \|Q^{(0)} - Q^*\|$; $L$ is chosen large enough to make this error small.
  • The network approximation error is the accumulated error from each layer. The error propagation satisfies $e_{l+1} = \|\hat{Q}^{(l+1)} - Q^{(l+1)}\| \le e^{-\lambda\delta} e_l + \|\delta_l\|$, where $\delta_l$ is the per-layer operator approximation error. If each layer's operator block approximates the target residual operator $J$ with accuracy $\epsilon_1$ on its input function $\hat{Q}^{(l)}$, the total error $e_L$ is bounded by approximately $\epsilon_1 / (1-e^{-\lambda\delta})$.

The ability to bound $e_L$ by controlling $\epsilon_1$ requires applying the neural operator UAT (Lemma 4.2) at each layer. This is possible because the set of input functions to each layer, $\{\hat{Q}^{(l)}\}_{l=0}^{L-1}$, is shown to reside within a common compact set of uniformly bounded and uniformly Lipschitz functions (the set $\mathcal{K}_{\text{target}}$). The uniform Lipschitz property of $\hat{Q}^{(l)}$ is maintained because the function implemented by each neural operator block has a controlled Lipschitz constant $L_F^*$, preventing the Lipschitz constant from growing unboundedly across layers, despite the $(2K_B+1)$ factor in the recurrence for Lipschitz constants $L_{\hat{Q}^{(l+1)}} \le (2K_A+L_F^*) + (2K_B+1)L_{\hat{Q}^{(l)}}$.
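To make the error budget concrete, a small helper that splits a target accuracy evenly between the truncation term and the accumulated network term (the even split and the example numbers are illustrative choices, not prescribed by the paper) could look like:

```python
import numpy as np

def depth_and_layer_accuracy(eps, lam, delta, init_gap):
    """Split a target accuracy eps between the two error terms of the proof.

    Value-iteration error:  (e^{-lam*delta})^L * init_gap   <= eps / 2
    Network error:          eps_1 / (1 - e^{-lam*delta})    <= eps / 2
    """
    gamma = np.exp(-lam * delta)            # contraction factor of the Bellman operator
    L = int(np.ceil(np.log(eps / (2 * init_gap)) / np.log(gamma)))
    eps_1 = (eps / 2) * (1 - gamma)         # required per-layer operator accuracy
    return L, eps_1

# Example: lam = 1, delta = 0.1, ||Q^(0) - Q^*|| = 10, target eps = 1e-2
# gives gamma ~ 0.905, L = 77 layers, eps_1 ~ 4.8e-4.
```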

Practical Implications and Implementation:

  • Architecture Design: The paper suggests implementing a DQN as a deep residual network where each block is a neural operator aiming to learn the mapping $Q \mapsto BQ - Q$. This is a specific structural inductive bias derived from the problem's Bellman structure.
  • Function Representation: The Q-function $Q(t,s,a)$ lives in a high-dimensional space (1 + state dimension + action dimension). The network operates on a discretized representation. Implementation requires choosing:
    • A discretization grid $D_M$ for $K_Q$. This could be a uniform grid, but the dimension $d_Q = 1 + n + m$ can be large, leading to the curse of dimensionality (CoD) for grid size $M \sim (1/\text{mesh size})^{d_Q}$.
    • Encoding $\mathcal{E}_M$ (e.g., simple point sampling).
    • Decoding $\mathcal{D}_M$ (e.g., multilinear interpolation or a learned decoder).
  • Neural Operator Implementation: Each block $\Delta^{(l)}$ corresponds to a neural network $N_l$ operating on the finite-dimensional vector $\mathcal{E}_M(\hat{Q}^{(l)})$, followed by a decoder $\mathcal{D}_M$. The design of $N_l$ should consider architectures known to work well for learning mappings between finite-dimensional representations of functions (e.g., MLPs, or potentially more specialized networks inspired by convolutional or graph operators if applicable to the grid structure). Crucially, the composition $\mathcal{D}_M \circ N_l \circ \mathcal{E}_M$ must produce a function with a controlled Lipschitz constant to ensure the stability of $\hat{Q}^{(l)}$'s regularity. This might require constraints on $N_l$ and $\mathcal{D}_M$ (e.g., spectral normalization on $N_l$, bounded output of $N_l$, Lipschitz basis functions for $\mathcal{D}_M$); a minimal sketch of such a constraint appears after this list.
  • Training Strategy: The paper's UAT doesn't specify the learning algorithm. Standard DQN training techniques (like experience replay, target networks) would likely be needed to train this architecture in an RL setting. The loss function would likely be based on minimizing the Bellman error, possibly per layer or end-to-end.
  • Addressing CoD: The quantitative rates discussion highlights that standard grid methods lead to an exponential dependence of $M$ on $d_Q$. Mitigating CoD in practice would require exploring more advanced discretization techniques (e.g., sparse grids if sufficient mixed smoothness is present) or representation schemes (e.g., low-rank tensors, non-grid methods like spectral methods or random features) if the problem structure allows. Proving the required higher-order regularity for $Q^*$ under realistic MDP assumptions is an open research direction.
  • Limitations: The Lipschitz assumptions on MDP coefficients are strong for some real-world scenarios. The theorem proves existence of an approximating network, not the practical learnability via optimization algorithms or convergence guarantees for the RL training process itself.
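One standard way to keep a block's Lipschitz constant under control, as noted in the Neural Operator Implementation item above, is to constrain the spectral norm of each weight matrix. The sketch below is a generic illustration of that technique, not the paper's construction; all names and the specific clipping rule are assumptions.

```python
import numpy as np

def clip_spectral_norm(W, max_norm):
    """Rescale a weight matrix so its largest singular value is at most max_norm."""
    s = np.linalg.norm(W, 2)                      # spectral norm (largest singular value)
    return W if s <= max_norm else W * (max_norm / s)

def lipschitz_mlp_forward(x, weights, biases, max_layer_norm):
    """Forward pass of an MLP with a bounded Lipschitz constant.

    With 1-Lipschitz activations (ReLU here), the network's Lipschitz constant is
    bounded by the product of the layer spectral norms, so constraining each layer
    constrains the whole block; this is one concrete way to realize a controlled
    constant in the spirit of L_F^*.
    """
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        W = clip_spectral_norm(W, max_layer_norm)
        h = W @ h + b
        if l < len(weights) - 1:                  # no nonlinearity on the output layer
            h = np.maximum(h, 0.0)
    return h
```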

In summary, the paper provides a structure-aware UAT for DQNs by viewing the network as an iterative function refiner based on the Bellman principle. It theoretically justifies why deep architectures, specifically residual-like ones using neural operators, are suitable for approximating the structured function $Q^*$, provided the underlying functions maintain sufficient regularity during the iterative process. This regularity, guaranteed by the control problem's properties and preserved by the proposed neural operator architecture, is key to the theorem.
