Chebyshev-DQN: Integrating Chebyshev Polynomials in DQN
- Chebyshev-DQN is a reinforcement learning variant that leverages Chebyshev polynomial bases to reduce representation bias and improve numerical conditioning in Q-learning.
- It introduces a Chebyshev feature layer that normalizes state inputs and maps them into polynomial features, enabling more stable and efficient network updates.
- Empirical evaluations on benchmarks like CartPole-v1 demonstrate that optimal Chebyshev degree selection boosts convergence and average rewards while preventing overfitting.
Chebyshev-DQN (Ch-DQN) is a variant of the Deep Q-Network (DQN) reinforcement learning algorithm that explicitly incorporates Chebyshev polynomial bases into the value function approximator. The approach is motivated by the superior approximation and stability properties of orthogonal polynomials, aiming to reduce representation bias and improve the numerical conditioning of Q-learning updates. Chebyshev-DQN has demonstrated improved asymptotic and stability performance on standard benchmarks such as CartPole-v1, contingent on appropriate hyperparameterization of the polynomial degree (Yazdannik et al., 20 Aug 2025).
1. Mathematical Foundation of Chebyshev-DQN
Chebyshev polynomials of the first kind, $T_n(x)$, constitute a set of orthogonal polynomials defined on the interval $[-1, 1]$. They are recursively defined by:

$$T_0(x) = 1, \quad T_1(x) = x, \quad T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x).$$

Alternatively, $T_n(x) = \cos(n \arccos x)$. Orthogonality is given by:

$$\int_{-1}^{1} \frac{T_m(x)\,T_n(x)}{\sqrt{1 - x^2}}\,dx = 0 \quad \text{for } m \neq n.$$

Functional approximation with Chebyshev polynomials is motivated by their near-minimax property: the truncated Chebyshev expansion of degree $N$ comes close to minimizing the maximum absolute deviation among all degree-$N$ polynomials for a given continuous function on $[-1, 1]$. These traits, orthogonality and near-minimax error, reduce the worst-case approximation error and improve the conditioning of semi-gradient steps in Q-learning.
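Both definitions above can be checked numerically; a small sketch verifying that the recurrence and the trigonometric closed form agree:

```python
import numpy as np

def cheb_recursive(x, n):
    """T_n(x) via the recurrence T_{n+1} = 2x T_n - T_{n-1}."""
    t_prev, t_curr = np.ones_like(x), x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

x = np.linspace(-1, 1, 101)
for n in range(6):
    # Closed form T_n(x) = cos(n * arccos(x)) agrees with the recurrence.
    assert np.allclose(cheb_recursive(x, n), np.cos(n * np.arccos(x)), atol=1e-10)
```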
2. Network Architecture Modifications
The Ch-DQN modifies the standard DQN architecture to introduce a non-parametric Chebyshev feature mapping as follows:
- Input Normalization: Each state feature $s_i$ is linearly rescaled to $\tilde{s}_i \in [-1, 1]$ to match the definition domain of $T_n$.
- Chebyshev Feature Layer: For a normalized state $\tilde{s} \in [-1, 1]^d$, compute $T_0(\tilde{s}_i), T_1(\tilde{s}_i), \dots, T_K(\tilde{s}_i)$ for each feature. The resulting vector $\phi(s)$ has dimension $d(K+1)$.
- Q-Value Mapping: Chebyshev features are input to a small feed-forward network (typically 1–2 ReLU MLP layers), yielding $|\mathcal{A}|$ outputs (one per action).
When using a linear output layer,

$$Q(s, a) = w_a^{\top} \phi(s) + b_a,$$

where $w_a \in \mathbb{R}^{d(K+1)}$, $b_a \in \mathbb{R}$.
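The feature construction above can be sketched in NumPy. The CartPole-style bounds (`low`, `high`) and the random linear Q head are illustrative assumptions, not values from the paper:

```python
import numpy as np

class ChebyshevFeatures:
    """Maps a raw state to Chebyshev features T_0..T_K per dimension.

    `low`/`high` are per-feature bounds used for the [-1, 1] rescaling;
    the specific bounds below are assumptions for illustration.
    """
    def __init__(self, low, high, degree):
        self.low, self.high = np.asarray(low, float), np.asarray(high, float)
        self.degree = degree

    def __call__(self, s):
        # Linearly rescale each feature into [-1, 1], clipped for safety.
        x = np.clip(2 * (s - self.low) / (self.high - self.low) - 1, -1, 1)
        feats = [np.ones_like(x), x]          # T_0, T_1
        for _ in range(self.degree - 1):      # recurrence up to T_K
            feats.append(2 * x * feats[-1] - feats[-2])
        return np.concatenate(feats[: self.degree + 1])  # shape: d*(K+1)

phi = ChebyshevFeatures(low=[-2.4, -3.0, -0.21, -3.0],
                        high=[2.4, 3.0, 0.21, 3.0], degree=4)
features = phi(np.zeros(4))                   # 4 features x (4+1) = 20 dims

# Linear Q head: one row of weights per action (2 actions in CartPole).
rng = np.random.default_rng(0)
W, b = 0.01 * rng.normal(size=(2, features.size)), np.zeros(2)
q_values = W @ features + b
```

A deeper MLP head would replace the final two lines; the feature layer itself has no trainable parameters.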
3. Training Procedure and Loss Formulation
Chebyshev-DQN follows the canonical DQN training pipeline, including experience replay, target network, and ε-greedy action selection. The specific updates are:
- Temporal Difference (TD) Target: For each sampled tuple $(s, a, r, s')$:

$$y = r + \gamma \max_{a'} Q_{\theta^-}(s', a'),$$

where $\theta^-$ denotes the target-network parameters.
- Loss Function: Minimize the mean squared TD error over minibatches drawn from the replay buffer:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\left(y - Q_\theta(s, a)\right)^2\right].$$
- Unchanged Modules: TD backup rule, target network updates, and replay buffer use remain as in DQN. The only architectural change is the Chebyshev feature layer.
Key hyperparameters include: the Adam optimizer and its learning rate, the discount factor $\gamma$, replay buffer size 50,000, batch size 64, target update every 500 steps, and the Chebyshev degree $K$, selected by cross-validation.
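The TD-target and loss computations above can be sketched for a minibatch, assuming the standard DQN backup (the `gamma=0.99` default here is illustrative; the source does not state the exact value):

```python
import numpy as np

def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), zeroed at terminal states.

    gamma=0.99 is an illustrative default, not a value from the paper.
    """
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

def mse_td_loss(q_sa, targets):
    # Mean squared TD error over the sampled minibatch.
    return np.mean((targets - q_sa) ** 2)

# Toy minibatch of 3 transitions (the last one terminal).
rewards = np.array([1.0, 1.0, 0.0])
dones = np.array([0.0, 0.0, 1.0])
next_q = np.array([[0.5, 1.5],   # Q_target(s', .) for each transition
                   [2.0, 0.1],
                   [0.0, 0.0]])
y = td_targets(rewards, next_q, dones)
loss = mse_td_loss(np.zeros(3), y)
```

In the full pipeline these targets are computed with the frozen target network $\theta^-$ and the loss is minimized with Adam on $\theta$; only the feature layer differs from vanilla DQN.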
4. Empirical Evaluation and Performance Analysis
Chebyshev-DQN was empirically evaluated on the CartPole-v1 benchmark environment, with 4-dimensional state space and 2 actions. Three independent runs were conducted, each for 500–1000 episodes to ensure convergence. Performance was measured as average episode reward.
| Model | Chebyshev Degree | Final Avg. Reward | Performance Differential |
|---|---|---|---|
| Baseline DQN (MLP) | – | ≈ 250.5 | Reference |
| Ch-DQN | 4 | ≈ 347.9 | +39% |
| Ch-DQN | 8 | ≈ 144.0 | Collapse |
Ch-DQN with $K = 4$ exhibited both faster convergence and a higher performance plateau than baseline DQN, while an excessive degree ($K = 8$) resulted in reduced performance due to overfitting to high-frequency components in the TD targets (spectral bias). This suggests there is an optimal range for $K$, dependent on the complexity of the true value function.
5. Complexity–Performance Trade-Offs
The Chebyshev-DQN methodology involves several key trade-offs in model design:
- Approximation Error: Projection onto Chebyshev bases lowers the minimal achievable error compared to unstructured bases.
- Numerical Conditioning: Orthogonality of the polynomial features ameliorates destructive gradient interference, enhancing update stability and yielding lower performance variance. On MountainCar-v0, Ch-DQN exhibited post-training reward standard deviation σ≈3, contrasting with σ≈18 for the baseline.
- Spectral Bias and Overfitting: Increasing $K$ broadens the frequency content of the approximation, enabling expressive fits but risking overfitting, particularly when the underlying value function is smooth or represented by low-frequency polynomials. Hence, there exists a "sweet spot" in the selection of $K$, balancing underfitting against sensitivity to target noise.
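The degree trade-off can be illustrated outside the RL loop with a supervised analogy: fitting noisy samples of a smooth function by Chebyshev least squares at several degrees. This is a toy stand-in for the value-function fit, not the paper's experiment:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 40)

def f(x):
    # Smooth stand-in for a low-frequency value function.
    return np.sin(2 * x)

# Training targets carry TD-target-like noise.
y_train = f(x_train) + rng.normal(0.0, 0.15, x_train.size)
x_test = np.linspace(-1, 1, 400)

errors = {}
for deg in (2, 4, 16):
    coefs = C.chebfit(x_train, y_train, deg)   # least-squares Chebyshev fit
    errors[deg] = np.max(np.abs(C.chebval(x_test, coefs) - f(x_test)))
```

A degree too low underfits (it cannot represent the cubic-and-higher content of `sin(2x)`), a moderate degree tracks the smooth target well, and high degrees gain capacity that can chase the noise, mirroring the $K = 4$ versus $K = 8$ behavior reported above.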
6. Limitations and Guiding Principles for Application
Performance of Ch-DQN is highly sensitive to the Chebyshev polynomial degree $K$. On tasks such as CartPole-v1, where the value function is known to be low-frequency, a moderate $K$ yields robust benefits, but an excessive $K$ can destabilize training. The architecture is not universally superior; its efficacy depends on matching feature complexity with the intrinsic complexity of the task. The broader principle is that incorporating orthogonal polynomial features is particularly beneficial in settings where the value function has a smooth, low-frequency structure, but careful hyperparameter tuning remains critical (Yazdannik et al., 20 Aug 2025).