Chebyshev-DQN: Integrating Chebyshev Polynomials in DQN
- Chebyshev-DQN is a reinforcement learning variant that leverages Chebyshev polynomial bases to reduce representation bias and improve numerical conditioning in Q-learning.
- It introduces a Chebyshev feature layer that normalizes state inputs and maps them into polynomial features, enabling more stable and efficient network updates.
- Empirical evaluations on benchmarks like CartPole-v1 demonstrate that optimal Chebyshev degree selection boosts convergence and average rewards while preventing overfitting.
Chebyshev-DQN (Ch-DQN) is a variant of the Deep Q-Network (DQN) reinforcement learning algorithm that explicitly incorporates Chebyshev polynomial bases into the value function approximator. The approach is motivated by the superior approximation and stability properties of orthogonal polynomials, aiming to reduce representation bias and improve the numerical conditioning of Q-learning updates. Chebyshev-DQN has demonstrated improved asymptotic and stability performance on standard benchmarks such as CartPole-v1, contingent on appropriate hyperparameterization of the polynomial degree (Yazdannik et al., 20 Aug 2025).
1. Mathematical Foundation of Chebyshev-DQN
Chebyshev polynomials of the first kind, $T_n(x)$, constitute a set of orthogonal polynomials defined on the interval $[-1, 1]$. They are recursively defined by:

$$T_0(x) = 1, \quad T_1(x) = x, \quad T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x).$$

Alternatively, $T_n(x) = \cos(n \arccos x)$. Orthogonality is given by:

$$\int_{-1}^{1} \frac{T_m(x)\,T_n(x)}{\sqrt{1 - x^2}}\,dx = 0 \quad \text{for } m \neq n.$$

Functional approximation with Chebyshev polynomials is motivated by their near-minimax property: the truncated Chebyshev expansion of degree $N$ comes close to minimizing the maximum absolute deviation among all degree-$N$ polynomials for a given continuous function on $[-1, 1]$. These traits, orthogonality and near-minimax error, reduce the worst-case approximation error and improve the conditioning of semi-gradient steps in Q-learning.
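Both definitions above can be checked numerically; a small sketch verifying that the recurrence and the trigonometric closed form agree:

```python
import numpy as np

def cheb_recursive(x, n):
    """T_n(x) via the recurrence T_{n+1} = 2x T_n - T_{n-1}."""
    t_prev, t_curr = np.ones_like(x), x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

x = np.linspace(-1, 1, 101)
for n in range(6):
    # Closed form T_n(x) = cos(n * arccos(x)) agrees with the recurrence.
    assert np.allclose(cheb_recursive(x, n), np.cos(n * np.arccos(x)), atol=1e-10)
```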
2. Network Architecture Modifications
The Ch-DQN modifies the standard DQN architecture to introduce a non-parametric Chebyshev feature mapping as follows:
- Input Normalization: Each state feature $s_i$ is linearly rescaled to $\tilde{s}_i \in [-1, 1]$ to match the definition domain of $T_n$.
- Chebyshev Feature Layer: For a normalized state $\tilde{s} \in [-1, 1]^d$, compute $T_0(\tilde{s}_i), T_1(\tilde{s}_i), \dots, T_K(\tilde{s}_i)$ for each feature. The resulting vector $\phi(s)$ has dimension $d(K+1)$.
- Q-Value Mapping: Chebyshev features are input to a small feed-forward network (typically 1–2 ReLU MLP layers), yielding $|\mathcal{A}|$ outputs (one per action).
When using a linear output layer,

$$Q(s, a) = w_a^{\top} \phi(s) + b_a,$$

where $w_a \in \mathbb{R}^{d(K+1)}$, $b_a \in \mathbb{R}$.
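The feature construction above can be sketched in NumPy. The CartPole-style bounds (`low`, `high`) and the random linear Q head are illustrative assumptions, not values from the paper:

```python
import numpy as np

class ChebyshevFeatures:
    """Maps a raw state to Chebyshev features T_0..T_K per dimension.

    `low`/`high` are per-feature bounds used for the [-1, 1] rescaling;
    the specific bounds below are assumptions for illustration.
    """
    def __init__(self, low, high, degree):
        self.low, self.high = np.asarray(low, float), np.asarray(high, float)
        self.degree = degree

    def __call__(self, s):
        # Linearly rescale each feature into [-1, 1], clipped for safety.
        x = np.clip(2 * (s - self.low) / (self.high - self.low) - 1, -1, 1)
        feats = [np.ones_like(x), x]          # T_0, T_1
        for _ in range(self.degree - 1):      # recurrence up to T_K
            feats.append(2 * x * feats[-1] - feats[-2])
        return np.concatenate(feats[: self.degree + 1])  # shape: d*(K+1)

phi = ChebyshevFeatures(low=[-2.4, -3.0, -0.21, -3.0],
                        high=[2.4, 3.0, 0.21, 3.0], degree=4)
features = phi(np.zeros(4))                   # 4 features x (4+1) = 20 dims

# Linear Q head: one row of weights per action (2 actions in CartPole).
rng = np.random.default_rng(0)
W, b = 0.01 * rng.normal(size=(2, features.size)), np.zeros(2)
q_values = W @ features + b
```

A deeper MLP head would replace the final two lines; the feature layer itself has no trainable parameters.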
3. Training Procedure and Loss Formulation
Chebyshev-DQN follows the canonical DQN training pipeline, including experience replay, target network, and ε-greedy action selection. The specific updates are:
- Temporal Difference (TD) Target: For each sampled tuple $(s, a, r, s')$:

$$y = r + \gamma \max_{a'} Q_{\theta^-}(s', a'),$$

where $\theta^-$ denotes the target-network parameters.
- Loss Function: Minimize the mean squared TD error over minibatches drawn from the replay buffer:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\left(y - Q_\theta(s, a)\right)^2\right].$$
- Unchanged Modules: TD backup rule, target network updates, and replay buffer use remain as in DQN. The only architectural change is the Chebyshev feature layer.
Key hyperparameters include: the Adam optimizer and its learning rate, the discount factor $\gamma$, replay buffer size 50,000, batch size 64, target update every 500 steps, and the Chebyshev degree $K$, selected by cross-validation.
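The TD-target and loss computations above can be sketched for a minibatch, assuming the standard DQN backup (the `gamma=0.99` default here is illustrative; the source does not state the exact value):

```python
import numpy as np

def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), zeroed at terminal states.

    gamma=0.99 is an illustrative default, not a value from the paper.
    """
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

def mse_td_loss(q_sa, targets):
    # Mean squared TD error over the sampled minibatch.
    return np.mean((targets - q_sa) ** 2)

# Toy minibatch of 3 transitions (the last one terminal).
rewards = np.array([1.0, 1.0, 0.0])
dones = np.array([0.0, 0.0, 1.0])
next_q = np.array([[0.5, 1.5],   # Q_target(s', .) for each transition
                   [2.0, 0.1],
                   [0.0, 0.0]])
y = td_targets(rewards, next_q, dones)
loss = mse_td_loss(np.zeros(3), y)
```

In the full pipeline these targets are computed with the frozen target network $\theta^-$ and the loss is minimized with Adam on $\theta$; only the feature layer differs from vanilla DQN.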
4. Empirical Evaluation and Performance Analysis
Chebyshev-DQN was empirically evaluated on the CartPole-v1 benchmark environment, with 4-dimensional state space and 2 actions. Three independent runs were conducted, each for 500–1000 episodes to ensure convergence. Performance was measured as average episode reward.
| Model | Chebyshev Degree | Final Avg. Reward | Performance Differential |
|---|---|---|---|
| Baseline DQN (MLP) | – | ≈ 250.5 | Reference |
| Ch-DQN | 4 | ≈ 347.9 | +39% |
| Ch-DQN | 8 | ≈ 144.0 | Collapse |
Ch-DQN with $K = 4$ exhibited both faster convergence and a higher performance plateau than baseline DQN, while an excessive degree ($K = 8$) resulted in reduced performance due to overfitting to high-frequency components in the TD targets (spectral bias). This suggests there is an optimal range for $K$, dependent on the complexity of the true value function.
5. Complexity–Performance Trade-Offs
The Chebyshev-DQN methodology involves several key trade-offs in model design:
- Approximation Error: Projection onto Chebyshev bases lowers the minimal achievable error compared to unstructured bases.
- Numerical Conditioning: Orthogonality of the polynomial features ameliorates destructive gradient interference, enhancing update stability and yielding lower performance variance. On MountainCar-v0, Ch-DQN exhibited post-training reward standard deviation σ≈3, contrasting with σ≈18 for the baseline.
- Spectral Bias and Overfitting: Increasing $K$ broadens the frequency content of the approximation, enabling expressive fits but risking overfitting, particularly when the underlying value function is smooth or represented by low-frequency polynomials. Hence, there exists a "sweet spot" in the selection of $K$, balancing underfitting against sensitivity to target noise.
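The degree trade-off can be illustrated outside the RL loop with a supervised analogy: fitting noisy samples of a smooth function by Chebyshev least squares at several degrees. This is a toy stand-in for the value-function fit, not the paper's experiment:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 40)

def f(x):
    # Smooth stand-in for a low-frequency value function.
    return np.sin(2 * x)

# Training targets carry TD-target-like noise.
y_train = f(x_train) + rng.normal(0.0, 0.15, x_train.size)
x_test = np.linspace(-1, 1, 400)

errors = {}
for deg in (2, 4, 16):
    coefs = C.chebfit(x_train, y_train, deg)   # least-squares Chebyshev fit
    errors[deg] = np.max(np.abs(C.chebval(x_test, coefs) - f(x_test)))
```

A degree too low underfits (it cannot represent the cubic-and-higher content of `sin(2x)`), a moderate degree tracks the smooth target well, and high degrees gain capacity that can chase the noise, mirroring the $K = 4$ versus $K = 8$ behavior reported above.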
6. Limitations and Guiding Principles for Application
Performance of Ch-DQN is highly sensitive to the Chebyshev polynomial degree $K$. On tasks such as CartPole-v1, where the value function is known to be low-frequency, a moderate $K$ yields robust benefits, but an excessive $K$ can destabilize training. The architecture is not universally superior; its efficacy depends on matching feature complexity with the intrinsic complexity of the task. The broader principle is that incorporating orthogonal polynomial features is particularly beneficial in settings where the value function has a smooth, low-frequency structure, but careful hyperparameter tuning remains critical (Yazdannik et al., 20 Aug 2025).