CartPole Benchmark: Evaluating RL Methods
- CartPole is a benchmark simulating an inverted pendulum on a moving cart to evaluate reinforcement learning and control strategies in a nonlinear, underactuated system.
- It has been utilized to benchmark a range of methodologies—including classical, deep, fuzzy, quantum, and physics-informed approaches—highlighting differences in sample efficiency and policy interpretability.
- Extensions such as partial observability, swing-up tasks, and risk-sensitive adaptations provide practical insights into real-world control challenges and algorithm robustness.
The CartPole benchmark is a canonical platform for the study and evaluation of reinforcement learning (RL) and control strategies in nonlinear, underactuated dynamical systems. Originally formulated as the problem of balancing an inverted pendulum mounted on a moving cart, the CartPole system encapsulates key challenges in sequential decision-making, control under uncertainty, and function approximation. Over several decades, it has served as a principal testbed for classical, neural, fuzzy, quantum, and hybrid RL algorithms, undergoing numerous extensions both in simulation and real-world deployments.
1. CartPole System: Dynamics and Environment Definitions
The standard CartPole environment is defined by a four-dimensional continuous state space $(x, \dot{x}, \theta, \dot{\theta})$, where $x$ is the horizontal cart position, $\dot{x}$ its velocity, $\theta$ the pole angle from vertical (in radians), and $\dot{\theta}$ the angular velocity. The system is controlled via a discrete or continuous horizontal force applied to the cart, realized either as a binary action (push left/right with force $\pm F$) or as a continuous action $a \in [-F_{\max}, F_{\max}]$, depending on the benchmark variant (Lange et al., 16 Nov 2025, Kumar, 2020, Kim et al., 3 Aug 2025, Duan et al., 2016). System dynamics are governed by nonlinear equations capturing the coupled evolution of cart and pole under gravity and the applied force.
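For reference, a commonly used frictionless form of these equations of motion (the classic Barto–Sutton–Anderson parameterization, also used in the Gym/Gymnasium implementation; the cited papers may adopt slightly different conventions) is

$$
\ddot{\theta} \;=\; \frac{g\sin\theta \;-\; \cos\theta\,\dfrac{F + m_p\,l\,\dot{\theta}^{2}\sin\theta}{m_c + m_p}}{l\left(\dfrac{4}{3} \;-\; \dfrac{m_p\cos^{2}\theta}{m_c + m_p}\right)},
\qquad
\ddot{x} \;=\; \frac{F + m_p\,l\left(\dot{\theta}^{2}\sin\theta - \ddot{\theta}\cos\theta\right)}{m_c + m_p},
$$

where $m_c$ and $m_p$ are the cart and pole masses, $l$ is the pole half-length, $g$ is gravitational acceleration, and $F$ is the applied horizontal force.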
The instantaneous reward is typically +1 per time step until termination; an episode terminates when state bounds are violated (e.g., $|x| > 2.4$ m or $|\theta| > 12^{\circ}$), or after a maximum number of steps (200 or 500, depending on the version) (Kumar, 2020). The commonly used "solved" criterion for CartPole-v0 is a mean return ≥195 over 100 consecutive episodes. State observations may be direct (the full state vector), partial (e.g., angle plus angular velocity), or derived from high-dimensional images in perception-based variants (Xu et al., 2021).
Extensions include swing-up tasks (maximizing $\cos\theta$, i.e., time spent near the upright position), reward shaping for exploration, and stochastic or partially observed versions for robustness studies (Duan et al., 2016, Lange et al., 16 Nov 2025, Xu et al., 2021).
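To make the episode structure above concrete, a minimal interaction loop against the Gymnasium implementation might look like the following sketch (the random placeholder policy and the 100-episode averaging are illustrative, not taken from the cited papers):

```python
import numpy as np
import gymnasium as gym

# CartPole-v1: +1 reward per step; terminates when |x| > 2.4 m or |theta| > 12 degrees,
# truncates after 500 steps (CartPole-v0 truncates after 200).
env = gym.make("CartPole-v1")

returns = []
for episode in range(100):
    obs, info = env.reset(seed=episode)     # obs = [x, x_dot, theta, theta_dot]
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # placeholder policy: random left/right push
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    returns.append(total_reward)

# Classic "solved" criterion for CartPole-v0: mean return >= 195 over 100 consecutive episodes.
print(f"Mean return over 100 episodes: {np.mean(returns):.1f}")
env.close()
```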
2. Algorithmic Methodologies: Classical, Deep, Fuzzy, and Quantum
A wide array of RL and control algorithms have been developed and benchmarked on CartPole:
- Tabular and Linear RL: Early approaches utilize discretization of the state space for tabular Q-learning, or linear function approximation for value/policy functions, yielding fast convergence but limited scalability (Araújo et al., 2020, Nagendra et al., 2018).
- Deep Q-Networks (DQN): Multi-layer perceptrons approximate action-values, enhanced by experience replay and target networks. Algorithmic advances include Double DQN, dueling architectures, and prioritized experience replay (PER); PER accelerates learning, enabling solutions in ≈50 episodes on CartPole-v0 and outperforming vanilla DQN and tabular Q-learning (Kumar, 2020) (a schematic replay-buffer sketch appears after this list). Chebyshev-DQN augments DQN with orthogonal polynomial features, achieving a 39% improvement in asymptotic reward over a baseline MLP DQN at the optimal polynomial degree (N=4), but overfitting at higher degrees (N=8) (Yazdannik et al., 20 Aug 2025).
- On-Policy Actor-Critic and Policy Optimization: Approaches such as PPO, TRPO, TNPG, and SAC dominate continuous and high-dimensional benchmarks. PPO, in particular, stabilizes neuro-fuzzy controllers (ANFIS) by clipping policy updates (the clipped surrogate objective is written out after this list), yielding rapid, robust convergence; PPO-trained ANFIS agents achieve perfect scores (500) within 20,000 updates, compared to >100,000 for ANFIS-DQN, with near-zero post-convergence variance (Shankar et al., 22 Jun 2025). TRPO delivers the highest returns on both balancing and swing-up, especially under partial observability (Duan et al., 2016).
- Adaptive and Interpretable RL: Adaptive Q-learning (AQL, SPAQL, SPAQL-TS) employs online, partition-based value estimation. SPAQL-TS, in particular, solves CartPole-v0 in ≈175 episodes, surpassing TRPO (200–250), producing explicit, interpretable policy partitions over the normalized state-action space (Araújo et al., 2020).
- Model-Based and Batch RL: NFQ2.0, an evolution of Neural Fitted Q Iteration, adopts large batch updates (batch sizes ≥2048), full dataset replay, and stringent reward shaping. On real hardware, it solves swing-up and balancing in ≈70–120 episodes, achieving low angular deviation and high robustness across runs (Lange et al., 16 Nov 2025).
- Quantum and Hybrid RL: Continuous-variable photonic circuits and variational quantum circuits (VQC) have been instantiated as CartPole policies. Both photonic PPO agents and model-based quantum RL agents, using quantum state encodings and gradient-free optimization, can attain competitive or even faster convergence versus classical NNs with the same parameter count, although quantum models are currently less data-efficient (Nagy et al., 2021, Eisenmann et al., 14 Apr 2024).
- Physics-Informed Neural Methods: Physics-informed policy iteration (PINN-PI) solves the Hamilton–Jacobi–Bellman equation for stochastic CartPole, yielding provably exponential convergence to the value function and outperforming model-free SAC under process noise, with explicit error bounds on value gradient to policy mapping (Kim et al., 3 Aug 2025).
- Risk-Sensitive and Robust Control: Conditional Value-at-Risk (CVaR) objectives and robust synthesis have been benchmarked, particularly in the presence of noise and model uncertainty (a standard CVaR formulation is given after this list). Adaptive risk-sensitive MPC (ARS-MPC) consistently outperforms distributional RL baselines under high process noise (Wang et al., 2020). Coupling robust control analysis with RL highlights regimes where classical sensitivity bounds predict RL sample complexity (Xu et al., 2021).
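As referenced in the DQN bullet above, the sketch below illustrates proportional prioritized experience replay in its generic form; the class and hyperparameter names are illustrative and not taken from (Kumar, 2020).

```python
import numpy as np


class PrioritizedReplayBuffer:
    """Proportional prioritized replay: transition i is sampled with P(i) ∝ p_i**alpha."""

    def __init__(self, capacity=10_000, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.buffer = []
        self.priorities = np.zeros(capacity)
        self.pos = 0

    def add(self, transition):
        # New transitions receive the current maximum priority so they are replayed at least once.
        max_prio = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        prios = self.priorities[: len(self.buffer)]
        probs = prios ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the bias introduced by non-uniform sampling.
        weights = (len(self.buffer) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        # Priority is the absolute TD error plus a small constant to keep it strictly positive.
        self.priorities[idx] = np.abs(td_errors) + self.eps
```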
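The PPO clipping referenced above is the standard clipped surrogate objective, written here in its usual general form rather than transcribed from (Shankar et al., 22 Jun 2025):

$$
L^{\mathrm{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;\; \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
$$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ (typically 0.1–0.2) bounds the per-update policy change; this bounded update is what stabilizes the ANFIS actor.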
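For the risk-sensitive objectives in the last bullet, a standard definition of Conditional Value-at-Risk for a cost random variable $Z$ is the Rockafellar–Uryasev form (given here as background, not transcribed from (Wang et al., 2020)):

$$
\mathrm{CVaR}_\alpha(Z) \;=\; \min_{\nu \in \mathbb{R}} \left\{ \nu \;+\; \frac{1}{\alpha}\,\mathbb{E}\big[(Z - \nu)_+\big] \right\},
$$

where $\alpha \in (0,1)$ is the tail fraction, so that $\mathrm{CVaR}_\alpha(Z)$ is the expected cost over the worst $\alpha$-fraction of outcomes; risk-sensitive controllers optimize this tail statistic rather than the mean return.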
3. Benchmark Variants and Evaluation Protocols
The CartPole benchmark has evolved, with variants designed to assess data-efficiency, robustness, and policy interpretability:
- Partial Observability: Sensing can be partially occluded (e.g., only one pole marker visible), degraded (noisy depth/RGB images), or delayed. These settings allow explicit control over intrinsic task difficulty, as characterized by H-infinity sensitivity bounds (Xu et al., 2021, Duan et al., 2016).
- Real-World Realizations: NFQ2.0 demonstrates direct deployment on hardware platforms with latency/jitter compensation and open-loop exploration, yielding stable high-fidelity control within bounded wall-clock time (Lange et al., 16 Nov 2025).
- Perception-Based Regulation: CNN regression from depth or RGB images has been integrated with both RL and robust control pipelines to quantify the effect of sensor modality on sample complexity and maximum reward (Xu et al., 2021).
- Swing-Up Tasks: Harder than pure balancing; the reward is a function of the pole's cosine, with penalties applied only outside a wide admissible region (an illustrative reward form is given after this list) (Duan et al., 2016).
- Quantum Analogs: Both classical and quantum CartPoles have been proposed, including nonlinear quantum cartpole variants defined by different Hamiltonians, with state estimation under measurement backaction and transfer learning between classical surrogates and quantum agents (Meinerz et al., 2023, Wang et al., 2019).
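An illustrative swing-up reward of the kind referenced above (the exact shaping terms and bound penalties vary between implementations; the coefficient $\lambda$ and bound $x_{\max}$ below are placeholders, not values from (Duan et al., 2016)) is

$$
r_t \;=\; \cos\theta_t \;-\; \lambda\,\mathbb{1}\{|x_t| > x_{\max}\},
$$

which pays up to +1 per step when the pole is near upright, is negative when it hangs down, and penalizes (or terminates) episodes in which the cart leaves its allowed range.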
4. Quantitative Benchmark Comparisons
Representative performance and convergence metrics for key methods on CartPole-v0/v1 (discrete actions; maximum episode lengths of 200 and 500 steps, respectively):
| Algorithm/Architecture | Training to Solve (return ≥195) | Final Avg. Reward | Notable Properties |
|---|---|---|---|
| DQN + PER | ≈50 episodes | ~200 (v0) | Fastest among deep Q methods |
| Chebyshev-DQN (N=4) | ≈30 episodes | 347.9 | 39% above baseline DQN on v1 |
| PPO-ANFIS | ~20,000 updates | 500 ± 0 | Near-zero final variance |
| NFQ2.0 (batch RL) | 70–120 episodes | — | Real hardware, low run-to-run variance |
| SPAQL-TS | ≈175 episodes | 198.5 ± 1.6 | Interpretable, sample-efficient |
| TRPO (balancing) | ≈200–250 episodes | 197.3 ± 3.2 | High return, less interpretable |
| Classical LQR | — | Near-perfect within the linearized (LQR) regime | Requires known dynamics; not an RL method |
| PINN-PI (stochastic) | — | 200 ± 5 (stochastic return) | Provable exponential convergence |
Additional findings:
- Policy expressiveness (e.g., polynomial basis expansion, fuzzy rules, variational circuits) yields significant data-efficiency improvements, but can degrade with over-parameterization (e.g., Chebyshev-DQN N=8) (Yazdannik et al., 20 Aug 2025).
- Prioritized replay, entropy regularization, and reward shaping remain critical for rapid, stable convergence.
- Classical robust control theory predicts and bounds RL performance as environmental uncertainty increases (Xu et al., 2021).
- Transparency: SPAQL-TS and ANFIS-PPO provide explicit rule/partition-based policies, offering interpretability not available in neural black-box actors (Shankar et al., 22 Jun 2025, Araújo et al., 2020).
5. Practical Guidelines and Limitations
Empirical and ablation studies across CartPole research converge on a set of practical recommendations:
- Use batch sizes ≥1024, modern MLPs (e.g., 256×256×100), and Adam optimizer for stability (Lange et al., 16 Nov 2025).
- Reward shaping aids exploration, while history stacking (n ≥ 2; n = 6 is typical) compensates for sensing and actuation latency (see the observation-stacking sketch after this list) (Lange et al., 16 Nov 2025).
- In low-dimensional benchmarks, polynomial and partition-based policies excel; in higher-dimensional or stochastic settings, MLPs and PINN-based methods dominate (Yazdannik et al., 20 Aug 2025, Kim et al., 3 Aug 2025).
- Over-parameterization, excessive replay, or a lack of entropy regularization can cause performance collapse or oscillations (Yazdannik et al., 20 Aug 2025, Kumar, 2020).
- Interpretable controllers may be favored in domains where transparency and explainability are crucial (Araújo et al., 2020, Shankar et al., 22 Jun 2025).
- No single architecture is uniformly optimal—task structure, observation regime, and hardware constraints drive method selection.
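As mentioned in the history-stacking guideline above, a minimal observation-stacking wrapper might look like the sketch below; it is a generic Gymnasium-style wrapper written for illustration, not a reproduction of the NFQ2.0 preprocessing pipeline.

```python
from collections import deque

import numpy as np
import gymnasium as gym


class HistoryStack(gym.ObservationWrapper):
    """Concatenate the last n observations so the policy can infer velocities and
    compensate for sensing/actuation latency (n = 6 is a typical choice)."""

    def __init__(self, env, n=6):
        super().__init__(env)
        self.n = n
        self.frames = deque(maxlen=n)
        low = np.tile(env.observation_space.low, n)
        high = np.tile(env.observation_space.high, n)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.frames.clear()
        for _ in range(self.n - 1):          # pad the history with the initial observation
            self.frames.append(obs)
        return self.observation(obs), info

    def observation(self, obs):
        self.frames.append(obs)
        return np.concatenate(self.frames).astype(np.float32)


# Usage: env = HistoryStack(gym.make("CartPole-v1"), n=6)
```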
A recognized limitation is that CartPole's simple reward structure allows suboptimal or brittle policies to attain high average reward; robustness, generalization, and behavior under nonlinear or delayed observations must therefore be tested systematically to distinguish genuinely competitive algorithms (Duan et al., 2016, Xu et al., 2021). Quantum approaches, while promising, are presently less data-efficient than classical neural models but serve as proofs of concept for near-term quantum RL (Nagy et al., 2021, Eisenmann et al., 14 Apr 2024).
6. Extensions and Research Directions
Active research on CartPole variants explores:
- Quantum RL: Development of hybrid and native quantum circuits for control, backaction-aware RL, and quantum–classical transfer learning (Meinerz et al., 2023, Wang et al., 2019, Nagy et al., 2021, Eisenmann et al., 14 Apr 2024).
- Risk-Sensitive and Distributional RL: Incorporation of CVaR and other risk measures for tail control and performance under rare, adverse disturbances (Wang et al., 2020).
- Physics-Informed, Model-Based, and Offline RL: PINN approaches, model-based policy search (classical and quantum), and offline RL algorithms for data efficiency and theoretical convergence guarantees (Kim et al., 3 Aug 2025, Eisenmann et al., 14 Apr 2024).
- Real-System Deployment and Industrialization: NFQ2.0 and related pipelines adapt RL for reproducibility and reliability on real actuators/sensor stacks, enabling practical deployments outside simulation (Lange et al., 16 Nov 2025).
- Theory–Empirical Synthesis: Integration of robust control metrics (e.g., H-infinity norms, system zeros) to predict and bound RL behavior, especially in perception-driven and partially observed settings (Xu et al., 2021).
As a result, CartPole maintains its role as both a pedagogic baseline and a proving ground for state-of-the-art methods, spanning explainable RL, quantum-enhanced policies, and robust real-time control.