CartPole Benchmark: Evaluating RL Methods

Updated 23 November 2025
  • CartPole is a benchmark simulating an inverted pendulum on a moving cart to evaluate reinforcement learning and control strategies in a nonlinear, underactuated system.
  • It has been utilized to benchmark a range of methodologies—including classical, deep, fuzzy, quantum, and physics-informed approaches—highlighting differences in sample efficiency and policy interpretability.
  • Extensions such as partial observability, swing-up tasks, and risk-sensitive adaptations provide practical insights into real-world control challenges and algorithm robustness.

The CartPole benchmark is a canonical platform for the study and evaluation of reinforcement learning (RL) and control strategies in nonlinear, underactuated dynamical systems. Originally formulated as the problem of balancing an inverted pendulum mounted on a moving cart, the CartPole system encapsulates key challenges in sequential decision-making, control under uncertainty, and function approximation. Over several decades, it has served as a principal testbed for classical, neural, fuzzy, quantum, and hybrid RL algorithms, undergoing numerous extensions both in simulation and real-world deployments.

1. CartPole System: Dynamics and Environment Definitions

The standard CartPole environment is defined by a four-dimensional continuous state space $s = (x, \dot x, \theta, \dot\theta)$, where $x$ is the horizontal cart position, $\dot x$ its velocity, $\theta$ the pole angle from vertical (in radians), and $\dot\theta$ the angular velocity. The system is controlled via either a discrete or continuous horizontal force applied to the cart, realized as actions $a \in \{0, 1\}$ (left/right, force $\pm F$) or $a \in [-F_\text{max}, F_\text{max}]$, depending on the benchmark variant (Lange et al., 16 Nov 2025, Kumar, 2020, Kim et al., 3 Aug 2025, Duan et al., 2016). System dynamics are governed by nonlinear equations capturing the coupled evolution of cart and pole under gravity and applied force.
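
For concreteness, the sketch below implements the commonly used equations of motion in the Barto-Sutton-Anderson formulation adopted by Gym/Gymnasium. The parameter values (masses, half pole length, force magnitude, time step) are the widely used defaults and are stated here as assumptions rather than properties of every benchmark variant.

```python
import math

# Nominal parameters (assumed defaults of the common Gym/Gymnasium implementation).
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_POLE_LENGTH = 0.5   # distance from pivot to the pole's centre of mass (m)
FORCE_MAG = 10.0         # magnitude of the applied horizontal force (N)
DT = 0.02                # integration time step (s)


def cartpole_step(x, x_dot, theta, theta_dot, action):
    """One explicit-Euler step of the classic cart-pole dynamics."""
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    temp = (force + MASS_POLE * HALF_POLE_LENGTH * theta_dot**2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t**2 / TOTAL_MASS)
    )
    x_acc = temp - MASS_POLE * HALF_POLE_LENGTH * theta_acc * cos_t / TOTAL_MASS

    return (
        x + DT * x_dot,              # new cart position
        x_dot + DT * x_acc,          # new cart velocity
        theta + DT * theta_dot,      # new pole angle
        theta_dot + DT * theta_acc,  # new angular velocity
    )
```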

The instantaneous reward is typically $r_t = +1$ per time step until termination; an episode terminates upon violation of bounds (e.g., $|x| > 2.4$ m or $|\theta| > 12^\circ$), or after a maximum number of steps (200 or 500, depending on the version) (Kumar, 2020). The conventional "solved" criterion for CartPole-v0 is a mean return of at least 195 over 100 consecutive episodes. State observations may be direct (the full state vector), partial (e.g., angle plus angular velocity), or derived from high-dimensional images in perception-based variants (Xu et al., 2021).
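
As a concrete illustration of this episode structure, the minimal sketch below runs Gymnasium's CartPole-v1 with a random policy as a stand-in for a learned agent and applies the 195-over-100-episodes check; a random policy will not actually pass it, but the loop shows where the per-step reward, termination, and truncation enter.

```python
from collections import deque

import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")
returns = deque(maxlen=100)              # rolling window for the "solved" check

for episode in range(500):
    obs, info = env.reset(seed=episode)
    episode_return, done = 0.0, False
    while not done:
        action = env.action_space.sample()          # replace with a learned policy
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward                    # +1 per surviving time step
        done = terminated or truncated              # bound violation or step limit
    returns.append(episode_return)
    if len(returns) == 100 and np.mean(returns) >= 195.0:
        print(f"Solved at episode {episode}")
        break
```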

Extensions include swing-up tasks (maximizing $\cos(\theta)$), reward shaping for exploration, and stochastic or partially observed versions for robustness studies (Duan et al., 2016, Lange et al., 16 Nov 2025, Xu et al., 2021).
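
A partially observed variant can be reproduced with a thin observation wrapper. The sketch below is a hypothetical `PartialObservationWrapper` (not taken from the cited papers) that exposes only the pole angle and angular velocity, mirroring the partial-observation example mentioned above.

```python
import numpy as np
import gymnasium as gym


class PartialObservationWrapper(gym.ObservationWrapper):
    """Hide the cart's position and velocity, keeping only (theta, theta_dot)."""

    KEEP = [2, 3]  # indices of pole angle and pole angular velocity

    def __init__(self, env):
        super().__init__(env)
        low = env.observation_space.low[self.KEEP]
        high = env.observation_space.high[self.KEEP]
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float32)

    def observation(self, obs):
        return obs[self.KEEP].astype(np.float32)


env = PartialObservationWrapper(gym.make("CartPole-v1"))
obs, info = env.reset(seed=0)   # obs is now 2-dimensional
```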

2. Algorithmic Methodologies: Classical, Deep, Fuzzy, and Quantum

A wide array of RL and control algorithms has been developed and benchmarked on CartPole:

  1. Tabular and Linear RL: Early approaches utilize discretization of the state space for tabular Q-learning, or linear function approximation for value/policy functions, yielding fast convergence but limited scalability (Araújo et al., 2020, Nagendra et al., 2018). A minimal discretization-and-update sketch follows this list.
  2. Deep Q-Networks (DQN): Multi-layer perceptrons approximate action-values, enhanced by experience replay and target networks. Algorithmic advances include Double DQN, dueling architectures, and prioritized experience replay (PER); PER accelerates learning, enabling solutions in ≈50 episodes on CartPole-v0 and outperforming vanilla DQN and tabular Q-learning (Kumar, 2020). Chebyshev-DQN augments DQN with orthogonal polynomial features, achieving a 39% improvement in asymptotic reward over a baseline MLP DQN at the best-performing polynomial degree (N=4) while overfitting at higher degrees (N=8) (Yazdannik et al., 20 Aug 2025). A sketch of such a feature expansion follows this list.
  3. On-Policy Actor-Critic and Policy Optimization: Approaches such as PPO, TRPO, TNPG, and SAC are dominant for continuous or high-dimensional benchmarks. PPO, specifically, stabilizes neuro-fuzzy controllers (ANFIS) by clipping policy updates (the clipped surrogate objective is sketched after this list), resulting in rapid, robust convergence; PPO-trained ANFIS agents achieve perfect scores (500) within 20,000 updates, compared to >100,000 for ANFIS-DQN, with near-zero post-convergence variance (Shankar et al., 22 Jun 2025). TRPO delivers the highest returns on both balancing and swing-up, especially under partial observability (Duan et al., 2016).
  4. Adaptive and Interpretable RL: Adaptive Q-learning (AQL, SPAQL, SPAQL-TS) employs online, partition-based value estimation. SPAQL-TS, in particular, solves CartPole-v0 in ≈175 episodes, surpassing TRPO (200–250), producing explicit, interpretable policy partitions over the normalized state-action space (Araújo et al., 2020).
  5. Model-Based and Batch RL: NFQ2.0, an evolution of Neural Fitted Q Iteration, adopts large batch updates (batch sizes ≥2048), full dataset replay, and stringent reward shaping. On real hardware, it solves swing-up and balancing in ≈70–120 episodes, achieving low angular deviation and high robustness across runs (Lange et al., 16 Nov 2025).
  6. Quantum and Hybrid RL: Continuous-variable photonic circuits and variational quantum circuits (VQC) have been instantiated as CartPole policies. Both photonic PPO agents and model-based quantum RL agents, using quantum state encodings and gradient-free optimization, can attain competitive or even faster convergence versus classical NNs with the same parameter count, although quantum models are currently less data-efficient (Nagy et al., 2021, Eisenmann et al., 14 Apr 2024).
  7. Physics-Informed Neural Methods: Physics-informed policy iteration (PINN-PI) solves the Hamilton–Jacobi–Bellman equation for stochastic CartPole, yielding provably exponential convergence to the value function and outperforming model-free SAC under process noise, with explicit error bounds on value gradient to policy mapping (Kim et al., 3 Aug 2025).
  8. Risk-Sensitive and Robust Control: Conditional Value-at-Risk (CVaR) objectives and robust $H_\infty$ synthesis have been benchmarked, particularly in the presence of noise and model uncertainty. Adaptive risk-sensitive MPC (ARS-MPC) consistently outperforms distributional RL baselines under high process noise (Wang et al., 2020). Coupling robust control analysis with RL highlights regimes where classical sensitivity bounds predict RL sample complexity (Xu et al., 2021).
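
To make item 1 concrete, here is a minimal sketch of tabular Q-learning over a discretized CartPole state space. The bin counts and state bounds are illustrative assumptions, not values taken from the cited papers.

```python
import numpy as np

# Illustrative discretization (assumed): bins per state dimension and clipping bounds.
BINS = (6, 6, 12, 12)
LOW = np.array([-2.4, -3.0, -0.21, -3.0])
HIGH = np.array([2.4, 3.0, 0.21, 3.0])

q_table = np.zeros(BINS + (2,))          # one Q-value per (discrete state, action)


def discretize(state):
    """Map the continuous 4-D state onto a grid-cell index."""
    ratios = (np.clip(state, LOW, HIGH) - LOW) / (HIGH - LOW)
    return tuple((ratios * (np.array(BINS) - 1)).astype(int))


def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Standard tabular Q-learning backup on the discretized states."""
    s, s_next = discretize(state), discretize(next_state)
    td_target = reward + gamma * np.max(q_table[s_next])
    q_table[s + (action,)] += alpha * (td_target - q_table[s + (action,)])
```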
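
For item 2, the sketch below gives one plausible reading of "orthogonal polynomial features": each state variable is normalized to [-1, 1] and expanded with the Chebyshev recurrence before being fed to the Q-network. The exact construction used by Chebyshev-DQN may differ.

```python
import numpy as np

# Illustrative normalization bounds (assumed), matching the usual CartPole ranges.
LOW = np.array([-2.4, -3.0, -0.21, -3.0])
HIGH = np.array([2.4, 3.0, 0.21, 3.0])


def chebyshev_features(state, degree=4):
    """Expand each normalized state variable into Chebyshev polynomials T_0..T_degree."""
    z = np.clip(2.0 * (np.asarray(state) - LOW) / (HIGH - LOW) - 1.0, -1.0, 1.0)

    # Chebyshev recurrence: T_0 = 1, T_1 = z, T_{n+1} = 2 z T_n - T_{n-1}.
    feats = [np.ones_like(z), z]
    for _ in range(2, degree + 1):
        feats.append(2.0 * z * feats[-1] - feats[-2])
    return np.concatenate(feats)          # length 4 * (degree + 1)
```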
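
For item 3, PPO's stabilizing mechanism is the clipped surrogate objective. The generic PyTorch sketch below shows only the clipping step and is not the PPO-ANFIS authors' implementation.

```python
import torch


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and data-collecting policies; advantages: advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))         # maximize the surrogate
```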

3. Benchmark Variants and Evaluation Protocols

The CartPole benchmark has evolved, with variants designed to assess data-efficiency, robustness, and policy interpretability:

  • Partial Observability: Sensing can be partially occluded (e.g., only one pole marker visible), degraded (noisy depth/RGB images), or delayed. These settings allow explicit control over intrinsic task difficulty, as characterized by H-infinity sensitivity bounds (Xu et al., 2021, Duan et al., 2016).
  • Real-World Realizations: NFQ2.0 demonstrates direct deployment on hardware platforms with latency/jitter compensation and open-loop exploration, yielding stable high-fidelity control within bounded wall-clock time (Lange et al., 16 Nov 2025).
  • Perception-Based Regulation: CNN regression from depth or RGB images has been integrated with both RL and robust control pipelines to quantify the effect of sensor modality on sample complexity and maximum reward (Xu et al., 2021).
  • Swing-Up Tasks: More challenging than balancing; the reward is a function of the pole angle's cosine, so the agent must swing the pole up from a hanging position and then stabilize it over a much larger region of the state space (Duan et al., 2016).
  • Quantum Analogs: Both classical and quantum CartPoles have been proposed, including nonlinear quantum CartPole variants with potentials $V(x) = kx^2$, $V(x) = k_1[\cos(\pi x/k_2) - 1]$, or $V(x) = -kx^4$, with state estimation under measurement backaction and transfer learning between classical surrogates and quantum agents (Meinerz et al., 2023, Wang et al., 2019).

4. Quantitative Benchmark Comparisons

Representative performance and convergence metrics for key methods on CartPole-v0/v1 (discrete actions; episode caps of 200 and 500 steps, respectively):

| Algorithm/Architecture | Episodes to Solve (≥195) | Final Avg. Reward | Notable Properties |
|---|---|---|---|
| DQN + PER | ≈50 | ~200 (v0) | Fastest among deep Q methods |
| Chebyshev-DQN (N=4) | ≈30 | 347.9 | 39% above baseline DQN on v1 |
| PPO-ANFIS | ~20,000 updates | 500 ± 0 | Near-zero final variance |
| NFQ2.0 (batch RL) | 70–120 | — | Real hardware, low run variance |
| SPAQL-TS | ≈175 | 198.5 ± 1.6 | Interpretability, sample efficiency |
| TRPO (balancing) | ≈200–250 | 197.3 ± 3.2 | High but less interpretable |
| Classical LQR | — | Perfect in LQR zone | Requires dynamics, not RL |
| PINN-PI (stochastic) | — | 200 ± 5 (stochastic return) | Theoretical exponential convergence |

Additional findings:

  • Policy expressiveness (e.g., polynomial basis expansion, fuzzy rules, variational circuits) yields significant data-efficiency improvements, but can degrade with over-parameterization (e.g., Chebyshev-DQN N=8) (Yazdannik et al., 20 Aug 2025).
  • Prioritized replay, entropy regularization, and reward shaping remain critical for rapid, stable convergence.
  • Classical robust control theory predicts and bounds RL performance as environmental uncertainty increases (Xu et al., 2021).
  • Transparency: SPAQL-TS and PPO-ANFIS provide explicit rule- or partition-based policies, offering interpretability not available in neural black-box actors (Shankar et al., 22 Jun 2025, Araújo et al., 2020).

5. Practical Guidelines and Limitations

Empirical and ablation studies across CartPole research converge on consistent practical guidance: expressive but regularized function approximation, prioritized replay, and careful reward shaping accelerate convergence, and robustness should be verified under noise and partial observability before drawing conclusions.

A recognized limitation is that the simple CartPole reward structure allows suboptimal or brittle policies to attain high average reward, so robustness, generalizability, and behavior under nonlinear or delayed observations must be tested systematically to distinguish algorithmic competitiveness (Duan et al., 2016, Xu et al., 2021). Quantum approaches, while promising, are presently less data-efficient than classical neural models, but they serve as proofs of concept for near-term quantum RL (Nagy et al., 2021, Eisenmann et al., 14 Apr 2024).

6. Extensions and Research Directions

Active research on CartPole variants explores:

  • Quantum RL: Development of hybrid and native quantum circuits for control, backaction-aware RL, and quantum–classical transfer learning (Meinerz et al., 2023, Wang et al., 2019, Nagy et al., 2021, Eisenmann et al., 14 Apr 2024).
  • Risk-Sensitive and Distributional RL: Incorporation of CVaR and other risk measures for tail control and performance under rare, adverse disturbances (Wang et al., 2020).
  • Physics-Informed, Model-Based, and Offline RL: PINN approaches, model-based policy search (classical and quantum), and offline RL algorithms for data efficiency and theoretical convergence guarantees (Kim et al., 3 Aug 2025, Eisenmann et al., 14 Apr 2024).
  • Real-System Deployment and Industrialization: NFQ2.0 and related pipelines adapt RL for reproducibility and reliability on real actuators/sensor stacks, enabling practical deployments outside simulation (Lange et al., 16 Nov 2025).
  • Theory–Empirical Synthesis: Integration of robust control metrics (e.g., H-infinity norms, system zeros) to predict and bound RL behavior, especially in perception-driven and partially observed settings (Xu et al., 2021).

As a result, CartPole maintains its role as both a pedagogic baseline and a proving ground for state-of-the-art methods, spanning explainable RL, quantum-enhanced policies, and robust real-time control.
