
Recursive Backwards Q-Learning

Updated 26 February 2026
  • Recursive Backwards Q-Learning is a reinforcement learning algorithm that builds an explicit transition and reward model to enable direct optimal Q-value propagation in deterministic MDPs.
  • It employs a single backward sweep using breadth-first search to update Q-values, eliminating incremental temporal-difference bootstrapping and significantly speeding up convergence.
  • Empirical results in maze benchmarks show RBQL’s superior sample efficiency and rapid policy optimization compared to standard Q-learning.

Recursive Backwards Q-Learning (RBQL) is a reinforcement learning algorithm designed for finite, deterministic, episodic Markov Decision Processes (MDPs). RBQL departs from model-free approaches such as Q-learning by explicitly building a transition and reward model during exploration and, after each episode, performing a recursive backward value propagation that sets each Q(s,a) directly to its optimal value. This eliminates slow temporal-difference bootstrapping and dramatically accelerates convergence in deterministic domains, as demonstrated empirically in shortest-path maze benchmarks (Diekhoff et al., 2024).

1. Deterministic Environment Model and Contrast to Model-Free Q-Learning

RBQL is formulated for settings where:

  • S (state set) and A (action set) are finite;
  • Transitions are deterministic: for each pair (s,a), the next state s' and reward r are fixed;
  • Episodes begin in a prescribed initial state s_0 and terminate in one or more terminal states s_T, the sole sources of positive reward.
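As a concrete illustration of this setting, a deterministic episodic MDP of the kind RBQL assumes can be written down as two lookup tables. This is a hypothetical four-state chain; all names and values are illustrative, not from the paper:

```python
# Hypothetical four-state deterministic chain; state 3 is the sole terminal.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
# Deterministic transition table: (state, action) -> next state.
P = {(0, "right"): 1, (1, "right"): 2, (2, "right"): 3,
     (0, "left"): 0, (1, "left"): 0, (2, "left"): 1}
# Positive reward only on transitions that enter the terminal state.
R = {sa: (1.0 if s_next == 3 else 0.0) for sa, s_next in P.items()}
```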

Standard Q-learning, in contrast, does not construct or exploit a model of P or R, and updates Q(s,a) incrementally at each step: \begin{equation} Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\max_{a'\in A} Q(s',a') - Q(s,a)\right] \end{equation} This results in slow per-episode propagation of terminal rewards, especially in large deterministic environments. RBQL, instead, builds \widehat P (deterministic transitions) and \widehat R (rewards) as it explores. When a terminal state is reached, it backpropagates value estimates via model inversion and a single backward sweep, directly setting all Q(s,a) values for the explored set to their fixed-point optimal values.
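For contrast, the incremental temporal-difference update that RBQL avoids can be sketched in a few lines of tabular code. This is a minimal sketch; the function name and default parameters are assumptions, not from the paper:

```python
from collections import defaultdict

def td_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One standard Q-learning TD update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Q stored as a table defaulting to 0 for unseen (state, action) pairs.
Q = defaultdict(float)
```

Because each call moves Q(s,a) only a fraction alpha toward its target, a terminal reward travels backward by at most one state per episode along a given path, which is the slow propagation RBQL is designed to avoid.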

2. Formal Definitions and Update Equations

Let M = (S, A, P, R, \gamma) denote the true MDP. RBQL maintains estimates:

  • \widehat P(s,a) = s': the deterministic next state;
  • \widehat R(s,a) = r: the reward;

Upon reaching a terminal state s_T, RBQL assigns Q(s_T, a) = 0 for all a \in A. For any predecessor (s,a) with \widehat P(s,a) = s', the recursive backup equation is \begin{equation} Q(s,a) = \widehat R(s,a) + \gamma\max_{a'\in A} Q(\widehat P(s,a), a') \end{equation}

To propagate values, the algorithm inverts the transition model to construct the predecessor set \mathrm{Pred}(s') = \{(s,a) : \widehat P(s,a) = s'\}. A breadth-first search (BFS) commencing from s_T determines the update order, guaranteeing each Q(s,a) is set only once per episode and always after all of its downstream dependencies are finalized.
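The model inversion and BFS-ordered sweep can be sketched as follows. This is a minimal illustration assuming, as in the maze benchmarks, that positive reward arrives only on entering a terminal state and \gamma < 1, so the first BFS reach of a state fixes its optimal value; all identifiers are assumptions:

```python
from collections import deque

def backward_sweep(P_hat, R_hat, terminals, gamma=0.9):
    """One BFS backward sweep over an explored deterministic model.
    P_hat[(s, a)] = s' and R_hat[(s, a)] = r for every explored pair."""
    # Invert the transition model: pred[s'] = all (s, a) with P_hat[(s, a)] == s'.
    pred = {}
    for (s, a), s_next in P_hat.items():
        pred.setdefault(s_next, []).append((s, a))
    V = {t: 0.0 for t in terminals}  # terminals carry no future value
    Q = {}
    frontier = deque(terminals)
    while frontier:
        s_next = frontier.popleft()
        for (s, a) in pred.get(s_next, []):
            Q[(s, a)] = R_hat[(s, a)] + gamma * V[s_next]
            if s not in V:  # first (shortest) reach fixes V(s) under this reward structure
                V[s] = Q[(s, a)]
                frontier.append(s)
    return Q, V
```

On a two-step chain 0 → 1 → 2 with reward 1.0 on the final transition and \gamma = 0.9, a single sweep yields Q(1, a) = 1.0 and Q(0, a) = 0.9.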

3. Algorithmic Procedure

The central algorithmic workflow comprises two alternating phases: exploration/model-building and recursive backward value propagation. The following summarizes the key steps:

  1. Initialization: All model entries \widehat P(s,a) and \widehat R(s,a) are marked "unknown"; all Q(s,a) are initialized to 0.
  2. Exploration: For each episode, the agent explores uncharted (s,a) pairs or exploits greedily based on current Q values. Exploration scheduling leverages \epsilon-decay.
  3. Model Update: After executing action a in state s, the observed s' and r are recorded in \widehat P and \widehat R.
  4. Episode Termination: Once a terminal state s_T is reached, predecessors are computed by inverting \widehat P.
  5. Backward Sweep: A BFS order is established from s_T over all explored states/actions. For each explored (s,a), Q(s,a) is updated according to the recursive backup equation.
  6. Policy Improvement: The exploration flag and \epsilon are decayed to balance exploration and exploitation.

No temporal-difference (TD) bootstrapping is performed; each Q(s,a) is set to its global optimum with respect to the explored model after each episode.
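Putting steps 1–6 together, a compact sketch of the whole loop might look like this. Assumptions not found in the paper: a `step(s, a)` interface returning `(s_next, reward, done)`, a uniform terminal reward, and illustrative parameter defaults:

```python
import random
from collections import deque

def rbql(step, s0, actions, episodes=10, gamma=0.9, eps=1.0, eps_decay=0.9):
    """Sketch of the RBQL loop for a deterministic episodic environment.
    `step(s, a)` must return (s_next, reward, done); names are assumptions."""
    P, R, Q = {}, {}, {}
    for _ in range(episodes):
        s, done = s0, False
        while not done:
            unseen = [a for a in actions if (s, a) not in P]
            if unseen and random.random() < eps:
                a = random.choice(unseen)  # explore an uncharted (s, a) pair
            else:
                a = max(actions, key=lambda a2: Q.get((s, a2), 0.0))  # greedy
            s_next, r, done = step(s, a)
            P[(s, a)], R[(s, a)] = s_next, r  # record the model
            if done:
                terminal = s_next
            s = s_next
        # Backward sweep: invert the model and BFS outward from the terminal.
        pred = {}
        for (ps, pa), ns in P.items():
            pred.setdefault(ns, []).append((ps, pa))
        V = {terminal: 0.0}  # no future value beyond the terminal
        frontier = deque([terminal])
        while frontier:
            ns = frontier.popleft()
            for (ps, pa) in pred.get(ns, []):
                Q[(ps, pa)] = R[(ps, pa)] + gamma * V[ns]
                if ps not in V:  # first reach fixes V under this reward structure
                    V[ps] = Q[(ps, pa)]
                    frontier.append(ps)
        eps *= eps_decay  # decay exploration
    return Q
```

On a small chain environment, the optimal Q-values for every explored pair are already in place after the first terminal encounter, matching the convergence behavior described in Section 4.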

4. Theoretical Properties and Convergence

  • Model Optimality: The RBQL backward sweep computes the true optimal Q-values with respect to the currently explored model; each Q(s,a) is set exactly once per sweep, with no further adjustment needed until new (s,a) pairs are discovered.
  • Policy Optimality: Once all state–action pairs have been explored, RBQL reconstructs the globally optimal policy, as the single BFS backup is equivalent to dynamic-programming value iteration on the discovered model.
  • Sample Complexity: Each episode allows all explored states to be evaluated optimally. For a maze of shortest-path depth d, RBQL achieves solution in essentially one episode (plus exploration overhead), while standard Q-learning can require on the order of d episodes for equivalent credit propagation.
  • Computational Complexity: Each backward sweep is O(|S| \cdot |A|) over the explored subset.
  • Convergence Dynamics: Empirical evidence and theoretical argument demonstrate that the majority of learning occurs at the first terminal encounter; later episodes serve mainly to integrate newly discovered transitions or actions.

5. Empirical Evaluation: Maze Shortest-Path Task

Performance was measured on ensembles of random mazes at three grid sizes (small, medium, and large), with each agent run for 25 episodes per maze (averaged over 50 random mazes). The principal metrics are average steps per episode and episode-over-episode improvement factors.

Table 1: Ratio of average steps (Q-Learning vs. RBQL) in episode 0 and episode 24.

Grid   | QL steps (E0) | RBQL steps (E0) | Ratio (E0) | QL steps (E24) | RBQL steps (E24) | Ratio (E24)
Small  | 278.06        | 191.84          | 1.45       | 49.14          | 9.62             | 5.11
Medium | 3308.46       | 843.52          | 3.92       | 281.44         | 23.68            | 11.89
Large  | 7180.98       | 1965.00         | 3.65       | 778.68         | 35.96            | 21.65

Table 2: Step count reduction factors from episode 0 to 24.

Grid   | QL steps (E0) | QL steps (E24) | QL Factor | RBQL steps (E0) | RBQL steps (E24) | RBQL Factor
Small  | 278.06        | 49.14          | 5.66      | 191.84          | 9.62             | 19.94
Medium | 3308.46       | 281.44         | 11.76     | 843.52          | 23.68            | 35.62
Large  | 7180.98       | 778.68         | 9.22      | 1965.00         | 35.96            | 90.76

For the largest grid, after 6 episodes RBQL nearly attains optimal step counts (close to the shortest-path lower bound), whereas Q-learning still averages hundreds of steps after 25 episodes.

6. Practical Considerations: Advantages, Limitations, and Extensions

Advantages

  • Accelerated value propagation: terminal reward is propagated instantly via a single backward sweep through the explored deterministic chain.
  • All discovered Q(s,a) values are set optimally in each episode.
  • Exploration focuses on as-yet-unseen actions, with greedy exploitation elsewhere.
  • Requires orders of magnitude fewer episodes and steps to converge compared to standard Q-learning.

Limitations

  • Applicability is restricted to deterministic transitions; for stochastic environments, direct adoption is not valid.
  • RBQL targets episodic MDPs with finite horizons and clear terminal states; continuous or infinite-horizon tasks are unsupported.
  • Requires storage of the full model (\widehat P, \widehat R) and incurs BFS computation per episode.

Potential Extensions

  • Extension to stochastic MDPs by estimating transition probabilities \widehat P(s' \mid s,a) and generalizing the backup equation:

\begin{equation} Q(s,a) = \widehat R(s,a) + \gamma \sum_{s'} \widehat P(s' \mid s,a) \max_{a'\in A} Q(s',a') \end{equation}

  • Model compression via state aggregation to mitigate memory and computation requirements.
  • Adaptation to multiple terminal states using either a dummy sink state or simultaneous backpropagation from all terminals.
  • Hybridization with function approximation (e.g., "Deep RBQL") to address large or continuous state spaces.
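The generalized backup from the first extension above can be sketched using empirical visit counts as the probability estimate \widehat P(s' | s, a). All names and the interface are assumptions, not from the paper:

```python
def stochastic_backup(Q, s, a, P_counts, R_hat, actions, gamma=0.9):
    """Hypothetical stochastic generalization of the RBQL backup:
    Q(s,a) = R_hat(s,a) + gamma * sum_s' P_hat(s'|s,a) * max_a' Q(s',a'),
    where P_hat is estimated from visit counts."""
    counts = P_counts[(s, a)]  # {next_state: observed visit count}
    total = sum(counts.values())
    expected = sum((n / total) * max(Q.get((s2, a2), 0.0) for a2 in actions)
                   for s2, n in counts.items())
    return R_hat[(s, a)] + gamma * expected
```

Note that with stochastic transitions the one-pass BFS guarantee is lost: successors can have mutually dependent values, so a backup like this would generally need to be iterated to a fixed point, as in value iteration.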

7. Summary

Recursive Backwards Q-Learning replaces per-transition temporal-difference learning with per-episode, model-based dynamic programming on the explored state-action graph, achieving rapid convergence for deterministic, episodic MDPs. By constructing and inverting the explicit environment model, RBQL computes optimal Q-values for all encountered transitions in a single pass upon each episode termination. Empirical evidence demonstrates superior sample efficiency and convergence rates over standard Q-learning in grid-world maze tasks. The methodology is well-suited for deterministic, finite, episodic problems and offers clear potential for extension to more general environments (Diekhoff et al., 2024).
