
Recursive Backwards Q-Learning

Updated 26 February 2026
  • Recursive Backwards Q-Learning is a reinforcement learning algorithm that builds an explicit transition and reward model to enable direct optimal Q-value propagation in deterministic MDPs.
  • It employs a single backward sweep using breadth-first search to update Q-values, eliminating incremental temporal-difference bootstrapping and significantly speeding up convergence.
  • Empirical results in maze benchmarks show RBQL’s superior sample efficiency and rapid policy optimization compared to standard Q-learning.

Recursive Backwards Q-Learning (RBQL) is a reinforcement learning algorithm designed for finite, deterministic, episodic Markov Decision Processes (MDPs). RBQL departs from model-free approaches such as Q-learning by explicitly building a transition and reward model during exploration and, after each episode, performing a recursive backward value propagation that sets each $Q(s,a)$ directly to its optimal value. This eliminates slow temporal-difference bootstrapping and dramatically accelerates convergence in deterministic domains, as demonstrated empirically in shortest-path maze benchmarks (Diekhoff et al., 2024).

1. Deterministic Environment Model and Contrast to Model-Free Q-Learning

RBQL is formulated for settings where:

  • $S$ (state set) and $A$ (action set) are finite;
  • transitions are deterministic: for each $(s,a)$, the next state $s'$ and reward $r$ are fixed;
  • episodes begin in a prescribed start state $s_0$ and terminate in one or more terminal states $s_T$, the sole sources of positive reward.

Standard Q-learning, in contrast, does not construct or exploit a model of $P$ or $R$, and updates $Q(s,a)$ incrementally at each step:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\Bigl(R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a)\Bigr)$$

This results in slow per-episode propagation of terminal rewards, especially in large deterministic environments. RBQL, instead, builds $\widehat P$ (deterministic transitions) and $\widehat R$ (rewards) as it explores. When a terminal state is reached, it backpropagates value estimates via model inversion and a single backward sweep, directly setting all $Q$ values for the explored set to their fixed-point optimal values.
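For contrast, the standard incremental TD update above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the dict-based Q-table and the default hyperparameters are assumptions made for the example.

```python
from collections import defaultdict

def td_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One model-free Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# A single observed transition only moves Q(s,a) a fraction alpha
# toward the target -- this is the slow propagation RBQL avoids.
Q = defaultdict(float)
td_update(Q, s=0, a="right", r=1.0, s_next=1, actions=["left", "right"])
```

Note how even a reward of 1.0 only lifts `Q[(0, "right")]` to `alpha * 1.0`; repeated episodes are needed before the value settles, which is exactly the bottleneck RBQL's backward sweep removes.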

2. Formal Definitions and Update Equations

Let $\mathcal{M}=(S,A,P,R)$ denote the true MDP. RBQL maintains estimates:

  • $\widehat P(s,a) \equiv s'$: the deterministic next state;
  • $\widehat R(s,a) \equiv r$: the reward.

Upon reaching a terminal state $s_T$, RBQL assigns

$$\forall a\in A:\quad Q(s_T,a) = \widehat R(s_T,a) = 0$$

For any predecessor pair $(s,a)$, the recursive backup equation is

$$Q(s,a) = \widehat R(s,a) + \gamma\max_{a'\in A} Q\bigl(\widehat P(s,a), a'\bigr)$$
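The backup equation translates directly into code. The sketch below assumes the learned model is stored as dicts `P_hat[(s, a)] = s'` and `R_hat[(s, a)] = r`; these names are illustrative, not the paper's implementation.

```python
def backup(Q, P_hat, R_hat, s, a, actions, gamma=0.9):
    """Set Q(s,a) = R_hat(s,a) + gamma * max_a' Q(P_hat(s,a), a') in place."""
    s_next = P_hat[(s, a)]
    Q[(s, a)] = R_hat[(s, a)] + gamma * max(Q[(s_next, a2)] for a2 in actions)

# Tiny example: state 0 leads to state 1, whose best action is worth 1.0.
Q = {(1, "x"): 1.0, (1, "y"): 0.0, (0, "x"): 0.0}
P_hat = {(0, "x"): 1}
R_hat = {(0, "x"): 0.0}
backup(Q, P_hat, R_hat, 0, "x", actions=["x", "y"])
# Q[(0, "x")] is now 0.9 * 1.0 = 0.9, its exact optimal value in one step.
```

Unlike the TD update, this is an assignment rather than a learning-rate-weighted average: once the downstream values are final, the result is exact.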

To propagate values, the algorithm inverts the transition model to construct the predecessor set

$$\mathrm{Pred}(s) = \{(s',a') \mid \widehat P(s',a') = s\}$$

A breadth-first search (BFS) commencing from $s_T$ determines the update order, guaranteeing that each $Q(s,a)$ is set only once per episode and always after all of its downstream dependencies are finalized.
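The inversion and BFS ordering can be sketched as follows; the dict-of-sets representation of $\mathrm{Pred}$ and the `P_hat` encoding are assumptions made for this sketch.

```python
from collections import defaultdict, deque

def predecessors(P_hat):
    """Invert the transition model: pred[s] = {(s', a') | P_hat(s', a') = s}."""
    pred = defaultdict(set)
    for (s, a), s_next in P_hat.items():
        pred[s_next].add((s, a))
    return pred

def bfs_order(P_hat, s_terminal):
    """Return explored (s, a) pairs in BFS order outward from the terminal
    state, so each pair is backed up only after its downstream values."""
    pred, order = predecessors(P_hat), []
    frontier, seen = deque([s_terminal]), {s_terminal}
    while frontier:
        s = frontier.popleft()
        for (sp, ap) in pred[s]:
            order.append((sp, ap))
            if sp not in seen:
                seen.add(sp)
                frontier.append(sp)
    return order

# On the chain 0 -> 1 -> 2 (terminal), the sweep visits (1,.) before (0,.).
chain = {(0, "f"): 1, (1, "f"): 2}
```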

3. Algorithmic Procedure

The central algorithmic workflow comprises two alternating phases: exploration/model-building and recursive backward value propagation. The following summarizes the key steps:

  1. Initialization: all $\widehat P(s,a), \widehat R(s,a)$ are marked "unknown"; all $Q(s,a)$ are initialized to $0$.
  2. Exploration: in each episode, the agent explores unvisited $(s,a)$ pairs or exploits greedily based on current $Q$ values. Exploration scheduling uses $\epsilon$-decay.
  3. Model Update: after executing $a$ in $s$, the observed $(s',r)$ are recorded in $\widehat P, \widehat R$.
  4. Episode Termination: once a terminal state $s_T$ is reached, predecessors are computed by inverting $\widehat P$.
  5. Backward Sweep: a BFS order is established from $s_T$ over all explored state–action pairs. For each predecessor pair $(s',a')$, $Q$ is updated according to the recursive backup equation.
  6. Policy Improvement: the exploration flag and $\epsilon$ are decayed to balance exploration and exploitation.

No temporal-difference (TD) bootstrapping is performed; each $Q(s,a)$ is set to its global optimum with respect to the explored model after each episode.
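The two phases can be combined into a minimal end-to-end sketch on a toy deterministic chain 0 → 1 → 2 (terminal), with reward 1 on entering the terminal state. The environment encoding and the single-action setup are illustrative assumptions; a real run would interleave $\epsilon$-greedy exploration.

```python
from collections import defaultdict, deque

ACTIONS, GAMMA, TERMINAL = ["f"], 0.9, 2
P_true = {(0, "f"): (1, 0.0), (1, "f"): (2, 1.0)}  # (s, a) -> (s', r)

P_hat, R_hat, Q = {}, {}, defaultdict(float)

# Phase 1: exploration / model building (one episode along the chain).
s = 0
while s != TERMINAL:
    a = "f"                                    # epsilon-greedy in general
    s_next, r = P_true[(s, a)]
    P_hat[(s, a)], R_hat[(s, a)] = s_next, r   # record observed transition
    s = s_next

# Phase 2: single backward sweep in BFS order from the terminal state.
pred = defaultdict(set)
for (sp, ap), sn in P_hat.items():
    pred[sn].add((sp, ap))
frontier, seen = deque([TERMINAL]), {TERMINAL}
while frontier:
    s = frontier.popleft()
    for (sp, ap) in pred[s]:
        s_next = P_hat[(sp, ap)]
        Q[(sp, ap)] = R_hat[(sp, ap)] + GAMma * max(Q[(s_next, a2)] for a2 in ACTIONS) if False else \
            R_hat[(sp, ap)] + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        if sp not in seen:
            seen.add(sp)
            frontier.append(sp)

# After one episode: Q[(1, "f")] == 1.0 and Q[(0, "f")] == 0.9, both optimal.
```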

4. Theoretical Properties and Convergence

  • Model Optimality: the backward sweep computes the true optimal $Q(s,a)$ values with respect to the currently explored model; each $Q(s,a)$ is set exactly once per sweep, with no further adjustment needed until new $(s,a)$ pairs are discovered.
  • Policy Optimality: once all state–action pairs have been explored, RBQL reconstructs the globally optimal policy, as the single BFS backup is equivalent to dynamic-programming value iteration on the discovered model.
  • Sample Complexity: each episode allows all explored states to be evaluated optimally. For a maze of shortest-path depth $d$, RBQL reaches the solution in essentially one episode (plus exploration overhead), while standard Q-learning can require $O(d)$ episodes for equivalent credit propagation.
  • Computational Complexity: each backward sweep is $O(|S| + |A||S|)$ over the explored subset.
  • Convergence Dynamics: empirical evidence and theoretical argument indicate that most learning occurs at the first terminal encounter; later episodes serve mainly to integrate newly discovered transitions or actions.
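The sample-complexity gap is easy to illustrate numerically on a depth-$d$ chain with reward only on the final transition (an assumed toy setup, with learning rate 1 for clarity):

```python
d, gamma = 5, 0.9

def ql_episode(q):
    """One forward Q-learning pass with alpha = 1:
    credit moves back by only one state per episode."""
    for s in range(d):
        r = 1.0 if s == d - 1 else 0.0
        target = r + (gamma * q[s + 1] if s + 1 < d else 0.0)
        q[s] = target  # alpha = 1 update

ql = [0.0] * d
ql_episode(ql)
# After one episode, only the final state has a nonzero value.
one_episode_nonzero = sum(v > 0 for v in ql)

# A single RBQL-style backward sweep sets every state optimally at once.
rbql = [0.0] * d
for s in reversed(range(d)):
    r = 1.0 if s == d - 1 else 0.0
    rbql[s] = r + (gamma * rbql[s + 1] if s + 1 < d else 0.0)
```

After one pass, the forward TD agent has propagated the reward to a single state, while the backward sweep yields $\gamma^{d-1-s}$ at every state, matching the $O(d)$-episode versus one-episode contrast stated above.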

5. Empirical Evaluation: Maze Shortest-Path Task

Performance was measured on ensembles of $5\times5$, $10\times10$, and $15\times15$ random mazes, with each agent run for 25 episodes per maze and results averaged over 50 random mazes. The principal metrics are average steps per episode and episode-over-episode improvement factors.

Table 1: Ratio of average steps (Q-Learning vs. RBQL) in episode 0 and episode 24.

| Grid | QL steps (E0) | RBQL steps (E0) | Ratio (E0) | QL steps (E24) | RBQL steps (E24) | Ratio (E24) |
|---|---|---|---|---|---|---|
| $5\times5$ | 278.06 | 191.84 | 1.45 | 49.14 | 9.62 | 5.11 |
| $10\times10$ | 3308.46 | 843.52 | 3.92 | 281.44 | 23.68 | 11.89 |
| $15\times15$ | 7180.98 | 1965.00 | 3.65 | 778.68 | 35.96 | 21.65 |

Table 2: Step-count reduction factors from episode 0 to episode 24.

| Grid | QL steps (E0) | QL steps (E24) | QL factor | RBQL steps (E0) | RBQL steps (E24) | RBQL factor |
|---|---|---|---|---|---|---|
| $5\times5$ | 278.06 | 49.14 | 5.66 | 191.84 | 9.62 | 19.94 |
| $10\times10$ | 3308.46 | 281.44 | 11.76 | 843.52 | 23.68 | 35.62 |
| $15\times15$ | 7180.98 | 778.68 | 9.22 | 1965.00 | 35.96 | 90.76 |

For the $10\times10$ grid, RBQL nearly attains optimal step counts after 6 episodes (close to the lower bound $2s-2$ for an $s\times s$ maze), whereas Q-learning still averages hundreds of steps after 25 episodes.

6. Practical Considerations: Advantages, Limitations, and Extensions

Advantages

  • Accelerated value propagation: terminal reward is propagated instantly via a single backward sweep through the explored deterministic chain.
  • All discovered $Q$ values are set optimally in each episode.
  • Exploration focuses on as-yet-unseen actions, with greedy exploitation elsewhere.
  • Requires orders of magnitude fewer episodes and steps to converge compared to standard Q-learning.

Limitations

  • Applicability is restricted to deterministic transitions; for stochastic environments, direct adoption is not valid.
  • RBQL targets episodic MDPs with finite horizons and clear terminal states; continuous or infinite-horizon tasks are unsupported.
  • Requires storage of the full model (P^\widehat P, R^\widehat R) and incurs BFS computation per episode.

Potential Extensions

  • Extension to stochastic MDPs by estimating transition probabilities $p(s' \mid s,a)$ and generalizing the backup equation:

$$Q(s,a) \leftarrow \sum_{s'} p(s' \mid s,a)\Bigl(R(s,a,s') + \gamma\max_{a'} Q(s',a')\Bigr)$$

  • Model compression via state aggregation to mitigate memory and computation requirements.
  • Adaptation to multiple terminal states using either a dummy sink state or simultaneous backpropagation from all terminals.
  • Hybridization with function approximation (e.g., "Deep RBQL") to address large or continuous state spaces.
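The stochastic extension above is only proposed, not implemented, in the source; the following is a hedged sketch of how the expected backup might look, estimating $p(s' \mid s,a)$ from transition counts. All names (`counts`, `rewards`, `expected_backup`) are illustrative assumptions.

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = n
rewards = {}                                    # rewards[(s, a, s')] = r

def record(s, a, s_next, r):
    """Accumulate empirical transition statistics for (s, a)."""
    counts[(s, a)][s_next] += 1
    rewards[(s, a, s_next)] = r

def expected_backup(Q, s, a, actions, gamma=0.9):
    """Q(s,a) <- sum_s' p_hat(s'|s,a) * (R(s,a,s') + gamma * max_a' Q(s',a'))."""
    total = sum(counts[(s, a)].values())
    Q[(s, a)] = sum(
        (n / total) * (rewards[(s, a, sn)]
                       + gamma * max(Q[(sn, a2)] for a2 in actions))
        for sn, n in counts[(s, a)].items()
    )

# Two observed outcomes of (0, "f"): reward 1 half the time, 0 otherwise.
Q = defaultdict(float)
record(0, "f", 1, 1.0)
record(0, "f", 2, 0.0)
expected_backup(Q, 0, "f", actions=["f"])
# Q[(0, "f")] = 0.5 * 1.0 + 0.5 * 0.0 = 0.5
```

Note that with stochastic transitions a single backward sweep is no longer exact, since the model estimates themselves change with every sample; repeated sweeps (or full value iteration) would be required.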

7. Summary

Recursive Backwards Q-Learning replaces per-transition temporal-difference learning with per-episode, model-based dynamic programming on the explored state–action graph, achieving rapid convergence for deterministic, episodic MDPs. By constructing and inverting the explicit environment model, RBQL computes optimal $Q$ values for all encountered transitions in a single pass upon each episode termination. Empirical evidence demonstrates superior sample efficiency and convergence rates over standard Q-learning in grid-world maze tasks. The methodology is well-suited for deterministic, finite, episodic problems and offers clear potential for extension to more general environments (Diekhoff et al., 2024).
