Recursive Backwards Q-Learning
- Recursive Backwards Q-Learning is a reinforcement learning algorithm that builds an explicit transition and reward model to enable direct optimal Q-value propagation in deterministic MDPs.
- It employs a single backward sweep using breadth-first search to update Q-values, eliminating incremental temporal-difference bootstrapping and significantly speeding up convergence.
- Empirical results in maze benchmarks show RBQL’s superior sample efficiency and rapid policy optimization compared to standard Q-learning.
Recursive Backwards Q-Learning (RBQL) is a reinforcement learning algorithm designed for finite, deterministic, episodic Markov Decision Processes (MDPs). RBQL departs from model-free approaches such as Q-learning by explicitly building a transition and reward model during exploration and, after each episode, performing a recursive backward value propagation that sets each explored Q-value directly to its optimal value. This eliminates slow temporal-difference bootstrapping and dramatically accelerates convergence in deterministic domains, as demonstrated empirically in shortest-path maze benchmarks (Diekhoff et al., 2024).
1. Deterministic Environment Model and Contrast to Model-Free Q-Learning
RBQL is formulated for settings where:
- The state set $S$ and action set $A$ are finite;
- Transitions are deterministic: for each pair $(s,a)$, the next state $P(s,a)$ and the reward $R(s,a)$ are fixed;
- Episodes begin in a prescribed start state and terminate in one or more terminal states, the sole sources of positive reward.
Standard Q-learning, in contrast, does not construct or exploit a model of $P$ or $R$, and updates $Q(s,a)$ incrementally at each step via the temporal-difference rule
\begin{equation}
Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'\in A} Q(s',a') - Q(s,a)\right].
\end{equation}
This results in slow per-episode propagation of terminal rewards, especially in large deterministic environments. RBQL, instead, builds $\widehat P$ (deterministic transitions) and $\widehat R$ (rewards) as it explores. When a terminal state is reached, it backpropagates value estimates via model inversion and a single backward sweep, directly setting all $Q(s,a)$ values for the explored set to their fixed-point optimal values.
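For contrast, the one-step update that RBQL dispenses with can be rendered as a minimal Python sketch; the dictionary-based table and parameter names here are illustrative choices, not taken from the paper.

```python
def td_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Standard tabular Q-learning backup; Q is a dict keyed by (state, action)."""
    # Bootstrapped target: immediate reward plus discounted best estimate at s_next.
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    # Only a fraction alpha of the error is applied, so credit for a terminal
    # reward creeps backwards roughly one transition per greedy episode.
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)
```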
2. Formal Definitions and Update Equations
Let $M = (S, A, P, R, \gamma)$ denote the true (deterministic) MDP. RBQL maintains estimates:
- $\widehat P(s,a)$: the deterministic next state;
- $\widehat R(s,a)$: the reward.
Upon reaching a terminal state $s_T$, RBQL assigns
\begin{equation}
Q(s_T, a) = 0 \quad \text{for all } a \in A,
\end{equation}
so that terminal states contribute no future value. For any predecessor pair $(s,a)$, the recursive backup equation is
\begin{equation}
Q(s,a) = \widehat R(s,a) + \gamma\max_{a'\in A} Q(\widehat P(s,a), a').
\end{equation}
To propagate values, the algorithm inverts the transition model to construct the predecessor set
\begin{equation}
\mathrm{Pred}(s') = \{(s,a) \in S \times A : \widehat P(s,a) = s'\}.
\end{equation}
A breadth-first search (BFS) commencing from the terminal state(s) determines the update order, guaranteeing that each $Q(s,a)$ is set only once per episode and always after all of its downstream dependencies are finalized.
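This inversion-plus-BFS sweep can be sketched in Python as follows. The sketch assumes, per the setting above, a deterministic explored model with positive reward only on transitions into a terminal state; the container names (P_hat, R_hat) are illustrative.

```python
from collections import deque

def backward_sweep(P_hat, R_hat, terminals, gamma=0.95):
    # Invert the explored model: pred[s'] lists every (s, a) with P_hat[(s, a)] == s'.
    pred = {}
    for (s, a), s_next in P_hat.items():
        pred.setdefault(s_next, []).append((s, a))

    Q = {}                               # Q-values for explored (state, action) pairs
    V = {t: 0.0 for t in terminals}      # terminal states contribute no future value
    queue = deque(terminals)             # BFS backwards from the terminal state(s)
    while queue:
        s_next = queue.popleft()
        for (s, a) in pred.get(s_next, []):
            if (s, a) in Q:
                continue                 # each explored pair is assigned exactly once per sweep
            Q[(s, a)] = R_hat[(s, a)] + gamma * V[s_next]
            # With reward only on entering a terminal and gamma < 1, the first
            # (shortest-path) assignment already equals max_a Q(s, a), so V(s)
            # is finalized the first time the backward BFS reaches s.
            if s not in V:
                V[s] = Q[(s, a)]
                queue.append(s)
    return Q
```

Under these assumptions the sweep touches each explored transition exactly once, matching the per-episode cost discussed in Section 4.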
3. Algorithmic Procedure
The central algorithmic workflow comprises two alternating phases: exploration/model-building and recursive backward value propagation. The following summarizes the key steps:
- Initialization: All model entries $\widehat P(s,a)$ and $\widehat R(s,a)$ are marked "unknown"; all $Q(s,a)$ are initialized to zero.
- Exploration: For each episode, the agent explores uncharted $(s,a)$ pairs or exploits greedily based on current $Q$ values. Exploration scheduling leverages $\varepsilon$-decay.
- Model Update: After executing action $a$ in state $s$, the observed next state and reward are recorded in $\widehat P(s,a)$ and $\widehat R(s,a)$.
- Episode Termination: Once a terminal state is reached, predecessors are computed by inverting $\widehat P$.
- Backward Sweep: A BFS order is established from the terminal state over all explored states/actions. For each $(s,a)$ in that order, $Q(s,a)$ is updated according to the recursive backup equation.
- Policy Improvement: The exploration flag and $\varepsilon$ are decayed to balance exploration/exploitation.
No temporal-difference (TD) bootstrapping is performed; each $Q(s,a)$ is set to its global optimum with respect to the explored model after each episode.
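The loop can be tied together in a short sketch. The environment interface (env.reset(), env.step(s, a) returning (next_state, reward, done)) and the explore-unseen-actions-first rule are illustrative assumptions rather than the paper's specification; backward_sweep refers to the sketch in Section 2.

```python
import random

def rbql_episode(env, P_hat, R_hat, Q, actions, epsilon, gamma=0.95):
    s = env.reset()                                   # prescribed start state
    done = False
    while not done:
        untried = [a for a in actions if (s, a) not in P_hat]
        if untried and random.random() < epsilon:
            a = random.choice(untried)                # explore an as-yet-unseen action
        else:
            a = max(actions, key=lambda a2: Q.get((s, a2), 0.0))  # exploit greedily
        s_next, r, done = env.step(s, a)              # deterministic transition
        P_hat[(s, a)], R_hat[(s, a)] = s_next, r      # record the model
        s = s_next
    # Terminal reached: one backward sweep sets every explored Q(s, a) to its
    # optimum with respect to the model discovered so far.
    Q.update(backward_sweep(P_hat, R_hat, terminals=[s], gamma=gamma))
    return Q
```

Between episodes, $\varepsilon$ is decayed as described in the Policy Improvement step.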
4. Theoretical Properties and Convergence
- Model Optimality: The RBQL backward sweep computes the true optimal $Q$ values with respect to the currently explored model; each $Q(s,a)$ is set exactly once per sweep, with no further adjustment needed until new state–action pairs are discovered.
- Policy Optimality: Once all state–action pairs have been explored, RBQL reconstructs the globally optimal policy, as the single BFS backup is equivalent to dynamic programming value iteration on the discovered model.
- Sample Complexity: Each episode allows all explored states to be evaluated optimally. For a maze of shortest-path depth $d$, RBQL achieves a solution in essentially one successful episode (plus exploration overhead), while standard Q-learning can require on the order of $d$ episodes for equivalent credit propagation (a back-of-the-envelope version of this argument follows the list).
- Computational Complexity: Each backward sweep is linear in the number of explored transitions, i.e., at most $O(|S|\,|A|)$ over the explored subset.
- Convergence Dynamics: Empirical evidence and theoretical argument demonstrate that the majority of learning occurs at the first terminal encounter; later episodes serve mainly to integrate newly discovered transitions or actions.
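To make the sample-complexity contrast concrete, consider the following back-of-the-envelope argument (ours, relying only on the stated assumption that the sole positive reward $r_T$ is earned on the transition into a terminal state). After RBQL's first successful episode, the backward sweep yields
\begin{equation}
Q(s,a) = \gamma^{\,d(s,a)-1}\, r_T
\end{equation}
for every explored pair, where $d(s,a)$ is the number of explored steps from taking $a$ in $s$ to the terminal reward. One-step TD backups, by contrast, push nonzero value at most one transition further from the terminal per successful greedy episode, so roughly $d$ such episodes are needed before the values near the start state become informative.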
5. Empirical Evaluation: Maze Shortest-Path Task
Performance was measured on ensembles of random mazes at three increasing grid sizes, with each agent run for 25 episodes per maze (averaged over 50 random mazes per size). The principal metrics are average steps per episode and episode-over-episode improvement factors.
Table 1: Ratio of average steps (Q-Learning vs. RBQL) in episode 0 and episode 24.
| Maze | QL steps (E0) | RBQL steps (E0) | Ratio (E0) | QL steps (E24) | RBQL steps (E24) | Ratio (E24) |
|---|---|---|---|---|---|---|
| Small | 278.06 | 191.84 | 1.45 | 49.14 | 9.62 | 5.11 |
| Medium | 3308.46 | 843.52 | 3.92 | 281.44 | 23.68 | 11.89 |
| Large | 7180.98 | 1965.00 | 3.65 | 778.68 | 35.96 | 21.65 |
Table 2: Step count reduction factors from episode 0 to 24.
| Maze | QL steps (E0) | QL steps (E24) | QL Factor | RBQL steps (E0) | RBQL steps (E24) | RBQL Factor |
|---|---|---|---|---|---|---|
| Small | 278.06 | 49.14 | 5.66 | 191.84 | 9.62 | 19.94 |
| Medium | 3308.46 | 281.44 | 11.76 | 843.52 | 23.68 | 35.62 |
| Large | 7180.98 | 778.68 | 9.22 | 1965.00 | 35.96 | 90.76 |
For the larger mazes, after 6 episodes RBQL nearly attains optimal step counts (close to the shortest-path lower bound), whereas Q-learning still averages hundreds of steps after 25 episodes.
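For the smallest maze, for instance, the reduction factors reported in Table 2 follow directly from the episode-0 and episode-24 averages:
\begin{equation}
\frac{278.06}{49.14} \approx 5.66 \ \text{(Q-learning)}, \qquad \frac{191.84}{9.62} \approx 19.94 \ \text{(RBQL)}.
\end{equation}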
6. Practical Considerations: Advantages, Limitations, and Extensions
Advantages
- Accelerated value propagation: terminal reward is propagated instantly via a single backward sweep through the explored deterministic chain.
- All discovered $Q(s,a)$ values are set optimally in each episode.
- Exploration focuses on as-yet-unseen actions, with greedy exploitation elsewhere.
- Requires orders of magnitude fewer episodes and steps to converge compared to standard Q-learning.
Limitations
- Applicability is restricted to deterministic transitions; in stochastic environments the single-assignment backward sweep is not valid as stated.
- RBQL targets episodic MDPs with finite horizons and clear terminal states; continuous or infinite-horizon tasks are unsupported.
- Requires storage of the full model ($\widehat P$, $\widehat R$) and incurs BFS computation per episode.
Potential Extensions
- Extension to stochastic MDPs by estimating transition probabilities $\widehat P(s' \mid s,a)$ and generalizing the backup equation to
\begin{equation}
Q(s,a) = \widehat R(s,a) + \gamma \sum_{s' \in S} \widehat P(s' \mid s,a)\, \max_{a'\in A} Q(s',a')
\end{equation}
(a code sketch of this backup follows the list).
- Model compression via state aggregation to mitigate memory and computation requirements.
- Adaptation to multiple terminal states using either a dummy sink state or simultaneous backpropagation from all terminals.
- Hybridization with function approximation (e.g., "Deep RBQL") to address large or continuous state spaces.
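As a rough illustration of the stochastic extension, the generalized backup could be estimated from transition counts. The sketch below uses count-based empirical probabilities and a mean-reward estimate; these choices and names are our assumptions, not part of the published algorithm.

```python
def stochastic_backup(Q, s, a, counts, R_hat, actions, gamma=0.95):
    # counts[(s, a)] maps each observed next state to its visit count, giving
    # empirical probabilities P_hat(s' | s, a); R_hat[(s, a)] is the running
    # mean of observed rewards.  All names here are illustrative.
    total = sum(counts[(s, a)].values())
    expected_next = 0.0
    for s_next, n in counts[(s, a)].items():
        p = n / total                                             # empirical P(s' | s, a)
        expected_next += p * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    return R_hat[(s, a)] + gamma * expected_next
```

With stochastic successors and possible cycles, a single backward BFS pass no longer finalizes each value in one assignment; repeated sweeps (value iteration on the estimated model) would generally be required.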
7. Summary
Recursive Backwards Q-Learning replaces per-transition temporal-difference learning with per-episode, model-based dynamic programming on the explored state–action graph, achieving rapid convergence for deterministic, episodic MDPs. By constructing and inverting the explicit environment model, RBQL computes optimal $Q$ values for all encountered transitions in a single pass upon each episode termination. Empirical evidence demonstrates superior sample efficiency and convergence rates over standard Q-learning in grid-world maze tasks. The methodology is well-suited for deterministic, finite, episodic problems and offers clear potential for extension to more general environments (Diekhoff et al., 2024).