Recursive Backwards Q-Learning
- Recursive Backwards Q-Learning is a reinforcement learning algorithm that builds an explicit transition and reward model to enable direct optimal Q-value propagation in deterministic MDPs.
- It employs a single backward sweep using breadth-first search to update Q-values, eliminating incremental temporal-difference bootstrapping and significantly speeding up convergence.
- Empirical results in maze benchmarks show RBQL’s superior sample efficiency and rapid policy optimization compared to standard Q-learning.
Recursive Backwards Q-Learning (RBQL) is a reinforcement learning algorithm designed for finite, deterministic, episodic Markov Decision Processes (MDPs). RBQL departs from model-free approaches such as Q-learning by explicitly building a transition and reward model during exploration and, after each episode, performing a recursive backward value propagation that sets each explored Q-value directly to its optimal value. This eliminates slow temporal-difference bootstrapping and dramatically accelerates convergence in deterministic domains, as demonstrated empirically in shortest-path maze benchmarks (Diekhoff et al., 2024).
1. Deterministic Environment Model and Contrast to Model-Free Q-Learning
RBQL is formulated for settings where:
- The state set $S$ and action set $A$ are finite;
- Transitions are deterministic: for each pair $(s,a)$, the next state $P(s,a)$ and the reward $R(s,a)$ are fixed;
- Episodes begin in a prescribed start state and terminate in one or more terminal states, the sole sources of positive reward.
Standard Q-learning, in contrast, does not construct or exploit a model of $P$ or $R$, and updates $Q(s,a)$ incrementally at each step via the temporal-difference rule
\begin{equation}
Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'\in A} Q(s',a') - Q(s,a)\right].
\end{equation}
This results in slow per-episode propagation of terminal rewards, especially in large deterministic environments. RBQL, instead, builds $\widehat P$ (deterministic transitions) and $\widehat R$ (rewards) as it explores. When a terminal state is reached, it backpropagates value estimates via model inversion and a single backward sweep, directly setting all $Q(s,a)$ values for the explored set to their fixed-point optimal values.
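For contrast, the one-step update that RBQL dispenses with can be rendered as a minimal Python sketch; the dictionary-based table and parameter names here are illustrative choices, not taken from the paper.

```python
def td_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Standard tabular Q-learning backup; Q is a dict keyed by (state, action)."""
    # Bootstrapped target: immediate reward plus discounted best estimate at s_next.
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    # Only a fraction alpha of the error is applied, so credit for a terminal
    # reward creeps backwards roughly one transition per greedy episode.
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)
```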
2. Formal Definitions and Update Equations
Let $M = (S, A, P, R, \gamma)$ denote the true (deterministic) MDP. RBQL maintains estimates:
- $\widehat P(s,a)$: the deterministic next state;
- $\widehat R(s,a)$: the reward.
Upon reaching a terminal state $s_T$, RBQL assigns
\begin{equation}
Q(s_T, a) = 0 \quad \text{for all } a \in A,
\end{equation}
so that terminal states contribute no future value. For any predecessor pair $(s,a)$, the recursive backup equation is
\begin{equation}
Q(s,a) = \widehat R(s,a) + \gamma\max_{a'\in A} Q(\widehat P(s,a), a').
\end{equation}
To propagate values, the algorithm inverts the transition model to construct the predecessor set
\begin{equation}
\mathrm{Pred}(s') = \{(s,a) \in S \times A : \widehat P(s,a) = s'\}.
\end{equation}
A breadth-first search (BFS) commencing from the terminal state(s) determines the update order, guaranteeing that each $Q(s,a)$ is set only once per episode and always after all of its downstream dependencies are finalized.
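This inversion-plus-BFS sweep can be sketched in Python as follows. The sketch assumes, per the setting above, a deterministic explored model with positive reward only on transitions into a terminal state; the container names (P_hat, R_hat) are illustrative.

```python
from collections import deque

def backward_sweep(P_hat, R_hat, terminals, gamma=0.95):
    # Invert the explored model: pred[s'] lists every (s, a) with P_hat[(s, a)] == s'.
    pred = {}
    for (s, a), s_next in P_hat.items():
        pred.setdefault(s_next, []).append((s, a))

    Q = {}                               # Q-values for explored (state, action) pairs
    V = {t: 0.0 for t in terminals}      # terminal states contribute no future value
    queue = deque(terminals)             # BFS backwards from the terminal state(s)
    while queue:
        s_next = queue.popleft()
        for (s, a) in pred.get(s_next, []):
            if (s, a) in Q:
                continue                 # each explored pair is assigned exactly once per sweep
            Q[(s, a)] = R_hat[(s, a)] + gamma * V[s_next]
            # With reward only on entering a terminal and gamma < 1, the first
            # (shortest-path) assignment already equals max_a Q(s, a), so V(s)
            # is finalized the first time the backward BFS reaches s.
            if s not in V:
                V[s] = Q[(s, a)]
                queue.append(s)
    return Q
```

Under these assumptions the sweep touches each explored transition exactly once, matching the per-episode cost discussed in Section 4.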
3. Algorithmic Procedure
The central algorithmic workflow comprises two alternating phases: exploration/model-building and recursive backward value propagation. The following summarizes the key steps:
- Initialization: All model entries $\widehat P(s,a)$ and $\widehat R(s,a)$ are marked "unknown"; all $Q(s,a)$ are initialized to zero.
- Exploration: For each episode, the agent explores uncharted $(s,a)$ pairs or exploits greedily based on current $Q$ values. Exploration scheduling leverages $\varepsilon$-decay.
- Model Update: After executing action $a$ in state $s$, the observed next state and reward are recorded in $\widehat P(s,a)$ and $\widehat R(s,a)$.
- Episode Termination: Once a terminal state is reached, predecessors are computed by inverting $\widehat P$.
- Backward Sweep: A BFS order is established from the terminal state over all explored states/actions. For each $(s,a)$ in that order, $Q(s,a)$ is updated according to the recursive backup equation.
- Policy Improvement: The exploration flag and $\varepsilon$ are decayed to balance exploration/exploitation.
No temporal-difference (TD) bootstrapping is performed; each $Q(s,a)$ is set to its global optimum with respect to the explored model after each episode.
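The loop can be tied together in a short sketch. The environment interface (env.reset(), env.step(s, a) returning (next_state, reward, done)) and the explore-unseen-actions-first rule are illustrative assumptions rather than the paper's specification; backward_sweep refers to the sketch in Section 2.

```python
import random

def rbql_episode(env, P_hat, R_hat, Q, actions, epsilon, gamma=0.95):
    s = env.reset()                                   # prescribed start state
    done = False
    while not done:
        untried = [a for a in actions if (s, a) not in P_hat]
        if untried and random.random() < epsilon:
            a = random.choice(untried)                # explore an as-yet-unseen action
        else:
            a = max(actions, key=lambda a2: Q.get((s, a2), 0.0))  # exploit greedily
        s_next, r, done = env.step(s, a)              # deterministic transition
        P_hat[(s, a)], R_hat[(s, a)] = s_next, r      # record the model
        s = s_next
    # Terminal reached: one backward sweep sets every explored Q(s, a) to its
    # optimum with respect to the model discovered so far.
    Q.update(backward_sweep(P_hat, R_hat, terminals=[s], gamma=gamma))
    return Q
```

Between episodes, $\varepsilon$ is decayed as described in the Policy Improvement step.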
4. Theoretical Properties and Convergence
- Model Optimality: The RBQL backward sweep computes the true optimal $Q$ values with respect to the currently explored model; each $Q(s,a)$ is set exactly once per sweep, with no further adjustment needed until new state–action pairs are discovered.
- Policy Optimality: Once all state–action pairs have been explored, RBQL reconstructs the globally optimal policy, as the single BFS backup is equivalent to dynamic programming value iteration on the discovered model.
- Sample Complexity: Each episode allows all explored states to be evaluated optimally. For a maze of shortest-path depth $d$, RBQL achieves a solution in essentially one successful episode (plus exploration overhead), while standard Q-learning can require on the order of $d$ episodes for equivalent credit propagation (a back-of-the-envelope version of this argument follows the list).
- Computational Complexity: Each backward sweep is linear in the number of explored transitions, i.e., at most $O(|S|\,|A|)$ over the explored subset.
- Convergence Dynamics: Empirical evidence and theoretical argument demonstrate that the majority of learning occurs at the first terminal encounter; later episodes serve mainly to integrate newly discovered transitions or actions.
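To make the sample-complexity contrast concrete, consider the following back-of-the-envelope argument (ours, relying only on the stated assumption that the sole positive reward $r_T$ is earned on the transition into a terminal state). After RBQL's first successful episode, the backward sweep yields
\begin{equation}
Q(s,a) = \gamma^{\,d(s,a)-1}\, r_T
\end{equation}
for every explored pair, where $d(s,a)$ is the number of explored steps from taking $a$ in $s$ to the terminal reward. One-step TD backups, by contrast, push nonzero value at most one transition further from the terminal per successful greedy episode, so roughly $d$ such episodes are needed before the values near the start state become informative.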
5. Empirical Evaluation: Maze Shortest-Path Task
Performance was measured on ensembles of random mazes at three increasing grid sizes, with each agent run for 25 episodes per maze (averaged over 50 random mazes per size). The principal metrics are average steps per episode and episode-over-episode improvement factors.
Table 1: Ratio of average steps (Q-Learning vs. RBQL) in episode 0 and episode 24.
| Maze | QL steps (E0) | RBQL steps (E0) | Ratio (E0) | QL steps (E24) | RBQL steps (E24) | Ratio (E24) |
|---|---|---|---|---|---|---|
| Small | 278.06 | 191.84 | 1.45 | 49.14 | 9.62 | 5.11 |
| Medium | 3308.46 | 843.52 | 3.92 | 281.44 | 23.68 | 11.89 |
| Large | 7180.98 | 1965.00 | 3.65 | 778.68 | 35.96 | 21.65 |
Table 2: Step count reduction factors from episode 0 to 24.
| Maze | QL steps (E0) | QL steps (E24) | QL Factor | RBQL steps (E0) | RBQL steps (E24) | RBQL Factor |
|---|---|---|---|---|---|---|
| Small | 278.06 | 49.14 | 5.66 | 191.84 | 9.62 | 19.94 |
| Medium | 3308.46 | 281.44 | 11.76 | 843.52 | 23.68 | 35.62 |
| Large | 7180.98 | 778.68 | 9.22 | 1965.00 | 35.96 | 90.76 |
For the larger mazes, after 6 episodes RBQL nearly attains optimal step counts (close to the shortest-path lower bound), whereas Q-learning still averages hundreds of steps after 25 episodes.
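For the smallest maze, for instance, the reduction factors reported in Table 2 follow directly from the episode-0 and episode-24 averages:
\begin{equation}
\frac{278.06}{49.14} \approx 5.66 \ \text{(Q-learning)}, \qquad \frac{191.84}{9.62} \approx 19.94 \ \text{(RBQL)}.
\end{equation}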
6. Practical Considerations: Advantages, Limitations, and Extensions
Advantages
- Accelerated value propagation: terminal reward is propagated instantly via a single backward sweep through the explored deterministic chain.
- All discovered $Q(s,a)$ values are set optimally in each episode.
- Exploration focuses on as-yet-unseen actions, with greedy exploitation elsewhere.
- Requires orders of magnitude fewer episodes and steps to converge compared to standard Q-learning.
Limitations
- Applicability is restricted to deterministic transitions; in stochastic environments the single-assignment backward sweep is not valid as stated.
- RBQL targets episodic MDPs with finite horizons and clear terminal states; continuous or infinite-horizon tasks are unsupported.
- Requires storage of the full model ($\widehat P$, $\widehat R$) and incurs BFS computation per episode.
Potential Extensions
- Extension to stochastic MDPs by estimating transition probabilities $\widehat P(s' \mid s,a)$ and generalizing the backup equation to
\begin{equation}
Q(s,a) = \widehat R(s,a) + \gamma \sum_{s' \in S} \widehat P(s' \mid s,a)\, \max_{a'\in A} Q(s',a')
\end{equation}
(a code sketch of this backup follows the list).
- Model compression via state aggregation to mitigate memory and computation requirements.
- Adaptation to multiple terminal states using either a dummy sink state or simultaneous backpropagation from all terminals.
- Hybridization with function approximation (e.g., "Deep RBQL") to address large or continuous state spaces.
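As a rough illustration of the stochastic extension, the generalized backup could be estimated from transition counts. The sketch below uses count-based empirical probabilities and a mean-reward estimate; these choices and names are our assumptions, not part of the published algorithm.

```python
def stochastic_backup(Q, s, a, counts, R_hat, actions, gamma=0.95):
    # counts[(s, a)] maps each observed next state to its visit count, giving
    # empirical probabilities P_hat(s' | s, a); R_hat[(s, a)] is the running
    # mean of observed rewards.  All names here are illustrative.
    total = sum(counts[(s, a)].values())
    expected_next = 0.0
    for s_next, n in counts[(s, a)].items():
        p = n / total                                             # empirical P(s' | s, a)
        expected_next += p * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    return R_hat[(s, a)] + gamma * expected_next
```

With stochastic successors and possible cycles, a single backward BFS pass no longer finalizes each value in one assignment; repeated sweeps (value iteration on the estimated model) would generally be required.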
7. Summary
Recursive Backwards Q-Learning replaces per-transition temporal-difference learning with per-episode, model-based dynamic programming on the explored state–action graph, achieving rapid convergence for deterministic, episodic MDPs. By constructing and inverting the explicit environment model, RBQL computes optimal $Q$ values for all encountered transitions in a single pass upon each episode termination. Empirical evidence demonstrates superior sample efficiency and convergence rates over standard Q-learning in grid-world maze tasks. The methodology is well-suited for deterministic, finite, episodic problems and offers clear potential for extension to more general environments (Diekhoff et al., 2024).