
Direct Reinforcement Learning (Direct RL)

Updated 21 December 2025
  • Direct RL is a reinforcement learning approach that directly optimizes an explicit policy using gradient ascent to maximize expected cumulative reward.
  • It contrasts with value-based (indirect) methods by updating policy parameters directly, with REINFORCE and actor-critic algorithms as canonical instantiations.
  • Direct RL is particularly effective in high-dimensional or continuous control tasks, where it offers broad state coverage and, empirically, faster convergence than indirect variants.

Direct Reinforcement Learning (Direct RL) is a class of reinforcement learning algorithms that seek optimal policies in Markov Decision Processes (MDPs) by directly maximizing a policy’s expected cumulative reward through gradient-based optimization. Unlike value-based or indirect RL methods, which solve variants of the Bellman equation and induce a policy implicitly, Direct RL maintains an explicit parameterization of the policy and formulates policy improvement as stochastic, black-box optimization. This approach encompasses many modern policy-gradient and actor-critic algorithms and is particularly suited to high-dimensional or continuous control tasks where value-function approximation or model learning is challenging (Guan et al., 2019, Yaghmaie et al., 2021).

1. Definition and Distinction from Indirect RL

In Direct RL, the agent’s policy $\pi_\theta(a|s)$ is explicitly parameterized and the objective is to maximize the expected (discounted) total return:

$$J(\theta) = E_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t \right]$$

The optimization is performed directly over the policy parameters $\theta$ via gradient ascent. In contrast, Indirect RL methods instead attempt to find the fixed point of the Bellman optimality equation (on the value function $v^*$ or $Q^*$), and then derive a policy by acting greedily or stochastically with respect to the learned values. In indirect methods, the policy is implicit, typically the result of a post-optimization step. The essential contrast lies in whether the policy parameters are updated by direct maximization (Direct RL) or by first solving for value functions (Indirect RL) (Guan et al., 2019, Yaghmaie et al., 2021).
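
The distinction can be made concrete with a small tabular sketch. The snippet below is illustrative only and not taken from the cited papers: `grad_J` stands in for any of the gradient estimators discussed in Sections 2 and 3, and the problem sizes, step sizes, and function names are hypothetical.

```python
import numpy as np

S, A, alpha, gamma = 16, 4, 0.1, 0.9   # hypothetical problem size and step sizes

# --- Direct RL: explicit policy parameters, updated by gradient ascent on J(theta)
theta = np.zeros((S, A))               # softmax policy parameters

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)        # pi_theta(a|s); rows sum to 1

def direct_update(theta, grad_J):
    return theta + alpha * grad_J                  # gradient ascent on the return

# --- Indirect RL: solve for Q*; the policy is implicit (greedy w.r.t. Q)
Q = np.zeros((S, A))

def indirect_update(Q, s, a, r, s_next):
    target = r + gamma * Q[s_next].max()           # Bellman optimality backup
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def greedy_policy(Q, s):
    return int(Q[s].argmax())                      # policy derived post hoc from values
```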

2. Policy Gradient Theorem and Gradient Estimation

The mathematical foundation of Direct RL is the policy gradient theorem. Starting from

$$J(\theta) = E_{s_0 \sim d^0,\, a_t \sim \pi_\theta,\, s_{t+1} \sim p} \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \right]$$

and defining the $\gamma$-discounted state occupancy $d^\gamma(s|\pi_\theta)$, the gradient can be written as:

$$\nabla_\theta J(\theta) = \sum_s d^\gamma(s|\pi_\theta) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)$$

Under mild assumptions (irreducible, aperiodic Markov chain, and for $\gamma \to 1$), normalized $d^\gamma(s)$ can be replaced with the stationary distribution $d^\pi(s)$, yielding the standard form:

$$\nabla_\theta J(\theta) = E_{s \sim d^\pi,\, a \sim \pi_\theta} \left[\nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s,a)\right]$$

These gradients are estimated from sampled rollouts and used in Monte Carlo and actor-critic algorithms (Guan et al., 2019, Yaghmaie et al., 2021).
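
As an illustration (not from the cited papers), the following sketch estimates this expectation from on-policy rollouts for a tabular softmax policy; `rollouts` is a hypothetical list of trajectories, each a sequence of `(s, a, q_hat)` tuples with `q_hat` an estimate of $Q^\pi(s,a)$ such as a sampled return.

```python
import numpy as np

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_gradient_estimate(theta, rollouts):
    """Monte Carlo estimate of grad J: average of grad log pi(a|s) * Q_hat(s,a)
    over on-policy samples, which approximately follow d^pi."""
    probs = softmax_policy(theta)
    grad = np.zeros_like(theta)
    n = 0
    for traj in rollouts:
        for s, a, q_hat in traj:
            score = -probs[s]             # grad_theta log pi(a|s) for a softmax
            score[a] += 1.0               #   policy: onehot(a) - pi(.|s), row s only
            grad[s] += score * q_hat
            n += 1
    return grad / max(n, 1)
```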

3. Algorithmic Implementations: REINFORCE, Actor-Critic, and Variants

A canonical instantiation of Direct RL is the REINFORCE algorithm, which uses sampled trajectories to estimate the policy gradient via the “log-likelihood times return” formula. Variance in these estimates is controlled by adopting the “reward-to-go” $G_t$ at each timestep and subtracting a baseline $b(s_t)$:

$$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^M \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \left(G_t^{(i)} - b(s_t^{(i)})\right)$$
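
A minimal PyTorch sketch of this estimator is given below; it is illustrative rather than a reference implementation. `trajectories` is a hypothetical list of per-episode `(log_probs, rewards)` pairs collected under $\pi_\theta$, and a per-episode mean return stands in for the baseline $b(s_t)$.

```python
import torch

gamma = 0.99

def reward_to_go(rewards):
    """G_t = sum_{k >= t} gamma^(k - t) r_k, computed backwards over one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def reinforce_loss(trajectories):
    """Surrogate loss whose gradient is the negated estimator above, so that
    minimizing it performs gradient ascent on J(theta)."""
    terms = []
    for log_probs, rewards in trajectories:        # log_probs: list of scalar tensors
        G = torch.tensor(reward_to_go(rewards))
        baseline = G.mean()                        # simple stand-in for b(s_t)
        for lp, g in zip(log_probs, G):
            terms.append(-lp * (g - baseline))     # -log pi(a_t|s_t) * (G_t - b)
    return torch.stack(terms).mean()

# usage sketch: loss = reinforce_loss(trajs); loss.backward(); optimizer.step()
```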

Actor-critic methods extend this by fitting a parametric critic $Q_w(s,a)$ or $V_w(s)$ to approximate value functions and using these in place of the empirical returns for gradient estimation. The actor update becomes:

$$\nabla_\theta J(\theta) \approx E_{s, a} \left[\nabla_\theta \log \pi_\theta(a|s)\, Q_w(s,a)\right]$$

Alternating actor (policy) and critic (value) updates yields the actor-critic architecture, ubiquitous in modern continuous control RL (Guan et al., 2019, Yaghmaie et al., 2021). Variance-reduction techniques (advantage estimation, baseline functions, control variates) are employed for sample efficiency (Yaghmaie et al., 2021).
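
The alternating structure can be sketched as follows. This is a schematic one-step advantage actor-critic update, not the algorithm of any cited paper: `actor` and `critic` are hypothetical torch modules mapping a state tensor to action logits and to $V_w(s)$ respectively, and the temporal-difference error serves as an advantage-style control variate.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

gamma = 0.99

def actor_critic_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, done):
    """One alternating update from a single transition (s, a, r, s')."""
    v = critic(s)                                           # V_w(s)
    with torch.no_grad():
        target = r + gamma * critic(s_next) * (1.0 - done)  # bootstrapped target
    advantage = (target - v).detach()                       # ~ A(s, a), control variate

    # critic update: regress V_w(s) toward the bootstrapped target
    critic_loss = F.mse_loss(v, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor update: ascend log pi(a|s) * advantage (critic estimate replaces the return)
    log_prob = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```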

4. The Role of the State Distribution and Convergence Properties

Accurate estimation of the policy gradient depends crucially on sampling states from the correct distribution. The stationary distribution $d^\pi(s)$ under policy $\pi$ ensures unbiased estimation, as it reflects the long-term visitation frequencies of states. If policy updates are performed using samples from the initial distribution $d^0$, learning may fail for states not present in $d^0$. Empirically, using $d^\pi$ or its discounted counterpart $d^\gamma$ enables optimization over all reachable states. These properties are experimentally validated in Gridworld settings, where policy-gradient variants using the stationary or discounted state distributions outperform versions constrained by the initial state distribution (Guan et al., 2019).
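
For a tabular MDP these distributions can be computed exactly, which is one way to check a sampling scheme. The sketch below makes the usual assumptions (hypothetical names): `P_pi` is the $S \times S$ state-transition matrix induced by the current policy, with rows summing to one, and `d0` is the initial state distribution.

```python
import numpy as np

def discounted_occupancy(P_pi, d0, gamma=0.9):
    """Normalized discounted occupancy: d^gamma = (1 - gamma) * d0 (I - gamma P_pi)^{-1}."""
    S = P_pi.shape[0]
    return (1.0 - gamma) * d0 @ np.linalg.inv(np.eye(S) - gamma * P_pi)

def stationary_distribution(P_pi):
    """d^pi solves d^pi = d^pi P_pi: the left eigenvector of P_pi with eigenvalue 1."""
    vals, vecs = np.linalg.eig(P_pi.T)
    d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return d / d.sum()
```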

5. Empirical Comparisons and Experimental Benchmarks

In experiments comparing five policy-gradient variants in a 16-state Gridworld ($\gamma=0.9$), Direct PG (using $d^\gamma$ and true Monte Carlo returns) achieves broad state coverage and faster convergence compared to Indirect PG (using $d^0$ and an approximate critic). Unified policy-gradient formulations, employing the stationary state distribution and approximate critics, interpolate between Direct and Indirect RL and match the best empirical performance. Policy-entropy curves show that all variants eventually induce greedy policies, but indirect forms with approximate value functions converge more slowly and may fail to learn optimal policies if the initial distribution omits reachable states (Guan et al., 2019).

6. Taxonomy of RL Algorithms: Direct vs Indirect, Model-Free vs Model-Based

The direct/indirect RL classification aligns with the traditional policy-based/value-based distinction but is orthogonal to the model-based/model-free distinction. The major classes of algorithms can be summarized as:

| Category | Direct RL (policy-based) | Indirect RL (value-based / approximate policy iteration) |
|---|---|---|
| Model-free | REINFORCE, A2C/A3C, DDPG, TD3, SAC | Q-learning, DQN, SARSA, Distributional RL |
| Model-based | PILCO, ME-TRPO, Dreamer, GPS, MVE | Dyna, PILCO (indirect), MBPO, STEVE |

Direct RL sits squarely in the “policy-based” column and spans both model-free and model-based approaches. Its hallmark is explicit optimization of the long-horizon return objective, rather than fixed-point iteration over value functions (Guan et al., 2019).

Recent advances extend Direct RL to settings where reward specification is unavailable or impractical, by maximizing the future probability of reaching states sampled as “success examples.” In these example-based policy search methods, a recursive classification approach directly optimizes a value function, bypassing explicit reward learning and satisfying a data-driven Bellman recursion. Theoretical analysis shows convergence guarantees in the tabular limit and monotonic improvement. Experiments on continuous control and vision-based manipulation tasks confirm the advantages of Direct RL variants using example-based task specifications (Eysenbach et al., 2021).
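
As a rough, simplified illustration of a data-driven Bellman recursion over future success probability (a schematic sketch, not the exact update of Eysenbach et al., 2021), one can regress a success classifier toward a bootstrapped target built from success examples and observed transitions; `classifier`, `target_classifier`, and `policy` are hypothetical torch modules, with the classifier outputting probabilities in $[0, 1]$.

```python
import torch
import torch.nn.functional as F

gamma = 0.99

def classifier_update(classifier, target_classifier, policy, opt,
                      success_s, success_a, s, a, s_next):
    """One gradient step: success examples are pushed toward label 1, while ordinary
    transitions bootstrap a discounted estimate of future success at s'."""
    # success examples: states (and actions) drawn from the user-provided success set
    succ_pred = classifier(success_s, success_a)
    succ_loss = F.binary_cross_entropy(succ_pred, torch.ones_like(succ_pred))

    # transitions: target is the discounted future-success estimate at the next state
    with torch.no_grad():
        a_next = policy(s_next)
        boot = gamma * target_classifier(s_next, a_next)   # soft target in [0, gamma]
    boot_loss = F.binary_cross_entropy(classifier(s, a), boot)

    loss = succ_loss + boot_loss     # relative weighting is a free choice in this sketch
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```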
