
Direct Reinforcement Learning (Direct RL)

Updated 21 December 2025
  • Direct RL is a reinforcement learning approach that directly optimizes an explicit policy using gradient ascent to maximize expected cumulative reward.
  • It contrasts with value-based (indirect) methods by updating policy parameters directly, with REINFORCE and actor-critic algorithms as canonical instantiations.
  • Direct RL is particularly effective in high-dimensional or continuous control tasks, where it offers broad state coverage and, empirically, faster convergence than indirect variants.

Direct Reinforcement Learning (Direct RL) is a class of reinforcement learning algorithms that seek optimal policies in Markov Decision Processes (MDPs) by directly maximizing a policy’s expected cumulative reward through gradient-based optimization. Unlike value-based or indirect RL methods, which solve variants of the Bellman equation and induce a policy implicitly, Direct RL maintains an explicit parameterization of the policy and formulates policy improvement as stochastic, black-box optimization. This approach encompasses many modern policy-gradient and actor-critic algorithms and is particularly suited to high-dimensional or continuous control tasks where value-function approximation or model learning is challenging (Guan et al., 2019, Yaghmaie et al., 2021).

1. Definition and Distinction from Indirect RL

In Direct RL, the agent’s policy $\pi_\theta(a|s)$ is explicitly parameterized and the objective is to maximize the expected (discounted) total return:

$$J(\theta) = E_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t \right]$$

The optimization is performed directly over the policy parameters $\theta$ via gradient ascent. In contrast, Indirect RL methods instead attempt to find the fixed point of the Bellman optimality equation (on the value function $v^*$ or $Q^*$), and then derive a policy by acting greedily or stochastically with respect to the learned values. In indirect methods, the policy is implicit, typically the result of a post-optimization step. The essential contrast lies in whether the policy parameters are updated by direct maximization (Direct RL) or by first solving for value functions (Indirect RL) (Guan et al., 2019, Yaghmaie et al., 2021).
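
The distinction can be made concrete with a small tabular sketch. The snippet below is illustrative only and not taken from the cited papers: `grad_J` stands in for any of the gradient estimators discussed in Sections 2 and 3, and the problem sizes, step sizes, and function names are hypothetical.

```python
import numpy as np

S, A, alpha, gamma = 16, 4, 0.1, 0.9   # hypothetical problem size and step sizes

# --- Direct RL: explicit policy parameters, updated by gradient ascent on J(theta)
theta = np.zeros((S, A))               # softmax policy parameters

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)        # pi_theta(a|s); rows sum to 1

def direct_update(theta, grad_J):
    return theta + alpha * grad_J                  # gradient ascent on the return

# --- Indirect RL: solve for Q*; the policy is implicit (greedy w.r.t. Q)
Q = np.zeros((S, A))

def indirect_update(Q, s, a, r, s_next):
    target = r + gamma * Q[s_next].max()           # Bellman optimality backup
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def greedy_policy(Q, s):
    return int(Q[s].argmax())                      # policy derived post hoc from values
```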

2. Policy Gradient Theorem and Gradient Estimation

The mathematical foundation of Direct RL is the policy gradient theorem. Starting from

$$J(\theta) = E_{s_0 \sim d^0,\, a_t \sim \pi_\theta,\, s_{t+1} \sim p} \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \right]$$

and defining the $\gamma$-discounted state occupancy $d^\gamma(s|\pi_\theta)$, the gradient can be written as:

$$\nabla_\theta J(\theta) = \sum_s d^\gamma(s|\pi_\theta) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)$$

Under mild assumptions (irreducible, aperiodic Markov chain, and for $\gamma \to 1$), normalized $d^\gamma(s)$ can be replaced with the stationary distribution $d^\pi(s)$, yielding the standard form:

$$\nabla_\theta J(\theta) = E_{s \sim d^\pi,\, a \sim \pi_\theta} \left[\nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s,a)\right]$$

These gradients are estimated from sampled rollouts and used in Monte Carlo and actor-critic algorithms (Guan et al., 2019, Yaghmaie et al., 2021).
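
As an illustration (not from the cited papers), the following sketch estimates this expectation from on-policy rollouts for a tabular softmax policy; `rollouts` is a hypothetical list of trajectories, each a sequence of `(s, a, q_hat)` tuples with `q_hat` an estimate of $Q^\pi(s,a)$ such as a sampled return.

```python
import numpy as np

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_gradient_estimate(theta, rollouts):
    """Monte Carlo estimate of grad J: average of grad log pi(a|s) * Q_hat(s,a)
    over on-policy samples, which approximately follow d^pi."""
    probs = softmax_policy(theta)
    grad = np.zeros_like(theta)
    n = 0
    for traj in rollouts:
        for s, a, q_hat in traj:
            score = -probs[s]             # grad_theta log pi(a|s) for a softmax
            score[a] += 1.0               #   policy: onehot(a) - pi(.|s), row s only
            grad[s] += score * q_hat
            n += 1
    return grad / max(n, 1)
```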

3. Algorithmic Implementations: REINFORCE, Actor-Critic, and Variants

A canonical instantiation of Direct RL is the REINFORCE algorithm, which uses sampled trajectories to estimate the policy gradient via the “log-likelihood times return” formula. Variance in these estimates is controlled by adopting the “reward-to-go” $G_t$ at each timestep and subtracting a baseline $b(s_t)$:

$$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^M \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \left(G_t^{(i)} - b(s_t^{(i)})\right)$$
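
A minimal PyTorch sketch of this estimator is given below; it is illustrative rather than a reference implementation. `trajectories` is a hypothetical list of per-episode `(log_probs, rewards)` pairs collected under $\pi_\theta$, and a per-episode mean return stands in for the baseline $b(s_t)$.

```python
import torch

gamma = 0.99

def reward_to_go(rewards):
    """G_t = sum_{k >= t} gamma^(k - t) r_k, computed backwards over one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def reinforce_loss(trajectories):
    """Surrogate loss whose gradient is the negated estimator above, so that
    minimizing it performs gradient ascent on J(theta)."""
    terms = []
    for log_probs, rewards in trajectories:        # log_probs: list of scalar tensors
        G = torch.tensor(reward_to_go(rewards))
        baseline = G.mean()                        # simple stand-in for b(s_t)
        for lp, g in zip(log_probs, G):
            terms.append(-lp * (g - baseline))     # -log pi(a_t|s_t) * (G_t - b)
    return torch.stack(terms).mean()

# usage sketch: loss = reinforce_loss(trajs); loss.backward(); optimizer.step()
```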

Actor-critic methods extend this by fitting a parametric critic $Q_w(s,a)$ or $V_w(s)$ to approximate value functions and using these in place of the empirical returns for gradient estimation. The actor update becomes:

$$\nabla_\theta J(\theta) \approx E_{s, a} \left[\nabla_\theta \log \pi_\theta(a|s)\, Q_w(s,a)\right]$$

Alternating actor (policy) and critic (value) updates yields the actor-critic architecture, ubiquitous in modern continuous control RL (Guan et al., 2019, Yaghmaie et al., 2021). Variance-reduction techniques (advantage estimation, baseline functions, control variates) are employed for sample efficiency (Yaghmaie et al., 2021).
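
The alternating structure can be sketched as follows. This is a schematic one-step advantage actor-critic update, not the algorithm of any cited paper: `actor` and `critic` are hypothetical torch modules mapping a state tensor to action logits and to $V_w(s)$ respectively, and the temporal-difference error serves as an advantage-style control variate.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

gamma = 0.99

def actor_critic_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, done):
    """One alternating update from a single transition (s, a, r, s')."""
    v = critic(s)                                           # V_w(s)
    with torch.no_grad():
        target = r + gamma * critic(s_next) * (1.0 - done)  # bootstrapped target
    advantage = (target - v).detach()                       # ~ A(s, a), control variate

    # critic update: regress V_w(s) toward the bootstrapped target
    critic_loss = F.mse_loss(v, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor update: ascend log pi(a|s) * advantage (critic estimate replaces the return)
    log_prob = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```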

4. The Role of the State Distribution and Convergence Properties

Accurate estimation of the policy gradient depends crucially on sampling states from the correct distribution. The stationary distribution $d^\pi(s)$ under policy $\pi$ ensures unbiased estimation, as it reflects the long-term visitation frequencies of states. If policy updates are performed using samples from the initial distribution $d^0$, learning may fail for states not present in $d^0$. Empirically, using $d^\pi$ or its discounted counterpart $d^\gamma$ enables optimization over all reachable states. These properties are experimentally validated in Gridworld settings, where policy-gradient variants using the stationary or discounted state distributions outperform versions constrained by the initial state distribution (Guan et al., 2019).
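
For a tabular MDP these distributions can be computed exactly, which is one way to check a sampling scheme. The sketch below makes the usual assumptions (hypothetical names): `P_pi` is the $S \times S$ state-transition matrix induced by the current policy, with rows summing to one, and `d0` is the initial state distribution.

```python
import numpy as np

def discounted_occupancy(P_pi, d0, gamma=0.9):
    """Normalized discounted occupancy: d^gamma = (1 - gamma) * d0 (I - gamma P_pi)^{-1}."""
    S = P_pi.shape[0]
    return (1.0 - gamma) * d0 @ np.linalg.inv(np.eye(S) - gamma * P_pi)

def stationary_distribution(P_pi):
    """d^pi solves d^pi = d^pi P_pi: the left eigenvector of P_pi with eigenvalue 1."""
    vals, vecs = np.linalg.eig(P_pi.T)
    d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return d / d.sum()
```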

5. Empirical Comparisons and Experimental Benchmarks

In experiments comparing five policy-gradient variants in a 16-state Gridworld ($\gamma=0.9$), Direct PG (using $d^\gamma$ and true Monte Carlo returns) achieves broad state coverage and faster convergence compared to Indirect PG (using $d^0$ and an approximate critic). Unified policy-gradient formulations, employing the stationary state distribution and approximate critics, interpolate between Direct and Indirect RL and match the best empirical performance. Policy-entropy curves show that all variants eventually induce greedy policies, but indirect forms with approximate value functions converge more slowly and may fail to learn optimal policies if the initial distribution omits reachable states (Guan et al., 2019).

6. Taxonomy of RL Algorithms: Direct vs Indirect, Model-Free vs Model-Based

The direct/indirect RL classification aligns with the traditional policy-based/value-based distinction but is orthogonal to the model-based/model-free distinction. The major classes of algorithms can be summarized as:

| Category | Direct RL (policy-based) | Indirect RL (value-based / approximate policy iteration) |
|---|---|---|
| Model-free | REINFORCE, A2C/A3C, DDPG, TD3, SAC | Q-learning, DQN, SARSA, Distributional RL |
| Model-based | PILCO, ME-TRPO, Dreamer, GPS, MVE | Dyna, PILCO (indirect), MBPO, STEVE |

Direct RL sits squarely in the “policy-based” column and spans both model-free and model-based approaches. Its hallmark is explicit optimization of the long-horizon return objective, rather than fixed-point iteration over value functions (Guan et al., 2019).

Recent advances extend Direct RL to settings where reward specification is unavailable or impractical, by maximizing the future probability of reaching states sampled as “success examples.” In these example-based policy search methods, a recursive classification approach directly optimizes a value function, bypassing explicit reward learning and satisfying a data-driven Bellman recursion. Theoretical analysis shows convergence guarantees in the tabular limit and monotonic improvement. Experiments on continuous control and vision-based manipulation tasks confirm the advantages of Direct RL variants using example-based task specifications (Eysenbach et al., 2021).
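
As a rough, simplified illustration of a data-driven Bellman recursion over future success probability (a schematic sketch, not the exact update of Eysenbach et al., 2021), one can regress a success classifier toward a bootstrapped target built from success examples and observed transitions; `classifier`, `target_classifier`, and `policy` are hypothetical torch modules, with the classifier outputting probabilities in $[0, 1]$.

```python
import torch
import torch.nn.functional as F

gamma = 0.99

def classifier_update(classifier, target_classifier, policy, opt,
                      success_s, success_a, s, a, s_next):
    """One gradient step: success examples are pushed toward label 1, while ordinary
    transitions bootstrap a discounted estimate of future success at s'."""
    # success examples: states (and actions) drawn from the user-provided success set
    succ_pred = classifier(success_s, success_a)
    succ_loss = F.binary_cross_entropy(succ_pred, torch.ones_like(succ_pred))

    # transitions: target is the discounted future-success estimate at the next state
    with torch.no_grad():
        a_next = policy(s_next)
        boot = gamma * target_classifier(s_next, a_next)   # soft target in [0, gamma]
    boot_loss = F.binary_cross_entropy(classifier(s, a), boot)

    loss = succ_loss + boot_loss     # relative weighting is a free choice in this sketch
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```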
