This chapter provides a comprehensive overview of Reinforcement Learning (RL), a framework for adaptive control where agents learn optimal behaviors through trial-and-error interactions with an environment. It covers the foundational concepts, major algorithmic families, and important extensions relevant for practical applications.
1. Background
- Markov Decision Process (MDP): RL problems are typically formalized as MDPs, defined by states (S), actions (A), transition probabilities (T(s,a,s′)), a reward function (R(s,a)), a discount factor (γ), and a horizon (H).
- Goal: The objective is to find a policy π (a mapping from states to actions or action probabilities) that maximizes the expected discounted sum of future rewards (the return).
- Value Functions:
- State-value function vπ(s): Expected return starting from state s and following policy π.
- Action-value function Qπ(s,a): Expected return starting from state s, taking action a, and then following policy π.
- Bellman Equations: These provide recursive relationships for value functions, forming the basis for many RL algorithms.
- Bellman evaluation equation (for a given π): vπ(s) = R(s, π(s)) + γ ∑_{s′} T(s, π(s), s′) vπ(s′)
- Bellman optimality equation (for the optimal v∗): v∗(s) = max_a [ R(s, a) + γ ∑_{s′} T(s, a, s′) v∗(s′) ]
- Planning vs. Learning: Planning (e.g., Value Iteration, Policy Iteration) assumes a known MDP model. RL deals with unknown models, learning from interaction samples (s,a,r,s′). A minimal value-iteration sketch appears after this list.
- Core RL Algorithms (Tabular):
- TD(0): Estimates vπ using updates based on the temporal difference error: v(s)←v(s)+α(r+γv(s′)−v(s)).
- SARSA (On-Policy): Estimates Qπ using the update: Q(s,a)←Q(s,a)+α(r+γQ(s′,a′)−Q(s,a)), where a′ is the action actually taken in s′ by the current policy.
- Q-learning (Off-Policy): Estimates the optimal Q∗ using the update: Q(s,a) ← Q(s,a) + α(r + γ max_{a′} Q(s′,a′) − Q(s,a)). It learns the optimal policy regardless of the exploration policy used (see the tabular Q-learning sketch after this list).
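To make the planning case concrete, below is a minimal value-iteration sketch for a known tabular MDP. The array names, shapes, and convergence tolerance are illustrative assumptions, not notation fixed by the chapter.

```python
import numpy as np

def value_iteration(T, R, gamma=0.99, tol=1e-6):
    """Value iteration on a known MDP.

    T[s, a, s'] is the transition probability and R[s, a] the expected
    immediate reward; both are assumed given as NumPy arrays.
    """
    n_states, n_actions, _ = T.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * (T @ v)             # q[s, a] = R(s, a) + γ Σ_{s'} T(s, a, s') v(s')
        v_new = q.max(axis=1)               # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # optimal values and a greedy (optimal) policy
        v = v_new
```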
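The tabular Q-learning update can be sketched as follows. The environment interface (reset() returning an integer state, step(a) returning (s′, r, done)) and the ε-greedy exploration schedule are assumptions made for the example.

```python
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.99,
               epsilon=0.1, episodes=5000):
    """Tabular Q-learning with epsilon-greedy exploration (off-policy)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy; the update below targets the greedy policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])   # TD update toward r + γ max_a' Q(s', a')
            s = s_next
    return Q
```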
2. Value-Based Methods with Function Approximation
When state-action spaces are large, exact representation is infeasible. Function approximation parameterizes value functions (e.g., vθ(s) or Qθ(s,a)) and learns the parameters θ.
- Linear Function Approximation: vθ(s)=θ⊺ϕ(s) or Qθ(s,a)=θ⊺ϕ(s,a), using basis functions ϕ.
- Stochastic Gradient Descent (SGD) Methods:
- Bootstrapped Methods: Update parameters by minimizing the difference between the current estimate and a bootstrapped target (e.g., r+γvθ(s′)). Examples include TD, SARSA, and Q-learning with function approximation.
- Linear TD(0) update: θ ← θ + α(r + γ θ⊺ϕ(s′) − θ⊺ϕ(s)) ϕ(s) (implemented in the first sketch after this list).
- Residual Methods: Minimize the Bellman residual directly (e.g., vθ(s)−(r+γvθ(s′))). Requires careful handling due to correlation between vθ(s) and vθ(s′), often needing techniques like double sampling.
- Least-Squares Methods: Offer potentially faster convergence than SGD.
- LSTD (Least-Squares Temporal Difference): Finds a closed-form solution for linear approximation by minimizing the projected Bellman error. Batch method (see the LSTD sketch after this list).
- LSPI (Least-Squares Policy Iteration): Combines LSTD with policy improvement steps.
- Iterative Projected Fixed-Point Methods: Apply Bellman operators and project the result back onto the function approximator's space. Convergence relies on the composed operator being a contraction, which isn't guaranteed with approximation.
- FQI (Fitted Q-Iteration): A popular implementation using batch learning and regression algorithms (like trees or neural networks).
- Deep RL (Value-Based): Uses Deep Neural Networks (DNNs) as function approximators.
- Challenges: Data inefficiency, violation of i.i.d. data assumption for SGD, learning instability.
- DQN (Deep Q-Network): Addresses challenges using:
- Experience Replay: Stores transitions (s,a,r,s′) in a buffer and samples mini-batches randomly to break correlations and reuse data.
- Target Network: Uses a separate, slowly updated network (θ−) to generate stable TD targets: yt = rt + γ max_b Qθ−(s′t, b). The main network (θ) is updated towards this target (see the replay-and-target sketch after this list).
- Improvements: Prioritized Experience Replay, Double DQN.
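As a first sketch, semi-gradient TD(0) with linear function approximation. The environment interface (reset()/step(a) returning (s′, r, done)), the fixed evaluation policy, and the feature map phi are illustrative assumptions.

```python
import numpy as np

def linear_td0(env, policy, phi, k, alpha=0.05, gamma=0.99, episodes=500):
    """Semi-gradient TD(0) for v_theta(s) = theta @ phi(s) under a fixed policy."""
    theta = np.zeros(k)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * theta @ phi(s_next)
            td_error = target - theta @ phi(s)
            theta += alpha * td_error * phi(s)   # gradient of the estimate only ("semi-gradient")
            s = s_next
    return theta
```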
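LSTD(0) replaces the incremental update with a closed-form solve over a batch of transitions: accumulate A = Σ ϕ(s)(ϕ(s) − γϕ(s′))⊺ and b = Σ r ϕ(s), then solve Aθ = b. The sketch below assumes transitions are given as (s, r, s′) tuples and adds a small ridge term for numerical stability.

```python
import numpy as np

def lstd(transitions, phi, k, gamma=0.99, reg=1e-6):
    """LSTD(0): closed-form linear value estimate from a batch of (s, r, s') samples."""
    A = reg * np.eye(k)               # small ridge term keeps A well conditioned
    b = np.zeros(k)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)      # theta with v_theta(s) = theta @ phi(s)
```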
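Finally, a minimal sketch of the two DQN ingredients, experience replay and a target network, computing the TD targets yt = rt + γ max_b Qθ−(s′t, b). The buffer capacity and the assumption that the target network maps a batch of states to an array of action values are illustrative; this is not a full training loop.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions; uniform sampling
    breaks temporal correlations and lets each transition be reused."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def dqn_targets(batch, q_target, gamma=0.99):
    """TD targets y = r + γ max_b Q_target(s', b), with bootstrapping cut at terminals.

    q_target is assumed to map a batch of states to an array of shape
    (batch_size, n_actions) of action values from the slowly updated network θ−.
    """
    s, a, r, s_next, done = batch
    q_next = q_target(s_next)
    return r + gamma * (1.0 - done.astype(float)) * q_next.max(axis=1)
```

The main network θ is then regressed toward these targets on the sampled mini-batch, while θ− is only periodically synchronized with θ.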
3. Policy Search Approaches
These methods directly optimize the parameters θ of a policy πθ.
- Advantages: Can handle continuous action spaces, allows incorporating domain knowledge via policy structure, can be more stable than value-based methods in some cases.
- Model-Free vs. Model-Based:
- Model-Free: Updates policy parameters directly from sampled trajectories.
- Exploration: Achieved by sampling parameters or perturbing actions. Stochastic policies naturally explore.
- Evaluation: Can be step-based (low variance, uses Q-values or Monte Carlo) or episode-based (higher variance, uses full returns).
- Update Mechanisms:
- Policy Gradient (PG): Estimate the gradient ∇θJ(θ) via finite differences or the likelihood ratio (e.g., REINFORCE; a sketch follows this list). Natural Policy Gradient (NPG) uses the Fisher Information Matrix for more stable steps. Actor-Critic methods combine policy updates with learned value functions (e.g., DDPG, A3C).
- Inference-based: Frame as inference (e.g., EM), using Monte Carlo estimates (e.g., RWR, PoWER).
- Information-theoretic: Bound policy/trajectory distribution changes (e.g., REPS, TRPO).
- Stochastic Optimization: Use black-box optimizers (e.g., CEM, CMA-ES).
- Path Integral (PI): Optimize movement primitives (e.g., PI2).
- Model-Based: Learn a model of the environment dynamics and use it (often as a simulator) to optimize the policy.
- Model Learning: Often uses probabilistic models (e.g., Gaussian Processes, DBNs for factored MDPs) to handle uncertainty and stochasticity.
- Long-Term Prediction: Multi-step predictions made with the learned model can be biased. Techniques like PEGASUS (using fixed random seeds) or deterministic approximations help.
- Policy Update: Can use gradient-free, sampling-based, or analytical gradients (if model/policy are differentiable).
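As an illustration of the likelihood-ratio estimator, below is a minimal REINFORCE sketch for a discrete-action softmax policy that is linear in features ϕ(s, a). The environment interface, the feature map phi_sa, and the softmax parameterization are assumptions made for the example.

```python
import numpy as np

def softmax_policy(theta, phi_sa, s, n_actions):
    """pi_theta(a|s) proportional to exp(theta @ phi_sa(s, a)) for discrete actions."""
    prefs = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
    prefs -= prefs.max()                      # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce(env, phi_sa, k, n_actions, alpha=0.01, gamma=0.99, episodes=2000):
    """REINFORCE: likelihood-ratio policy gradient estimated from sampled episodes."""
    theta = np.zeros(k)
    for _ in range(episodes):
        s, done, traj = env.reset(), False, []
        while not done:
            p = softmax_policy(theta, phi_sa, s, n_actions)
            a = np.random.choice(n_actions, p=p)
            s_next, r, done = env.step(a)
            traj.append((s, a, r, p))
            s = s_next
        G = 0.0
        for s, a, r, p in reversed(traj):     # returns computed backwards through the episode
            G = r + gamma * G
            # grad log pi for a softmax-linear policy: phi(s, a) - E_pi[phi(s, .)]
            expected_phi = sum(p[b] * phi_sa(s, b) for b in range(n_actions))
            theta += alpha * G * (phi_sa(s, a) - expected_phi)
    return theta
```

Subtracting a baseline (e.g., a learned value function, as in actor-critic methods) from the return G reduces the variance of this estimator without biasing it.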
4. Extensions
- Unknown Rewards / Reward Learning: Designing reward functions is hard.
- Inverse Reinforcement Learning (IRL): Learn a reward function Rθ(s,a) from expert demonstrations.
- Challenges: Ill-posed problem (degeneracy).
- Approaches: Feature matching (match the expected feature counts of the expert and learned policy; see the sketch at the end of this section), max-margin methods, structured classification, Bayesian IRL, Maximum Entropy IRL (chooses the least constrained reward function that explains the demonstrations).
- Learning from other feedback: Preferences, ratings, comparisons.
- Preference-Based RL: Defines optimality based on pairwise trajectory comparisons (P[hπ≿hπ′]≥P[hπ′≿hπ]), avoiding explicit reward functions. Requires handling potential preference cycles (e.g., using mixed strategies).
- Risk-Sensitive RL: Addresses limitations of risk-neutral expectation criterion.
- Criteria: Minimize failure probability, use risk-sensitive utility functions (e.g., exponential, or quadratic leading to a variance penalty), or optimize risk measures (e.g., Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR); see the CVaR sketch at the end of this section).
- Approaches: Policy gradient for CVaR, two-timescale methods for risk constraints.
- Challenge: The Bellman optimality principle may not hold; often requires state augmentation (e.g., including the cumulative reward in the state).
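To make feature matching concrete, the quantity being matched is the discounted feature expectation μ = E[∑_t γ^t ϕ(s_t)]. The sketch below estimates it empirically from expert demonstrations; the trajectory representation (lists of states) and the feature map ϕ are assumptions for the example.

```python
import numpy as np

def feature_expectations(demonstrations, phi, gamma=0.99):
    """Empirical discounted feature expectations mu = E[sum_t gamma^t phi(s_t)].

    demonstrations: list of expert trajectories, each a sequence of states.
    Feature-matching IRL searches for a reward/policy whose own mu matches this.
    """
    mu = np.zeros_like(phi(demonstrations[0][0]), dtype=float)
    for traj in demonstrations:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(demonstrations)
```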
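For the risk-sensitive criteria, a simple empirical estimate of CVaR from sampled returns clarifies the definitions: VaR_α is the α-quantile of the return distribution, and CVaR_α is the mean return over the worst α-fraction of outcomes. The level α = 0.05 below is just an example.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.05):
    """Empirical CVaR_alpha of sampled returns (higher returns are better).

    VaR_alpha is the alpha-quantile of the returns; CVaR_alpha averages the
    returns at or below that quantile, i.e. the worst alpha-fraction of outcomes.
    """
    returns = np.asarray(returns, dtype=float)
    var = np.quantile(returns, alpha)         # Value-at-Risk at level alpha
    tail = returns[returns <= var]
    return tail.mean()                        # Conditional Value-at-Risk
```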
5. Conclusion
RL is a powerful framework with growing applications, driven partly by Deep Learning advances. Key challenges remain, particularly sample and computational efficiency. Active research areas include leveraging structure, incorporating prior knowledge, transfer learning, lifelong learning, and multi-task learning to make RL more practical for complex, real-world domains.