This chapter provides a comprehensive overview of Reinforcement Learning (RL), a framework for adaptive control where agents learn optimal behaviors through trial-and-error interactions with an environment. It covers the foundational concepts, major algorithmic families, and important extensions relevant for practical applications.
1. Background
- Markov Decision Process (MDP): RL problems are typically formalized as MDPs, defined by states (S), actions (A), transition probabilities (T(s,a,s′)), a reward function (R(s,a)), a discount factor (γ), and a horizon (H).
- Goal: The objective is to find a policy π (a mapping from states to actions or action probabilities) that maximizes the expected discounted sum of future rewards (the return).
- Value Functions:
- State-value function vπ(s): Expected return starting from state s and following policy π.
- Action-value function Qπ(s,a): Expected return starting from state s, taking action a, and then following policy π.
- Bellman Equations: These provide recursive relationships for value functions, forming the basis for many RL algorithms.
- Bellman evaluation equation (for a given π): vπ(s) = R(s, π(s)) + γ ∑_{s′} T(s, π(s), s′) vπ(s′)
- Bellman optimality equation (for the optimal v∗): v∗(s) = max_a [ R(s, a) + γ ∑_{s′} T(s, a, s′) v∗(s′) ]
- Planning vs. Learning: Planning (e.g., Value Iteration, Policy Iteration) assumes a known MDP model. RL deals with unknown models, learning from interaction samples (s,a,r,s′). A minimal value-iteration sketch appears after this list.
- Core RL Algorithms (Tabular):
- TD(0): Estimates vπ using updates based on the temporal difference error: v(s)←v(s)+α(r+γv(s′)−v(s)).
- SARSA (On-Policy): Estimates Qπ using the update: Q(s,a)←Q(s,a)+α(r+γQ(s′,a′)−Q(s,a)), where a′ is the action actually taken in s′ by the current policy.
- Q-learning (Off-Policy): Estimates the optimal Q∗ using the update: Q(s,a) ← Q(s,a) + α(r + γ max_{a′} Q(s′,a′) − Q(s,a)). It learns the optimal policy regardless of the exploration policy used (see the tabular Q-learning sketch after this list).
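To make the planning case concrete, below is a minimal value-iteration sketch for a known tabular MDP. The array names, shapes, and convergence tolerance are illustrative assumptions, not notation fixed by the chapter.

```python
import numpy as np

def value_iteration(T, R, gamma=0.99, tol=1e-6):
    """Value iteration on a known MDP.

    T[s, a, s'] is the transition probability and R[s, a] the expected
    immediate reward; both are assumed given as NumPy arrays.
    """
    n_states, n_actions, _ = T.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * (T @ v)             # q[s, a] = R(s, a) + γ Σ_{s'} T(s, a, s') v(s')
        v_new = q.max(axis=1)               # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # optimal values and a greedy (optimal) policy
        v = v_new
```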
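The tabular Q-learning update can be sketched as follows. The environment interface (reset() returning an integer state, step(a) returning (s′, r, done)) and the ε-greedy exploration schedule are assumptions made for the example.

```python
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.99,
               epsilon=0.1, episodes=5000):
    """Tabular Q-learning with epsilon-greedy exploration (off-policy)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy; the update below targets the greedy policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])   # TD update toward r + γ max_a' Q(s', a')
            s = s_next
    return Q
```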
2. Value-Based Methods with Function Approximation
When state-action spaces are large, exact representation is infeasible. Function approximation parameterizes value functions (e.g., vθ(s) or Qθ(s,a)) and learns the parameters θ.
- Linear Function Approximation: vθ(s)=θ⊺ϕ(s) or Qθ(s,a)=θ⊺ϕ(s,a), using basis functions ϕ.
- Stochastic Gradient Descent (SGD) Methods:
- Bootstrapped Methods: Update parameters by minimizing the difference between the current estimate and a bootstrapped target (e.g., r+γvθ(s′)). Examples include TD, SARSA, and Q-learning with function approximation.
- Linear TD(0) update: θ ← θ + α(r + γ θ⊺ϕ(s′) − θ⊺ϕ(s)) ϕ(s) (implemented in the first sketch after this list).
- Residual Methods: Minimize the Bellman residual directly (e.g., vθ(s)−(r+γvθ(s′))). Requires careful handling due to correlation between vθ(s) and vθ(s′), often needing techniques like double sampling.
- Least-Squares Methods: Offer potentially faster convergence than SGD.
- LSTD (Least-Squares Temporal Difference): Finds a closed-form solution for linear approximation by minimizing the projected Bellman error. Batch method (see the LSTD sketch after this list).
- LSPI (Least-Squares Policy Iteration): Combines LSTD with policy improvement steps.
- Iterative Projected Fixed-Point Methods: Apply Bellman operators and project the result back onto the function approximator's space. Convergence relies on the composed operator being a contraction, which isn't guaranteed with approximation.
- FQI (Fitted Q-Iteration): A popular implementation using batch learning and regression algorithms (like trees or neural networks).
- Deep RL (Value-Based): Uses Deep Neural Networks (DNNs) as function approximators.
- Challenges: Data inefficiency, violation of i.i.d. data assumption for SGD, learning instability.
- DQN (Deep Q-Network): Addresses challenges using:
- Experience Replay: Stores transitions (s,a,r,s′) in a buffer and samples mini-batches randomly to break correlations and reuse data.
- Target Network: Uses a separate, slowly updated network (θ−) to generate stable TD targets: yt = rt + γ max_b Qθ−(s′t, b). The main network (θ) is updated towards this target (see the replay-and-target sketch after this list).
- Improvements: Prioritized Experience Replay, Double DQN.
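As a first sketch, semi-gradient TD(0) with linear function approximation. The environment interface (reset()/step(a) returning (s′, r, done)), the fixed evaluation policy, and the feature map phi are illustrative assumptions.

```python
import numpy as np

def linear_td0(env, policy, phi, k, alpha=0.05, gamma=0.99, episodes=500):
    """Semi-gradient TD(0) for v_theta(s) = theta @ phi(s) under a fixed policy."""
    theta = np.zeros(k)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * theta @ phi(s_next)
            td_error = target - theta @ phi(s)
            theta += alpha * td_error * phi(s)   # gradient of the estimate only ("semi-gradient")
            s = s_next
    return theta
```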
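LSTD(0) replaces the incremental update with a closed-form solve over a batch of transitions: accumulate A = Σ ϕ(s)(ϕ(s) − γϕ(s′))⊺ and b = Σ r ϕ(s), then solve Aθ = b. The sketch below assumes transitions are given as (s, r, s′) tuples and adds a small ridge term for numerical stability.

```python
import numpy as np

def lstd(transitions, phi, k, gamma=0.99, reg=1e-6):
    """LSTD(0): closed-form linear value estimate from a batch of (s, r, s') samples."""
    A = reg * np.eye(k)               # small ridge term keeps A well conditioned
    b = np.zeros(k)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)      # theta with v_theta(s) = theta @ phi(s)
```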
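Finally, a minimal sketch of the two DQN ingredients, experience replay and a target network, computing the TD targets yt = rt + γ max_b Qθ−(s′t, b). The buffer capacity and the assumption that the target network maps a batch of states to an array of action values are illustrative; this is not a full training loop.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions; uniform sampling
    breaks temporal correlations and lets each transition be reused."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def dqn_targets(batch, q_target, gamma=0.99):
    """TD targets y = r + γ max_b Q_target(s', b), with bootstrapping cut at terminals.

    q_target is assumed to map a batch of states to an array of shape
    (batch_size, n_actions) of action values from the slowly updated network θ−.
    """
    s, a, r, s_next, done = batch
    q_next = q_target(s_next)
    return r + gamma * (1.0 - done.astype(float)) * q_next.max(axis=1)
```

The main network θ is then regressed toward these targets on the sampled mini-batch, while θ− is only periodically synchronized with θ.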
3. Policy Search Approaches
These methods directly optimize the parameters θ of a policy πθ.
- Advantages: Can handle continuous action spaces, allows incorporating domain knowledge via policy structure, can be more stable than value-based methods in some cases.
- Model-Free vs. Model-Based:
- Model-Free: Updates policy parameters directly from sampled trajectories.
- Exploration: Achieved by sampling parameters or perturbing actions. Stochastic policies naturally explore.
- Evaluation: Can be step-based (low variance, uses Q-values or Monte Carlo) or episode-based (higher variance, uses full returns).
- Update Mechanisms:
- Policy Gradient (PG): Estimate the gradient ∇θJ(θ) via finite differences or the likelihood ratio (e.g., REINFORCE; a sketch follows this list). Natural Policy Gradient (NPG) uses the Fisher Information Matrix for more stable steps. Actor-Critic methods combine policy updates with learned value functions (e.g., DDPG, A3C).
- Inference-based: Frame as inference (e.g., EM), using Monte Carlo estimates (e.g., RWR, PoWER).
- Information-theoretic: Bound policy/trajectory distribution changes (e.g., REPS, TRPO).
- Stochastic Optimization: Use black-box optimizers (e.g., CEM, CMA-ES).
- Path Integral (PI): Optimize movement primitives (e.g., PI2).
- Model-Based: Learn a model of the environment dynamics and use it (often as a simulator) to optimize the policy.
- Model Learning: Often uses probabilistic models (e.g., Gaussian Processes, DBNs for factored MDPs) to handle uncertainty and stochasticity.
- Long-Term Prediction: Multi-step predictions made with the learned model can be biased. Techniques like PEGASUS (using fixed random seeds) or deterministic approximations help.
- Policy Update: Can use gradient-free, sampling-based, or analytical gradients (if model/policy are differentiable).
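As an illustration of the likelihood-ratio estimator, below is a minimal REINFORCE sketch for a discrete-action softmax policy that is linear in features ϕ(s, a). The environment interface, the feature map phi_sa, and the softmax parameterization are assumptions made for the example.

```python
import numpy as np

def softmax_policy(theta, phi_sa, s, n_actions):
    """pi_theta(a|s) proportional to exp(theta @ phi_sa(s, a)) for discrete actions."""
    prefs = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
    prefs -= prefs.max()                      # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce(env, phi_sa, k, n_actions, alpha=0.01, gamma=0.99, episodes=2000):
    """REINFORCE: likelihood-ratio policy gradient estimated from sampled episodes."""
    theta = np.zeros(k)
    for _ in range(episodes):
        s, done, traj = env.reset(), False, []
        while not done:
            p = softmax_policy(theta, phi_sa, s, n_actions)
            a = np.random.choice(n_actions, p=p)
            s_next, r, done = env.step(a)
            traj.append((s, a, r, p))
            s = s_next
        G = 0.0
        for s, a, r, p in reversed(traj):     # returns computed backwards through the episode
            G = r + gamma * G
            # grad log pi for a softmax-linear policy: phi(s, a) - E_pi[phi(s, .)]
            expected_phi = sum(p[b] * phi_sa(s, b) for b in range(n_actions))
            theta += alpha * G * (phi_sa(s, a) - expected_phi)
    return theta
```

Subtracting a baseline (e.g., a learned value function, as in actor-critic methods) from the return G reduces the variance of this estimator without biasing it.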
4. Extensions
- Unknown Rewards / Reward Learning: Designing reward functions is hard.
- Inverse Reinforcement Learning (IRL): Learn a reward function Rθ(s,a) from expert demonstrations.
- Challenges: Ill-posed problem (degeneracy).
- Approaches: Feature matching (match the expected feature counts of the expert and learned policy; see the sketch at the end of this section), max-margin methods, structured classification, Bayesian IRL, Maximum Entropy IRL (chooses the least constrained reward function that explains the demonstrations).
- Learning from other feedback: Preferences, ratings, comparisons.
- Preference-Based RL: Defines optimality based on pairwise trajectory comparisons (P[hπ≿hπ′]≥P[hπ′≿hπ]), avoiding explicit reward functions. Requires handling potential preference cycles (e.g., using mixed strategies).
- Risk-Sensitive RL: Addresses limitations of risk-neutral expectation criterion.
- Criteria: Minimize failure probability, use risk-sensitive utility functions (e.g., exponential, or quadratic leading to a variance penalty), or optimize risk measures (e.g., Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR); see the CVaR sketch at the end of this section).
- Approaches: Policy gradient for CVaR, two-timescale methods for risk constraints.
- Challenge: The Bellman optimality principle may not hold; often requires state augmentation (e.g., including the cumulative reward in the state).
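To make feature matching concrete, the quantity being matched is the discounted feature expectation μ = E[∑_t γ^t ϕ(s_t)]. The sketch below estimates it empirically from expert demonstrations; the trajectory representation (lists of states) and the feature map ϕ are assumptions for the example.

```python
import numpy as np

def feature_expectations(demonstrations, phi, gamma=0.99):
    """Empirical discounted feature expectations mu = E[sum_t gamma^t phi(s_t)].

    demonstrations: list of expert trajectories, each a sequence of states.
    Feature-matching IRL searches for a reward/policy whose own mu matches this.
    """
    mu = np.zeros_like(phi(demonstrations[0][0]), dtype=float)
    for traj in demonstrations:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(demonstrations)
```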
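For the risk-sensitive criteria, a simple empirical estimate of CVaR from sampled returns clarifies the definitions: VaR_α is the α-quantile of the return distribution, and CVaR_α is the mean return over the worst α-fraction of outcomes. The level α = 0.05 below is just an example.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.05):
    """Empirical CVaR_alpha of sampled returns (higher returns are better).

    VaR_alpha is the alpha-quantile of the returns; CVaR_alpha averages the
    returns at or below that quantile, i.e. the worst alpha-fraction of outcomes.
    """
    returns = np.asarray(returns, dtype=float)
    var = np.quantile(returns, alpha)         # Value-at-Risk at level alpha
    tail = returns[returns <= var]
    return tail.mean()                        # Conditional Value-at-Risk
```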
5. Conclusion
RL is a powerful framework with growing applications, driven partly by Deep Learning advances. Key challenges remain, particularly sample and computational efficiency. Active research areas include leveraging structure, incorporating prior knowledge, transfer learning, lifelong learning, and multi-task learning to make RL more practical for complex, real-world domains.