
Reinforcement Learning

Published 29 May 2020 in cs.LG and stat.ML | arXiv:2005.14419v2

Abstract: Reinforcement learning (RL) is a general framework for adaptive control, which has proven to be efficient in many domains, e.g., board games, video games or autonomous vehicles. In such problems, an agent faces a sequential decision-making problem where, at every time step, it observes its state, performs an action, receives a reward and moves to a new state. An RL agent learns by trial and error a good policy (or controller) based on observations and numeric reward feedback on the previously performed action. In this chapter, we present the basic framework of RL and recall the two main families of approaches that have been developed to learn a good policy. The first one, which is value-based, consists of estimating the value of an optimal policy, a value from which a policy can be recovered, while the other, called policy search, directly works in a policy space. Actor-critic methods can be seen as a policy search technique where the policy value that is learned guides the policy improvement. In addition, we give an overview of some extensions of the standard RL framework, notably when risk-averse behavior needs to be taken into account or when rewards are not available or not known.


Summary

  • The paper presents a comprehensive exposition of reinforcement learning, covering foundational MDPs, value-based methods, policy search, and advanced extensions like inverse RL.
  • The paper outlines methodologies that leverage both linear and deep function approximation to stabilize Q-learning and address challenges such as sample efficiency and divergence.
  • The paper highlights the practical implications of risk-sensitive, multiobjective, and robust RL approaches in advancing the state-of-the-art in adaptive autonomous systems.

Reinforcement Learning: Foundations, Algorithms, and Extensions

Introduction

This chapter presents a comprehensive exposition of reinforcement learning (RL) as a framework for adaptive sequential decision-making. The focus encompasses the foundational Markov Decision Process (MDP) formalism, main algorithmic families (value-based and policy search), their realizations using function approximation and deep networks, and advanced topics including inverse RL, reward learning, and risk-sensitive approaches. The chapter also critically discusses sample efficiency and generalization challenges, providing technical guidance and highlighting ongoing research directions.

Markov Decision Processes and the Reinforcement Learning Setting

MDPs are defined by the tuple $\mathcal{M} = \langle S, A, T, R, \gamma, H \rangle$, capturing the state space $S$, action space $A$, stationary stochastic transition kernel $T$, reward function $R$, discount factor $\gamma$, and horizon $H$. Optimal control is formalized as maximizing the expectation of (possibly discounted) cumulative rewards, with optimality conditions encoded in the Bellman equations. Policy iteration and value iteration constitute the canonical algorithmic frameworks for known dynamics.
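
To make the dynamic-programming viewpoint concrete, the sketch below runs value iteration on a small tabular MDP with known transition and reward tensors; the array layout and stopping tolerance are illustrative assumptions rather than details taken from the chapter.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for a known MDP.

    T: transition tensor of shape (|S|, |A|, |S|), T[s, a, s'] = P(s' | s, a)
    R: reward matrix of shape (|S|, |A|)
    Returns an approximation of V* and the corresponding greedy policy.
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') V(s')
        Q = R + gamma * T @ V
        V_new = Q.max(axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    return V, Q.argmax(axis=1)
```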

In RL, the transition and/or reward functions are unknown; learning agents must optimize behavior through experience, incrementally estimating value functions and policies from sampled transitions. Convergence typically requires careful tuning of learning rates and exploration strategies, especially in stochastic or non-stationary environments.
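
As a minimal illustration of such incremental learning, the following sketch implements tabular Q-learning with epsilon-greedy exploration; the Gym-style reset/step interface and the constant learning rate are assumptions made here for concreteness.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (Gym-style env assumed)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Explore with probability epsilon, otherwise act greedily.
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else Q[s].argmax()
            s_next, r, done, _ = env.step(a)
            # Temporal-difference update toward the bootstrapped target.
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```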

Value-Based Methods and Function Approximation

Tabular methods suffice only for small-scale domains. Generalization to large or continuous domains necessitates function approximation: value functions $V$ and $Q$ are parameterized either linearly (basis expansion) or via nonlinear architectures (neural networks). Linear function approximation preserves certain theoretical convergence guarantees but is limited by basis expressivity. The semantics of gradient-based updates depend on the choice of loss: bootstrapped targets (temporal difference), residual methods (direct Bellman residual minimization), or least-squares projections (LSTD, LSPI).
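
The sketch below illustrates one of these choices, semi-gradient TD(0) with a linear value function; the feature map phi and the batch of policy-generated transitions are assumed inputs, not prescribed by the chapter.

```python
import numpy as np

def semi_gradient_td0(transitions, phi, n_features, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) for linear value estimation, V(s) ~= w . phi(s).

    transitions: iterable of (s, r, s_next, done) tuples collected under a fixed policy
    phi: feature map from states to length-n_features vectors (assumed given)
    """
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else w @ phi(s_next)
        # Bootstrapped target; the gradient flows only through V(s), not the target.
        td_error = r + gamma * v_next - w @ phi(s)
        w += alpha * td_error * phi(s)
    return w
```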

The LSPI approach, for example, enables efficient batch policy iteration with error bounds, provided the projected Bellman operator remains contractive in the projected space. However, the non-linearity and non-differentiability induced by the $\max$ operator in $Q$-learning remain a source of practical difficulty. The contraction property of dynamic programming does not always transfer to function approximation settings, leading to issues such as divergence and policy oscillation.
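
For concreteness, the sketch below shows the LSTD-Q solve that forms the inner step of an LSPI iteration; the state-action feature map, the ridge regularizer, and the greedy-policy callable are illustrative assumptions.

```python
import numpy as np

def lstdq(transitions, phi_sa, n_features, policy, gamma=0.99, ridge=1e-3):
    """One LSTD-Q solve, the inner step of LSPI.

    transitions: batch of (s, a, r, s_next) tuples
    phi_sa: state-action feature map (assumed given), returning length-n_features vectors
    policy: current greedy policy, mapping a state to an action
    Solves A w = b with A accumulating phi(s,a) (phi(s,a) - gamma * phi(s', pi(s')))^T.
    """
    A = ridge * np.eye(n_features)   # small ridge term for numerical stability
    b = np.zeros(n_features)
    for s, a, r, s_next in transitions:
        phi = phi_sa(s, a)
        phi_next = phi_sa(s_next, policy(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)     # weights of the approximate Q-function
```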

Value-Based Deep Reinforcement Learning

Value-based deep RL leverages expressive neural architectures to approximate $Q$-functions or value functions in high-dimensional environments. The Deep Q-Network (DQN) methodology integrates experience replay and periodically updated target networks to address the violation of the i.i.d. assumption and the non-stationarity of policy-induced data distributions. Experience replay buffers and frozen target networks implicitly decorrelate data, enhancing the stability of SGD updates.
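
A single DQN update step might look like the PyTorch sketch below; the network modules, the replay-batch format, and the Huber loss are assumptions made for illustration, not a faithful reproduction of the original DQN implementation.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN gradient step on a replay batch (networks and buffer assumed given).

    batch: (states, actions, rewards, next_states, dones) as tensors,
    with actions an int64 tensor and dones a float mask.
    """
    states, actions, rewards, next_states, dones = batch
    # Q-values of the actions actually taken in the replayed transitions.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target uses the frozen target network for stability.
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```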

Extensions such as prioritized experience replay and double Q-learning further address overestimation bias and inefficient sampling. Notably, DQN and its variants have demonstrated the ability to reach or surpass human-level performance on complex visual control benchmarks (e.g., Atari 2600). However, high sample complexity and sensitivity to hyperparameters remain limiting factors.
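
Double Q-learning changes only the way targets are computed: the online network selects the greedy action while the target network evaluates it. A hedged sketch of this target computation, using the same assumed batch format as above:

```python
import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double-DQN targets: the online network chooses the action,
    the target network evaluates it, which reduces overestimation bias."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```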

Policy Search Approaches

Policy search (direct policy optimization) circumvents value estimation and acts directly in the policy parameter space (parameterized by $\theta$). Model-free methods include stochastic policy gradients (REINFORCE, G(PO)MDP, NPG, DPG) and black-box search (CMA-ES, CEM, NES), with variance-reduction and trust-region constraints (TRPO) now standard for robust gradient estimation. Actor-critic architectures combine value and policy updates and are well-suited to both discrete and continuous action spaces. Asynchronous parallelism has further improved wall-clock efficiency and robustness to local optima.
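
A minimal REINFORCE update, the simplest member of this family, is sketched below in PyTorch; the policy network, the episode format, and the mean-return baseline are assumptions chosen for brevity rather than details from the chapter.

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, episode, gamma=0.99):
    """One REINFORCE (Monte-Carlo policy gradient) update from a completed episode.

    episode: list of (state_tensor, action, reward) tuples; policy_net maps a state
    to action logits. A mean-return baseline gives simple variance reduction.
    """
    # Compute discounted returns G_t from the end of the episode backwards.
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    returns = returns - returns.mean()          # baseline for variance reduction

    loss = 0.0
    for (state, action, _), g_t in zip(episode, returns):
        dist = Categorical(logits=policy_net(state))
        # Log-probability of the taken action, weighted by the (centered) return.
        loss = loss - dist.log_prob(torch.tensor(action)) * g_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```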

Model-based policy search applies surrogate world models (e.g., GPs, Bayesian networks) to decouple sample collection and policy optimization. This enables more data-efficient learning but introduces potential for model bias. Sophisticated techniques—such as PEGASUS for variance reduction and PILCO for sample-efficient continuous control—exploit analytic gradients and uncertainty-aware planning for scalable synthesis.
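
The sketch below conveys the model-based idea in its simplest form, planning by random shooting through a learned dynamics model; it is a deliberately simplified stand-in for methods such as PILCO, and both the dynamics model and the reward function are assumed to be supplied.

```python
import numpy as np

def random_shooting(dynamics_model, reward_fn, state, action_dim,
                    horizon=20, n_candidates=500, action_scale=1.0):
    """Plan with a learned dynamics model by sampling random action sequences
    and returning the first action of the highest-return simulated rollout.

    dynamics_model(s, a) -> predicted next state; reward_fn(s, a) -> scalar reward.
    """
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        actions = action_scale * np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = dynamics_model(s, a)   # roll the model forward instead of the real system
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action
```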

Advanced and Extended RL Settings

Reward Learning and Inverse RL

Specialized techniques are required when the reward function is unknown or ill-defined. Apprenticeship learning and inverse RL (IRL) focus on inferring reward functions from expert demonstrations, framing policy induction as an inverse problem. Parametric (often linear) reward models admit efficient feature-matching and max-margin formulations, yet IRL is fundamentally ill-posed and must address degeneracy and identifiability (e.g., via maximum entropy regularization).
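
The feature-expectation-matching idea behind linear-reward IRL can be sketched as follows; the inner policy-optimization oracle (sample_policy_trajs) and the state feature map are assumptions, and the loop is a schematic gradient ascent rather than any specific published algorithm.

```python
import numpy as np

def feature_matching_irl(expert_trajs, sample_policy_trajs, phi, n_features,
                         n_iters=100, lr=0.05, gamma=0.99):
    """Linear-reward IRL by matching discounted feature expectations.

    expert_trajs: list of expert state trajectories
    sample_policy_trajs: callable taking reward weights w and returning trajectories
        of an (approximately) optimal policy under reward r(s) = w . phi(s)
    phi: state feature map (assumed given)
    """
    def feature_expectation(trajs):
        mu = np.zeros(n_features)
        for traj in trajs:
            for t, s in enumerate(traj):
                mu += (gamma ** t) * phi(s)
        return mu / len(trajs)

    mu_expert = feature_expectation(expert_trajs)
    w = np.zeros(n_features)
    for _ in range(n_iters):
        mu_learner = feature_expectation(sample_policy_trajs(w))
        # Move reward weights toward features the expert visits more than the learner.
        w += lr * (mu_expert - mu_learner)
    return w
```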

Recent contributions utilize Bayesian priors, interactive query selection, preference-based elicitation, and deep architectures for high-dimensional or weakly supervised settings. Robust reward learning is essential for real-world deployment in safety-critical or poorly specified domains, but algorithmic guarantees are generally weaker than for standard RL.

Preference-Based and Multiobjective RL

Preference-based RL dispenses with scalar rewards and instead operates on pairwise or ordinal preferences, often yielding more flexible optimization criteria (e.g., quantile-based, Gini, or social choice-inspired preference models). Policy improvement in these frameworks may lack transitivity or lead to cyclical preferences; Nash equilibrium interpretations and mixed policies can circumvent the absence of convexity and enable well-defined optima.

Multiobjective RL extends the formalism for settings where multiple (potentially conflicting) criteria are optimized. Pareto efficiency, compromise programming, and fairness constraints constitute key principles, but their integration into scalable, policy-gradient-based frameworks remains open.

Risk-Sensitive and Robust RL

Standard RL is expectation-centric and thus risk-neutral. In applications where tail events or variance matter (finance, safety), risk-sensitive formulations are essential. Approaches range from augmenting rewards with variance penalties to optimizing Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR), or coherent risk measures. The failure of the Bellman principle of optimality in most risk-sensitive MDPs (optimal policies need not be composed of optimal sub-policies) can be mitigated by augmenting the state space; actor-critic and policy-gradient variants for CVaR objectives and constraints are now well developed but incur substantial computational overhead.
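
As a small illustration, the sketch below computes an empirical CVaR from sampled episode returns and combines it with the mean return into a risk-sensitive objective; the particular trade-off form and the risk-aversion weight are illustrative assumptions, not the chapter's formulation.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.05):
    """Empirical CVaR_alpha of sampled episode returns: the mean of the worst
    alpha-fraction of outcomes (lower tail, as relevant for risk-averse RL)."""
    returns = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

def risk_sensitive_objective(returns, alpha=0.05, lam=1.0):
    """Illustrative objective trading off mean return against tail risk,
    J = E[G] + lam * CVaR_alpha(G), for a risk-aversion weight lam >= 0."""
    returns = np.asarray(returns)
    return returns.mean() + lam * empirical_cvar(returns, alpha)
```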

Open Challenges and Future Directions

Despite recent empirical successes, RL remains hampered by high sample complexity and poor reproducibility. Advances in exploiting problem structure (factored states, hierarchical or temporal abstraction, transfer learning) are vital for realistic applications. Integration of rich prior knowledge via logical constraints or expert advice, lifelong and multitask learning protocols, and curriculum learning provide promising paths forward.

Robust generalization beyond the narrow distribution of training experience requires not only better function approximation and exploration policies but also rigorous definitions of distributional robustness, safety, and fairness in sequential decision-making.

Conclusion

This chapter delineates the theoretical foundations and algorithmic landscape of reinforcement learning, highlighting the strengths and limitations of existing methodologies. As RL continues to intersect with deep learning, robust optimization, preference learning, and human-in-the-loop paradigms, it is poised to enable more general and adaptive autonomous agents. However, fundamental limitations around sample efficiency, reward specification, and theoretical guarantees continue to motivate research at the intersection of control theory, operations research, and statistical machine learning.
