
BaRC: Backward Reachability Curriculum for Robotic Reinforcement Learning (1806.06161v2)

Published 16 Jun 2018 in cs.RO, cs.LG, and cs.SY

Abstract: Model-free Reinforcement Learning (RL) offers an attractive approach to learn control policies for high-dimensional systems, but its relatively poor sample complexity often forces training in simulated environments. Even in simulation, goal-directed tasks whose natural reward function is sparse remain intractable for state-of-the-art model-free algorithms for continuous control. The bottleneck in these tasks is the prohibitive amount of exploration required to obtain a learning signal from the initial state of the system. In this work, we leverage physical priors in the form of an approximate system dynamics model to design a curriculum scheme for a model-free policy optimization algorithm. Our Backward Reachability Curriculum (BaRC) begins policy training from states that require a small number of actions to accomplish the task, and expands the initial state distribution backwards in a dynamically-consistent manner once the policy optimization algorithm demonstrates sufficient performance. BaRC is general, in that it can accelerate training of any model-free RL algorithm on a broad class of goal-directed continuous control MDPs. Its curriculum strategy is physically intuitive, easy-to-tune, and allows incorporating physical priors to accelerate training without hindering the performance, flexibility, and applicability of the model-free RL algorithm. We evaluate our approach on two representative dynamic robotic learning problems and find substantial performance improvement relative to previous curriculum generation techniques and naive exploration strategies.

Authors (5)
  1. Boris Ivanovic (62 papers)
  2. James Harrison (44 papers)
  3. Apoorva Sharma (24 papers)
  4. Mo Chen (95 papers)
  5. Marco Pavone (314 papers)
Citations (54)

Summary

Reinforcement Learning (RL) offers a powerful approach for learning control policies for complex robotic tasks. However, a major bottleneck in applying model-free RL to robotics is its high sample complexity, particularly for tasks with sparse reward functions. Standard exploration strategies such as $\epsilon$-greedy action selection or injected action noise are inefficient in these scenarios, often failing to discover the goal region within a reasonable number of training episodes. Training in simulation helps but still requires millions of trials.

The paper "BaRC: Backward Reachability Curriculum for Robotic Reinforcement Learning" (Ivanovic et al., 2018 ) proposes Backward Reachability Curriculum (BaRC) to address this sample efficiency issue. BaRC leverages physical prior knowledge in the form of an approximate system dynamics model to generate a curriculum for any model-free RL algorithm. The core idea is to start training the policy from states that are dynamically close to the goal state and progressively expand the set of initial training states backward in time, guided by the system's dynamics.

How BaRC Works

BaRC operates as a wrapper around a standard model-free RL training loop. It structures the training process into stages, each defined by a set of initial states from which the policy is trained.

  1. Initialization: The curriculum begins with a set of initial states (starts) containing states very close to the goal region (e.g., a single goal state). An oldstarts buffer is also initialized with these goal states to prevent forgetting. The RL policy $\pi$ is randomly initialized.
  2. Curriculum Stage Expansion: At the beginning of each stage, the starts set is expanded backward in time using backward reachable set (BRS) computation based on an approximate system dynamics model (the ExpandBackwards function). The expanded set, starts_set, represents states from which the goal region can be reached within a short time horizon $T$ under the approximate dynamics. This step provides new, slightly harder states to train from.
  3. Policy Training Loop: Within each stage, the policy is trained iteratively (the TrainPolicy function) using a mix of initial states: $N_\text{new}$ states sampled uniformly from the newly expanded starts_set, and $N_\text{old}$ states sampled from the oldstarts buffer. This blending helps the policy learn from new, challenging states while reinforcing performance on states it has already mastered. TrainPolicy runs the chosen model-free RL algorithm (e.g., PPO) for $N_\text{TP}$ iterations using episodes starting from the sampled states, and returns the updated policy along with a map of success rates for the sampled initial states.
  4. State Selection and Buffer Update: After the $N_\text{TP}$ training iterations, sampled initial states whose success rate exceeds a threshold $C_\text{select}$ are added to the starts set used for the next curriculum stage expansion. These successful states are also added to the oldstarts buffer to be used in future training and to prevent catastrophic forgetting. The Select function performs this update.
  5. Stage Evaluation: The algorithm evaluates the policy's performance on the current starts_set (the Evaluate function). If the fraction of states in starts_set from which the policy can reach the goal exceeds a threshold $C_\text{pass}$, the current curriculum stage is considered mastered and the algorithm proceeds to the next stage (back to step 2). Otherwise, the policy training loop (steps 3-5) continues on the current stage's starts_set.
  6. Termination: The curriculum continues expanding and training until starts_set includes states from the problem's true initial state distribution $\rho_0$ and the policy masters this stage, or until a desired level of performance from $\rho_0$ is reached. A minimal code sketch of this outer loop follows the list.
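The stages above can be condensed into a short driver loop. The sketch below is illustrative rather than the paper's implementation: the callables expand_backwards, train_policy, evaluate, and rho0_sampler are hypothetical stand-ins for the ExpandBackwards, TrainPolicy, and Evaluate routines and the initial distribution $\rho_0$, and the hyperparameter defaults are placeholders, not the paper's values.

```python
import random

def barc(policy, goal_states, rho0_sampler, expand_backwards, train_policy,
         evaluate, n_new=10, n_old=5, c_pass=0.75, c_select=0.5, max_stages=100):
    """Illustrative BaRC outer loop built on hypothetical helper callables.

    expand_backwards(states) -> list of states in the approximate BRS of `states`
    train_policy(policy, start_states) -> (policy, {state: success_rate})
    evaluate(policy, states) -> fraction of `states` from which the goal is reached
    rho0_sampler() -> a state drawn from the true initial state distribution
    """
    starts = list(goal_states)       # current curriculum frontier (step 1)
    old_starts = list(goal_states)   # replay buffer guarding against forgetting

    for _ in range(max_stages):
        # Step 2: expand the frontier backward in time with the approximate model.
        starts_set = expand_backwards(starts)

        # Steps 3-5: train on a mix of new and old starts until the stage is mastered.
        while evaluate(policy, starts_set) < c_pass:
            new_starts = random.sample(starts_set, min(n_new, len(starts_set)))
            replay = random.sample(old_starts, min(n_old, len(old_starts)))
            policy, success_rates = train_policy(policy, new_starts + replay)

            # Step 4: keep states the policy already solves reliably.
            mastered = [s for s, rate in success_rates.items() if rate > c_select]
            if mastered:
                starts = mastered
                old_starts.extend(mastered)

        # Step 6: stop once the true initial distribution is covered and solved.
        if evaluate(policy, [rho0_sampler() for _ in range(20)]) >= c_pass:
            break
    return policy
```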

Backward Reachable Sets (BRS)

A BRS for a target set and time horizon $T$ is the set of all initial states from which there exists some control policy that drives the system into the target set within time $T$. The paper uses the Hamilton-Jacobi (HJ) formulation of reachability to compute BRSs. This involves solving a partial differential equation (PDE).
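In one common form from the HJ reachability literature (stated here as background, not necessarily in the paper's exact notation), the target set is encoded as the zero sublevel set of a function $l(x)$, and a value function $V$ is propagated backward in time by solving

$$
\frac{\partial V}{\partial t}(x,t) + \min\Big\{0,\; \min_{u \in \mathcal{U}} \nabla_x V(x,t) \cdot f(x,u)\Big\} = 0,
\qquad V(x,0) = l(x),
$$

where $\dot{x} = f(x,u)$ is the (approximate) dynamics; the inner minimum with zero enforces reaching the target at any time within the horizon. The BRS is then recovered as the sublevel set $\{\, x : V(x,-T) \le 0 \,\}$.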

The core challenge with HJ reachability is the computational cost, which grows exponentially with state space dimension. To make BRS computation practical for robotics, the authors propose two key strategies:

  1. Approximate Dynamics Model: Use a simplified, lower-dimensional, or linearized model of the system (the "curriculum model") for BRS computation, instead of the high-fidelity simulation model used for policy training.
  2. System Decomposition: Break down the curriculum model's dynamics into smaller, overlapping subsystems (e.g., 1D or 2D). Compute BRSs for these subsystems and combine them (often as outer approximations) to get an approximate BRS for the full system. This allows using computationally efficient methods like those in the open-source helperOC and Level Set Methods MATLAB toolboxes.
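As a rough illustration of how subsystem results might be combined (an assumption about implementation details; the helperOC and Level Set Methods toolboxes expose their own MATLAB interfaces, and this Python sketch is not their API): if each subsystem's BRS is stored as a gridded value function over its own subset of the state dimensions, a conservative membership test for the full state intersects the subsystem sublevel sets.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

class SubsystemBRS:
    """Gridded value function for one subsystem of the decomposed curriculum model."""

    def __init__(self, dims, grid_axes, values):
        self.dims = list(dims)  # indices of the full-state dimensions this subsystem uses
        # Points outside the grid are treated as outside the set (value +inf).
        self.value = RegularGridInterpolator(grid_axes, values,
                                             bounds_error=False, fill_value=np.inf)

    def contains(self, full_state):
        # Project the full state onto this subsystem and test the zero sublevel set.
        sub_state = np.asarray(full_state, dtype=float)[self.dims]
        return self.value(sub_state).item() <= 0.0

def in_approximate_brs(full_state, subsystems):
    """Outer approximation: a state is kept only if every subsystem BRS contains
    its projection (i.e., the intersection of the back-projected subsystem sets)."""
    return all(sub.contains(full_state) for sub in subsystems)
```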

The BRS computed this way provides a dynamically informed "frontier" from which to sample new training states. The set of initial states for a new curriculum stage is the union of the BRSs, computed backward over horizon $T$, of the states marked as successful in the previous stage. Sampling from the BRS is done via rejection sampling within bounding boxes around the BRS components, as sketched below.
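A minimal sketch of the rejection-sampling step, assuming a membership test like in_approximate_brs from the previous sketch; how the bounding boxes themselves are constructed is an implementation detail not specified in this summary.

```python
import numpy as np

def sample_from_brs(membership_fn, bounding_boxes, n_samples, max_tries=10000, rng=None):
    """Rejection-sample curriculum start states from an approximate BRS.

    membership_fn(state) -> bool, e.g. in_approximate_brs from the sketch above.
    bounding_boxes: list of (lower, upper) array pairs enclosing BRS components.
    """
    rng = rng or np.random.default_rng()
    samples = []
    tries = 0
    while len(samples) < n_samples and tries < max_tries:
        tries += 1
        # Draw uniformly inside a randomly chosen bounding box...
        lower, upper = bounding_boxes[rng.integers(len(bounding_boxes))]
        candidate = rng.uniform(lower, upper)
        # ...and keep it only if it actually lies inside the approximate BRS.
        if membership_fn(candidate):
            samples.append(candidate)
    return samples
```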

Practical Implementation Considerations

  • Curriculum Model: The quality of the curriculum model affects BRS accuracy but doesn't need to be perfect. It should capture the core nonlinear dynamics relevant to reaching the goal. The paper shows robustness to significant model mismatch.
  • BRS Computation Efficiency: Decomposition is crucial. The specific decomposition depends on the system dynamics. For the car model, they decompose a 5D system into 4D subsystems, which are further decomposed. For the quadrotor, a 6D system is decomposed into 2D and 1D subsystems.
  • Hyperparameter Tuning: The hyperparameters ($N_\text{new}$, $N_\text{old}$, $T$, $C_\text{pass}$, $C_\text{select}$, $N_\text{TP}$) are relatively intuitive. $T$ controls the difficulty increase per stage. $C_\text{select}$ and $C_\text{pass}$ define mastery. $N_\text{new}$ and $N_\text{old}$ balance exploration of new states against consolidation on learned states. $N_\text{TP}$ depends on the inner RL algorithm's convergence speed. The authors report robustness to these settings.
  • State Representation: The BRS is computed in the curriculum model's state space, which may be a projection or subset of the simulator state space, so a mapping between the two is needed (an illustrative mapping is sketched after this list).
  • Integration: BaRC is designed as a modular wrapper, requiring minimal modification to the chosen model-free RL algorithm. It primarily alters the initial state distribution used during training.
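For example, under the illustrative assumption that the planar quadrotor's observation (described below) concatenates its 6D state with the 8 range readings, the mapping can simply drop the components the curriculum model does not use:

```python
import numpy as np

# Hypothetical observation layout: the 6D quadrotor state
# (x, v_x, y, v_y, phi, omega) followed by 8 laser rangefinder readings.
CURRICULUM_STATE_DIM = 6

def sim_obs_to_curriculum_state(obs):
    """Project a simulator observation onto the curriculum model's state space
    by dropping the rangefinder readings the curriculum model does not use."""
    return np.asarray(obs, dtype=float)[:CURRICULUM_STATE_DIM]
```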

Experimental Results

BaRC was evaluated on two robotic environments:

  1. 5D Car Model: A standard non-holonomic car model with state $(x, y, \theta, v, \kappa)$ and controls $(a_v, a_\kappa)$. The goal is a specific state with non-zero velocity, and the reward is sparse (1.0 at the goal, 0 otherwise). A common form of these dynamics is sketched after this list.
    • BaRC rapidly increases the success rate from a diverse set of initial states, solving the task much faster than standard PPO and a random curriculum baseline (Bonomo et al., 2017).
    • The average reward during BaRC training shows a healthy pattern of increasing (policy learning) and decreasing (curriculum expansion) reward, indicating effective pacing.
    • BaRC demonstrates robustness to various model mismatches (velocity noise, control noise, oversteer).
  2. Planar Quadrotor Model: A 6D planar quadrotor with state $(x, v_x, y, v_y, \phi, \omega)$ and thrust controls, navigating cluttered obstacles. Observations include the state and 8 laser rangefinder readings. The goal is a target region ($x \ge 4$, $y \ge 4$). The reward is sparse (1000 at the goal) with control costs and collision penalties. This is a highly dynamic and unstable system.
    • Standard PPO and a random curriculum fail to reach the goal consistently, often learning local optima like hovering to avoid collisions. PPO with a smoothed quadratic reward also learns a sub-optimal local minimum.
    • BaRC successfully learns a policy to reach the goal within a few curriculum iterations. The average reward plot again shows the characteristic learning and expansion phases.
    • Although BRS computation adds overhead per iteration, the significant reduction in required RL iterations leads to substantial speedup in total wall clock time to task completion compared to baselines.
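For reference, a common form of such a curvature-augmented kinematic car model (an assumption based on standard formulations; the paper's exact equations may differ) is

$$
\dot{x} = v\cos\theta,\qquad \dot{y} = v\sin\theta,\qquad \dot{\theta} = v\kappa,\qquad \dot{v} = a_v,\qquad \dot{\kappa} = a_\kappa,
$$

where $a_v$ accelerates the forward velocity and $a_\kappa$ changes the path curvature.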

Conclusion

BaRC effectively addresses the sample efficiency challenge in sparse-reward robotic RL by intelligently shaping the training distribution using dynamically informed backward reachable sets computed with approximate models and decomposition techniques. It acts as a general wrapper around model-free RL algorithms, leveraging physical priors seamlessly. The experimental results show significant performance improvements in terms of sample complexity and wall clock time on representative dynamic robotic tasks, including those with unstable dynamics and sparse rewards that are intractable for standard methods. Future work includes exploring sampling-based BRS methods and integrating system identification to estimate curriculum models.