
Model-free Reinforcement Learning for Robust Locomotion using Demonstrations from Trajectory Optimization (2107.06629v2)

Published 14 Jul 2021 in cs.RO, cs.AI, and cs.LG

Abstract: We present a general, two-stage reinforcement learning approach to create robust policies that can be deployed on real robots without any additional training using a single demonstration generated by trajectory optimization. The demonstration is used in the first stage as a starting point to facilitate initial exploration. In the second stage, the relevant task reward is optimized directly and a policy robust to environment uncertainties is computed. We demonstrate and examine in detail the performance and robustness of our approach on highly dynamic hopping and bounding tasks on a quadruped robot.

Citations (30)

Summary

  • The paper introduces a two-stage framework that uses trajectory optimization demonstrations to bootstrap deep reinforcement learning for robust quadruped locomotion.
  • The approach first uses efficient multi-contact pattern generation to design a nominal motion trajectory, then trains a DRL policy that imitates this trajectory and is subsequently refined on the task reward directly, removing its time-dependence.
  • Experimental results show the trained policies maintain performance and recover effectively from dynamic perturbations and environmental uncertainties.

Overview of Model-Free Reinforcement Learning for Robust Locomotion Using Demonstrations from Trajectory Optimization

This paper proposes a robust model-free reinforcement learning framework for generating control policies for dynamic locomotion tasks in legged robots, with a focus on a quadruped robot. The approach integrates trajectory optimization (TO) and deep reinforcement learning (DRL) to develop policies that are not only efficient to train but also robust to the uncertainties inherent in environmental interactions and contact dynamics.

Core Methodology

The methodology is bifurcated into two primary stages:

  1. Trajectory Optimization for Demonstration: First, trajectory optimization is used to derive a nominal motion trajectory for the task. This trajectory serves as the demonstration that guides initial exploration in DRL. Notably, the method leverages efficient multi-contact pattern generation, avoiding the need to solve high-dimensional, nonlinear optimization problems in real time.
  2. Deep Reinforcement Learning with Robust Policy Formation: Starting from the demonstrated trajectory, DRL learns the control policy in two sub-stages:
    • First, the policy is trained to replicate the demonstrated trajectory, encoding the behavior in a neural network.
    • Next, the policy is optimized directly on the task reward, enhancing robustness to realistic environmental uncertainties such as varied contact timing and surface characteristics. Significantly, this second phase removes the original time-dependence, which improves adaptability and robustness in real-world deployment (a minimal sketch of this two-stage reward scheme follows the list).
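
To make the stage structure concrete, the following is a minimal sketch of such a two-stage reward scheme. It is an illustration under assumptions, not the paper's implementation: the function names, gains, Gaussian reward shapes, and torque penalty are introduced here for clarity; only the overall structure, a time-indexed imitation reward followed by a time-independent task reward, mirrors the described approach.

```python
import numpy as np

def imitation_reward(q, t, demo_q, w=5.0):
    """Stage 1: track the demonstrated joint state at time index t (time-dependent)."""
    return float(np.exp(-w * np.sum((q - demo_q[t]) ** 2)))

def task_reward(base_height, torques, target_height=0.35, w_effort=1e-3):
    """Stage 2: reward the task directly; the demonstration clock no longer appears."""
    return float(np.exp(-50.0 * (base_height - target_height) ** 2)
                 - w_effort * np.sum(torques ** 2))

def reward(obs, stage, demo_q):
    """Switch from imitation to the time-independent task reward between stages."""
    if stage == 1:
        return imitation_reward(obs["q"], obs["t"], demo_q)
    return task_reward(obs["base_height"], obs["torques"])

# Dummy rollout data: 8 joints (as on Solo8), 100-step demonstration.
demo_q = np.zeros((100, 8))
obs = {"q": np.full(8, 0.05), "t": 10, "base_height": 0.33, "torques": np.zeros(8)}
print(reward(obs, stage=1, demo_q=demo_q), reward(obs, stage=2, demo_q=demo_q))
```

In the second stage, neither the reward nor the observation references the demonstration clock, which is what allows the learned policy to recover from perturbations that desynchronize it from the original timing.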

Experimental Evaluation

The approach was experimentally validated on dynamic hopping and bounding tasks using the Solo8 quadruped robot. TO-generated trajectories guided the DRL training, and the resulting policies performed well both in simulation and on the real robot. During deployment, the policies showed significant robustness, recovering from perturbations and terrain-induced disturbances without additional training, a property many existing DRL approaches do not achieve.

  • Quantitative performance: The experiments included challenges such as uneven surfaces and dynamic perturbations. The results show that the trained policies accommodate a range of initial conditions and disturbances while maintaining coherent task execution across scenarios.
  • Task adaptability: The framework allows the resulting motion to be modified (e.g., changing the hop height) simply by adjusting reward parameters in the task definition, underlining a core advantage of combining TO and DRL (see the sketch after this list).
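
As a hedged illustration of this parameter-level adaptability, the snippet below reuses the hypothetical Gaussian task-reward shape from the sketch above and evaluates it for two commanded hop heights; the numeric targets and reward form are assumptions for illustration, not values from the paper.

```python
import numpy as np

def hop_task_reward(apex_height, target_height):
    """Reward peaks when the measured apex height matches the commanded target."""
    return float(np.exp(-50.0 * (apex_height - target_height) ** 2))

measured_apex = 0.30  # metres, dummy measurement
for target in (0.25, 0.40):  # low hop vs. high hop (illustrative targets)
    print(f"target={target:.2f} m -> reward={hop_task_reward(measured_apex, target):.3f}")
```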

Theoretical Implications

Integrating trajectory optimization with reinforcement learning has useful theoretical implications for learning-based control. The two-stage process shows that combining trajectory-centric demonstrations with stochastic policy optimization can produce task-relevant behaviors efficiently while accommodating uncertainties that are typically difficult for purely DRL-based methods to handle.

Practical Implications and Future Directions

Practically, this framework could be pivotal in advancing the capabilities of autonomous robots operating in unpredictable environments, making deployment in scenarios such as search and rescue more plausible. Extending the approach to other forms of locomotion and other platforms, such as bipedal systems, or to contact-rich manipulation tasks, is a promising research direction. Further work on policy optimization that accounts for a broader range of real-world variability could extend the paper's contributions.

Conclusion

This paper delivers an innovative paradigm that combines model-based and model-free control to robustly handle environmental uncertainties in legged locomotion. The two DRL stages mitigate exploration difficulties by exploiting the demonstration-derived trajectory while still optimizing contact-rich dynamic behavior directly on the task reward. The approach is a notable step toward deploying reinforcement learning methods on legged robots in dynamic and unpredictable real-world scenarios.
