- The paper demonstrates that diverse, rich environments enable the emergence of complex locomotion behaviors from simple rewards.
- The study employs distributed proximal policy optimization to efficiently train various agent morphologies in high-dimensional continuous control tasks.
- Empirical results show that agents acquire robust skills such as running, jumping, and turning, and that these skills transfer to unseen terrain and body perturbations.
Emergence of Locomotion Behaviours in Rich Environments
The paper "Emergence of Locomotion Behaviours in Rich Environments" investigates the potential for complex locomotion behaviors to emerge from simple reward functions when agents are trained in diverse and challenging environments. This work explores the hypothesis that environmental richness and diversity can compensate for the lack of intricate reward engineering, a common practice in reinforcement learning (RL) for continuous control tasks.
Introduction
Reinforcement learning has shown substantial success in domains where the reward functions are well-defined and aligned with the objectives, such as video games and board games. In continuous control tasks like locomotion, however, reward functions often need to be meticulously hand-crafted for each behavior. This requirement undercuts a central promise of the RL paradigm and raises questions about its efficacy in more general settings.
This paper aims to return to the core challenge of RL: enabling an agent to develop complex behaviors from simple rewards. It does so by introducing rich environments with varying levels of complexity and difficulty, and by validating the learned behaviors through empirical studies on novel locomotion tasks.
Methodology
The authors train agents in a variety of simulated environments using three body morphologies: a Planar Walker, a Quadruped, and a 3D Humanoid. The environments incorporate procedurally generated terrains featuring diverse obstacles such as gaps, hurdles, uneven ground, slalom walls, and platforms.
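The paper does not publish its terrain generator, but the sketch below illustrates how an obstacle course of this kind can be parameterized and sampled per episode. All names and ranges here (`TerrainConfig`, `sample_course`, the obstacle sizes) are hypothetical placeholders, not the paper's implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class TerrainConfig:
    # Hypothetical difficulty knobs; the paper's actual generator and ranges differ.
    max_gap_width: float = 0.5      # widest gap in metres
    max_hurdle_height: float = 0.3  # tallest hurdle in metres
    roughness: float = 0.05         # amplitude of height-field noise
    num_obstacles: int = 10

def sample_course(cfg, seed=None):
    """Sample a sequence of obstacles for one training episode."""
    rng = random.Random(seed)
    obstacle_types = ["gap", "hurdle", "platform", "slalom_wall", "rough_patch"]
    course = []
    for _ in range(cfg.num_obstacles):
        kind = rng.choice(obstacle_types)
        if kind == "gap":
            size = rng.uniform(0.1, cfg.max_gap_width)
        elif kind == "hurdle":
            size = rng.uniform(0.05, cfg.max_hurdle_height)
        else:
            size = rng.uniform(0.0, cfg.roughness)
        course.append({"type": kind, "size": size})
    return course

# A fresh, randomized course for each episode.
course = sample_course(TerrainConfig(), seed=42)
```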
The reward function is kept deliberately simple to emphasize the role of environmental complexity: its primary component is forward progress, expressed as a velocity term, with minor penalties for deviations from the track and for actuator effort.
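As a concrete illustration of such a reward, the function below combines a forward-velocity term with small penalty terms, mirroring the structure described above; the weights `w_vel`, `w_dev`, and `w_ctrl` are placeholder values rather than the paper's coefficients.

```python
import numpy as np

def simple_locomotion_reward(forward_velocity, lateral_deviation, height_deviation,
                             action, w_vel=1.0, w_dev=0.05, w_ctrl=0.001):
    """Velocity-driven reward with minor penalties, as a rough sketch.

    The weights are illustrative placeholders, not the published coefficients.
    """
    progress = w_vel * forward_velocity                                            # reward forward motion
    deviation_penalty = w_dev * (lateral_deviation ** 2 + height_deviation ** 2)   # stay on track and upright
    control_penalty = w_ctrl * float(np.sum(np.square(action)))                    # discourage large torques
    return progress - deviation_penalty - control_penalty
```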
Distributed Proximal Policy Optimization
To train agents efficiently, the paper introduces a distributed version of Proximal Policy Optimization (PPO), termed Distributed PPO (DPPO). This algorithm scales PPO to large, high-dimensional continuous control problems by distributing data collection and gradient computation across many workers and synchronizing their policy updates, keeping training both scalable and stable.
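The fragment below is a minimal sketch of the two ingredients this combines: a PPO surrogate loss and a synchronous averaging step over per-worker gradients. It uses the widely known clipped surrogate for brevity (the paper itself describes a KL-penalized variant), and the synchronization code stands in for the parameter-server or `torch.distributed` machinery a real implementation would use.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate (shown for brevity; the paper describes a
    KL-penalized objective). Inputs are per-timestep 1-D tensors."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

def synchronised_update(policy, worker_grads, optimizer):
    """Sketch of DPPO's synchronous step: average gradients computed by several
    data-collecting workers, then apply one update to the shared policy.
    `worker_grads` is a list (one entry per worker) of per-parameter gradients."""
    for param, grads in zip(policy.parameters(), zip(*worker_grads)):
        param.grad = torch.stack(grads).mean(dim=0)
    optimizer.step()
    optimizer.zero_grad()
```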
Results
The experiments demonstrate that agents trained in these environments develop robust locomotion skills such as running, jumping, crouching, and turning, without explicit reward engineering for each behavior. The paper also reports on the benefit of environments with an implicit curriculum: terrains whose difficulty gradually increases along the course. This arrangement accelerates learning and improves final performance compared to environments with static difficulty.
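A simple way to express such an implicit curriculum is to make obstacle difficulty a function of how far along the course the agent has progressed, so every episode begins with easy sections and only agents that advance meet the hard ones. The function below is a hypothetical sketch; the parameter names, ranges, and linear ramp are illustrative assumptions, not the paper's values.

```python
def difficulty_along_track(x, track_length=100.0, max_gap=0.9, max_hurdle=0.5):
    """Map distance x (metres) along the course to obstacle difficulty limits.

    A linear ramp is the simplest choice; the paper's courses vary their
    statistics in richer ways, and these values are placeholders.
    """
    fraction = min(max(x / track_length, 0.0), 1.0)
    return {
        "max_gap_width": fraction * max_gap,
        "max_hurdle_height": fraction * max_hurdle,
    }

# Early obstacles are mild, later ones approach the maximum difficulty.
print(difficulty_along_track(10.0))   # ~10% of maximum difficulty
print(difficulty_along_track(90.0))   # ~90% of maximum difficulty
```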
Empirical evidence shows that agents trained in diversified environments exhibit greater robustness and adaptability across unseen variations in conditions such as ground friction, surface perturbations (rumble strips), actuator strength, and incline.
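One way to probe this kind of robustness is to evaluate a fixed, trained policy while perturbing the simulator's physical parameters. The harness below is a hypothetical sketch: `run_episode` and its arguments stand in for whatever environment API is available, and the perturbation ranges are assumptions rather than the paper's test protocol.

```python
import random

def evaluate_robustness(run_episode, n_episodes=20, seed=0):
    """Average return of a fixed policy under randomized physical perturbations.

    `run_episode(friction, actuator_scale, incline)` is assumed to reset the
    simulator with the given parameters, roll out the trained policy, and
    return the episode return.
    """
    rng = random.Random(seed)
    returns = []
    for _ in range(n_episodes):
        friction = rng.uniform(0.5, 1.5)        # scale on nominal ground friction
        actuator_scale = rng.uniform(0.8, 1.2)  # scale on nominal actuator strength
        incline = rng.uniform(-10.0, 10.0)      # ground incline in degrees
        returns.append(run_episode(friction, actuator_scale, incline))
    return sum(returns) / len(returns)
```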
Analysis
The paper presents strong empirical results, particularly highlighting:
- Planar Walker's ability to jump over hurdles nearly as tall as its own body.
- Quadruped’s competence in navigating through a combination of obstacles and varied terrains.
- Humanoid’s success in acquiring sophisticated gaits that enable it to navigate through hurdles, gaps, and slalom walls.
Further, comparative analysis shows that training on varied terrains significantly enhances the robustness of the learned policies, reducing the likelihood of overfitting to specific idiosyncratic solutions.
Implications and Future Directions
The findings suggest that training agents in rich and varied environments can substantially reduce the need for complex reward engineering, facilitating the emergence of sophisticated behaviors from simple reward functions. This shift has significant implications for developing more general RL systems capable of handling diverse real-world tasks without extensive domain-specific tuning.
Theoretical implications include the potential for learning paradigms that prioritize environmental complexity and diversity over reward specificity. Practically, this approach could streamline the development of RL systems in robotic applications, where specifying detailed rewards for each desired behavior is often infeasible.
Future research could explore further optimizations in the curriculum design, more diverse and complex environments, and extending these methodologies to additional robotic tasks beyond locomotion. Additionally, integrating these findings with current advancements in hierarchical and multi-agent RL could unlock new levels of performance and capability in autonomously learning agents.
The work provides a compelling case for the strategic design of training environments to foster sophisticated and robust behaviors, marking a step forward in the field of reinforcement learning.