Skill-based Model-based Reinforcement Learning

Published 15 Jul 2022 in cs.LG, cs.AI, and cs.RO | (2207.07560v2)

Abstract: Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors by leveraging a learned single-step dynamics model to plan actions in imagination. However, planning every action for long-horizon tasks is not practical, akin to a human planning out every muscle movement. Instead, humans efficiently plan with high-level skills to solve complex tasks. From this intuition, we propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts the skill outcomes, rather than predicting all small details in the intermediate states, step by step. For accurate and efficient long-term planning, we jointly learn the skill dynamics model and a skill repertoire from prior experience. We then harness the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space, which enables efficient downstream learning of long-horizon, sparse reward tasks. Experimental results in navigation and manipulation domains show that SkiMo extends the temporal horizon of model-based approaches and improves the sample efficiency for both model-based RL and skill-based RL. Code and videos are available at https://clvrai.com/skimo

Abstract PDF Upgrade to Chat

Citations (38)

View on Semantic Scholar

Summary

The paper introduces SkiMo, a framework that uses a skill dynamics model to predict long-term outcomes, enhancing planning in reinforcement learning.
The study employs joint pre-training of skills and the dynamics model to achieve superior performance in navigation and manipulation tasks.
Experimental results show that SkiMo significantly improves sample efficiency and reliability in complex, sparse-reward environments.

Skill-based Model-based Reinforcement Learning

Introduction

The paper "Skill-based Model-based Reinforcement Learning" presents SkiMo, a framework designed to improve sample efficiency in reinforcement learning (RL) by combining model-based and skill-based approaches. This framework draws inspiration from human intelligence, where high-level skills rather than low-level actions are used for planning. SkiMo leverages a skill dynamics model to predict skill outcomes directly, enhancing the ability to perform accurate long-term planning. This paper demonstrates significant improvements in sample efficiency over existing model-based and skill-based RL techniques, particularly in tasks that require long-horizon and sparse reward structures.

Figure 1: Intelligent agents can use their internal models to imagine potential futures for planning, leading to improved predictions and planning efficiency.

Methodology

Skill Dynamics Model

SkiMo introduces a skill dynamics model, which is a cornerstone of the proposed methodology. Instead of simulating every possible low-level action, the skill dynamics model predicts the state after executing a sequence of high-level skills. This approach significantly reduces the number of predictions required, thereby minimizing error accumulation and enhancing long-term prediction accuracy.

Figure 2: SkiMo combines model-based RL and skill-based RL for efficient learning of long-horizon tasks, leveraging learned skills and skill dynamics.

Pre-Training Phase

During pre-training, SkiMo extracts a skill repertoire and a skill dynamics model from task-agnostic, offline data. The joint training of skills and the skill dynamics model shapes the skill embedding space, allowing for efficient prediction and skill execution. This method contrasts with others that train these components separately, suggesting that integration leads to more cohesive learning outcomes.

Figure 3: Pretraining in SkiMo jointly trains the model and policy to derive an effective skill space for planning.

Experimental Results

SkiMo was tested against several baseline algorithms across multiple domains, including navigation and manipulation tasks. The experimental results show that SkiMo consistently outperforms state-of-the-art RL methods in sample efficiency and effectiveness on tasks with long horizons and sparse rewards. For instance, in complex maze navigation tasks, SkiMo achieved more reliable goal attainment due to its superior planning and prediction techniques.

Figure 4: SkiMo outperforms other approaches in completing long-horizon tasks efficiently.

Implications and Future Directions

The implications of this research are significant for practical RL applications, particularly in robotics and automation, where tasks are typically long-horizon with sparse feedback. The ability to plan using temporally abstracted skills while maintaining high prediction accuracy opens pathways for deploying RL in complex real-world scenarios. Future work may explore extending SkiMo to real robots, where tasks involve high-dimensional sensory inputs such as RGB images and tactile feedback. Additionally, developing flexible semantic skill extraction could enhance planning capabilities further.

Conclusion

SkiMo presents a robust framework for skill-based and model-based RL, effectively merging the strengths of both approaches. The innovation of using a skill dynamics model for long-term planning marks a pivotal advancement in RL methodologies. By optimizing skill learning and execution, SkiMo offers a pathway towards more efficient and scalable RL, suitable for complex problem-solving in dynamic environments.