Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery
The paper "Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery" proposes an innovative approach to address key challenges in reinforcement learning (RL), specifically the manual design of reward functions. It introduces Dynamical Distance Learning (DDL), which automatically learns a measure of the expected time steps required to reach one state from another, termed as "dynamical distances." These distances effectively shape reward functions, facilitating the learning of complex tasks.
Summary of Core Contributions
DDL's main contribution is learning distance functions from interaction data, removing the reliance on manually engineered reward functions. This provides several advantages:
- Reward Shaping: DDL uses the learned dynamical distance to the goal as a dense shaped reward, yielding smoother convergence during RL training than sparse-reward baselines (see the sketch after this list).
- Semi-Supervised Learning: DDL operates in a semi-supervised setting where unlabeled interaction data is used to learn distances, and a small set of preference labels then specifies the task goal, eliminating the need for extensive human-labeled data.
- Unsupervised Skill Discovery: In the fully unsupervised regime, DDL autonomously discovers skills by proposing goals that are dynamically distant from the starting state, which doubles as an exploration strategy (also sketched below).
- Practical Implementation: Dynamical distances are estimated with plain supervised regression on policy rollouts, which the authors argue is more stable than the temporal-difference updates that typical model-free RL methods rely on.
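To make the reward-shaping and goal-selection steps above concrete, here is a hedged continuation of the earlier sketch. It assumes a trained `dist_net` as defined before; `propose_goal_unsupervised`, `propose_goal_semi_supervised`, and `shaped_reward` are hypothetical helper names, and the preference query is stubbed out as an arbitrary callable rather than a real human-labeling interface.

```python
import torch

def propose_goal_unsupervised(dist_net, start_state, candidate_states):
    """Unsupervised skill discovery: pick the candidate state that is
    dynamically farthest from the start state under the current policy."""
    with torch.no_grad():
        start = start_state.expand(candidate_states.shape[0], -1)
        dists = dist_net(start, candidate_states)
    return candidate_states[dists.argmax()]

def propose_goal_semi_supervised(candidate_states, preference_fn):
    """Semi-supervised variant: a preference label (here an arbitrary callable
    returning an index) picks the most task-relevant candidate state."""
    return candidate_states[preference_fn(candidate_states)]

def shaped_reward(dist_net, state, goal):
    """Dense shaped reward: the negative learned distance from state to goal."""
    with torch.no_grad():
        return -dist_net(state.unsqueeze(0), goal.unsqueeze(0)).item()
```

The shaped reward -d(s, g) is dense wherever the distance estimator generalizes, which is what lets the policy make progress even when the underlying task reward would be sparse.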
The authors demonstrate DDL in both real-world and simulated environments. A standout result is a robotic hand that learns to rotate a valve 180 degrees using raw visual input and only 10 preference labels.
Results and Numerical Insights
The authors provide empirical evidence that the method learns more efficiently than prior approaches. Specifically, on a 9-DoF robotic hand manipulation task, DDL performed comparably to baselines that had access to the ground-truth reward, despite never observing that reward itself. In simulated environments, DDL proved robust: its semi-supervised variant solved tasks with limited user input, and its unsupervised variant discovered effective skills autonomously.
Implications and Future Directions
From a theoretical perspective, DDL demonstrates the feasibility of using dynamical distances to address reward-specification challenges in RL, shifting the paradigm from manually crafting rewards to learning them through interaction. Its implications range from skill learning to more efficient exploration, shaping how autonomous systems are trained.
Practically, this method offers a meaningful advance for robotic applications, where designing reward functions is non-trivial. For future AI systems, leveraging dynamical distances could substantially improve problem-solving in uncertain, high-dimensional spaces.
A potential development is extending DDL to multi-task learning, where a single dynamical distance model could support diverse tasks. Another avenue is applying DDL in stochastic environments, where the dynamics are noisy, to see how well its deterministic assumptions hold.
In conclusion, DDL provides a promising and practical framework for overcoming reward design challenges in RL. It pushes the boundary towards more autonomous learning systems that effectively generalize across different domains and applications, advancing both the theoretical and applied branches of artificial intelligence.