
Learning to Act without Actions (2312.10812v2)

Published 17 Dec 2023 in cs.LG and cs.AI

Abstract: Pre-training large models on vast amounts of web data has proven to be an effective approach for obtaining powerful, general models in domains such as language and vision. However, this paradigm has not yet taken hold in reinforcement learning. This is because videos, the most abundant form of embodied behavioral data on the web, lack the action labels required by existing methods for imitating behavior from demonstrations. We introduce Latent Action Policies (LAPO), a method for recovering latent action information, and thereby latent-action policies, world models, and inverse dynamics models, purely from videos. LAPO is the first method able to recover the structure of the true action space just from observed dynamics, even in challenging procedurally-generated environments. LAPO enables training latent-action policies that can be rapidly fine-tuned into expert-level policies, either offline using a small action-labeled dataset, or online with rewards. LAPO takes a first step towards pre-training powerful, generalist policies and world models on the vast amounts of videos readily available on the web.

Authors (2)
  1. Dominik Schmidt (7 papers)
  2. Minqi Jiang (31 papers)
Citations (17)

Summary

  • The paper introduces LAPO, a novel method that infers latent actions from observational data to enable effective policy learning.
  • It integrates an inverse dynamics model with a forward dynamics model in a vector-quantized latent space to predict state transitions.
  • Experimental results on the Procgen Benchmark demonstrate that LAPO learns interpretable action spaces and enables rapid fine-tuning to achieve expert-level performance.

Introduction

Deep Reinforcement Learning (RL) can learn policies for complex tasks, but it typically requires training data labeled with the actions taken or the rewards received. Meanwhile, the most abundant source of potential training data, such as internet videos, consists of action-free observations that lack the explicit labels traditional learning frameworks depend on.

Learning from Observations Alone

In this context, the paper introduces Latent Action Policies (LAPO), a method for learning directly from action-free demonstrations. LAPO infers latent actions, and from them latent-action policies, without any explicit action or reward labels. Its unsupervised training procedure couples an inverse dynamics model (IDM), which estimates the latent action taken between consecutive observations, with a forward dynamics model (FDM), which predicts the next observation from the current observation and the inferred latent action. The two models are linked through a vector-quantized latent space that serves as an information bottleneck, forcing the latent actions to carry exactly the information needed to predict state transitions.
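Conceptually, a single training step ties the two models together through the quantized latent action. The sketch below shows one minimal way this could be implemented in PyTorch; it is an illustration under stated assumptions rather than the authors' code, and the module names (IDM, FDM, VectorQuantizer), the MLP architectures, the use of flat observation vectors (the paper operates on image observations), and the loss weights are all simplifying choices made here.

```python
# Minimal sketch of a LAPO-style unsupervised training objective (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes: int = 64, dim: int = 32):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        d = torch.cdist(z, self.codebook.weight)            # (B, num_codes) distances
        idx = d.argmin(dim=-1)                               # discrete latent-action id
        z_q = self.codebook(idx)                             # (B, dim) selected code
        # codebook loss pulls codes toward encoder outputs; commitment loss does the reverse
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        # straight-through estimator: copy gradients through the quantization step
        z_q = z + (z_q - z).detach()
        return z_q, idx, vq_loss

class IDM(nn.Module):
    """Inverse dynamics: infer a latent action from (o_t, o_{t+1})."""
    def __init__(self, obs_dim: int, act_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))

    def forward(self, o_t, o_next):
        return self.net(torch.cat([o_t, o_next], dim=-1))

class FDM(nn.Module):
    """Forward dynamics: predict o_{t+1} from (o_t, latent action)."""
    def __init__(self, obs_dim: int, act_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                                 nn.Linear(256, obs_dim))

    def forward(self, o_t, z_q):
        return self.net(torch.cat([o_t, z_q], dim=-1))

def lapo_loss(idm, fdm, vq, o_t, o_next):
    z = idm(o_t, o_next)                  # continuous latent action
    z_q, _, vq_loss = vq(z)               # quantize: the information bottleneck
    o_pred = fdm(o_t, z_q)                # reconstruct the next observation
    return F.mse_loss(o_pred, o_next) + vq_loss
```

Intuitively, the bottleneck matters because, without quantization or some other capacity limit, the IDM could simply copy the entire next observation into the latent, leaving the FDM nothing to predict and the latent no reason to resemble an action.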

Experimental Results

The efficacy of LAPO was tested in procedurally generated environments from the Procgen Benchmark. The results show that LAPO learns interpretable latent action spaces whose structure mirrors the true action spaces, despite having no access to explicit action information. A latent-action policy trained by behavior cloning on these inferred latent actions can then be rapidly fine-tuned with standard RL methods to reach expert-level performance.
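As a rough illustration of this two-stage recipe (assumptions made here: a discrete codebook of latent actions, a discrete true action space, and cross-entropy objectives; none of these details are drawn from the paper's code), the behavior-cloning step and the subsequent decoding of latent actions into true actions could look like this:

```python
# Hedged sketch of the downstream recipe: behavior cloning in the latent action
# space, then a small decoder from latent actions to true actions learned from a
# modest action-labeled dataset. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def latent_bc_step(latent_policy: nn.Module, obs: torch.Tensor,
                   latent_code: torch.Tensor) -> torch.Tensor:
    """Clone the discrete latent actions inferred by the pre-trained IDM."""
    logits = latent_policy(obs)                     # (B, num_codes)
    return F.cross_entropy(logits, latent_code)     # latent_code: (B,) long tensor

def action_decode_step(decoder: nn.Module, z_q: torch.Tensor,
                       true_action: torch.Tensor) -> torch.Tensor:
    """Map quantized latent actions to true actions using a few labeled pairs."""
    logits = decoder(z_q)                           # (B, num_true_actions)
    return F.cross_entropy(logits, true_action)

# The composed policy (observation -> latent code -> decoded action) can then be
# fine-tuned online with a standard RL algorithm such as PPO.
```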

Conclusion and Future Directions

This research marks an important step toward using immense action-free datasets for pre-training and rapidly adapting RL policies. By bringing the unsupervised pre-training paradigm established in language and vision to RL, LAPO opens up possibilities beyond the constraints of action-labeled datasets. Scaling LAPO to more complex, multi-task environments remains a promising avenue for future work, laying the groundwork for extracting rich behavioral knowledge from vast reservoirs of observational data.