- The paper presents a novel RL method that redistributes rewards using sequence alignment to learn effectively from few high-reward demonstrations.
- It leverages multiple sequence alignment to identify and emphasize critical events, outperforming state-of-the-art approaches in sparse and delayed reward settings.
- Experimental results in gridworld and Minecraft demonstrate superior efficiency, highlighting potential advancements in hierarchical reinforcement learning.
Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
The paper presents Align-RUDDER, a reinforcement learning (RL) algorithm designed for tasks with sparse and delayed rewards in which only a few high-reward demonstrations are available. Align-RUDDER extends the RUDDER framework, which improves learning efficiency by redistributing reward to the critical steps of an episode. Sequence alignment techniques common in bioinformatics are repurposed to make this reward redistribution effective even when demonstrations are scarce, and the paper provides evidence that Align-RUDDER outperforms existing state-of-the-art methods in such scenarios.
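To make the core idea concrete, here is a minimal sketch of reward redistribution in the spirit of RUDDER: a single delayed return is converted into per-step rewards by taking differences of a return predictor's outputs, so credit moves to the steps where the predicted return jumps. The function name and the numbers are purely illustrative and not taken from the paper.

```python
# Minimal sketch of reward redistribution (illustrative, not the authors' code):
# a delayed episodic return is re-assigned to the steps at which a return
# predictor g "realizes" that the final return will be obtained.

def redistribute(episode_return, return_predictions):
    """Turn one delayed reward into per-step rewards.

    return_predictions[t] is g(s_0..s_t), a prediction of the final return
    given the sequence prefix up to step t.
    """
    redistributed = []
    prev = 0.0
    for g_t in return_predictions:
        redistributed.append(g_t - prev)  # credit = change in predicted return
        prev = g_t
    # Correction term so the redistributed rewards sum to the true return.
    redistributed[-1] += episode_return - sum(redistributed)
    return redistributed

# Example: reward of 1.0 arrives only at the end, but the predictor already
# "sees" success after the key step at t=2, so most credit moves there.
print(redistribute(1.0, [0.0, 0.1, 0.9, 0.9, 1.0]))
# -> [0.0, 0.1, 0.8, 0.0, 0.1]
```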
Approach and Methodology
Align-RUDDER leverages multiple sequence alignment (MSA) to derive a profile model from high-reward demonstrations. This contrasts with RUDDER's original LSTM-based return decomposition, which, like most deep learning models, typically requires large amounts of data to generalize. By identifying relevant events and redistributing rewards to those key sequence elements, Align-RUDDER frames reward redistribution as a sequence alignment problem, allowing it to operate effectively with minimal training data.
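As a rough preview of the event-definition step listed below, the following sketch clusters the states of the demonstrations with scikit-learn's KMeans and maps every demonstration to a sequence of cluster IDs ("events"). The function name, feature dimensionality, and number of clusters are illustrative assumptions rather than the paper's actual choices.

```python
# Hypothetical sketch: turn demonstrations into event sequences by clustering.
import numpy as np
from sklearn.cluster import KMeans

def demos_to_event_sequences(demonstrations, n_events=8, seed=0):
    """demonstrations: list of arrays, each of shape (T_i, d) holding
    state (or state-action) features. Returns one list of event IDs per
    demonstration, where an "event" is a cluster of similar features."""
    all_steps = np.vstack(demonstrations)
    kmeans = KMeans(n_clusters=n_events, random_state=seed, n_init=10).fit(all_steps)
    return [kmeans.predict(demo).tolist() for demo in demonstrations]

# Usage: three short demonstrations with 4-dimensional state features.
rng = np.random.default_rng(0)
demos = [rng.normal(size=(20, 4)) for _ in range(3)]
event_sequences = demos_to_event_sequences(demos, n_events=5)
print(event_sequences[0])  # e.g. [3, 1, 1, 4, ...] -- a sequence of events
```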
The core steps of Align-RUDDER's reward redistribution process are outlined as follows:
- Defining Events: High-reward demonstrations are transformed into sequences of events by clustering states or state-action pairs, so that significant changes along a trajectory become discrete, comparable symbols (as in the clustering sketch above).
- Determining the Scoring System: A scoring matrix is devised that rewards aligning identical events across demonstrations and assigns higher scores to rarer events, so that infrequent but significant events drive the alignment.
- Multiple Sequence Alignment (MSA): The MSA algorithm aligns multiple demonstrations, creating a consensus sequence that highlights shared strategy elements across different episodes.
- Profile Model and PSSM: A Position-Specific Scoring Matrix (PSSM) is constructed from the alignment, serving as a tool to score new sequence alignments.
- Reward Redistribution: New episodes are aligned to the profile model, and the episodic reward is redistributed along the sequence using the consensus and PSSM, emphasizing sub-tasks that correspond to high-reward sub-goals (a simplified sketch of the last three steps follows this list).
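As a simplified sketch of the last three steps, the code below builds a Position-Specific Scoring Matrix from already-aligned event sequences, extracts a consensus, and distributes an episodic return over the steps of a new episode according to how strongly each step matches the profile. It deliberately sidesteps the actual multiple sequence alignment (the demonstrations are assumed to be pre-aligned to equal length), and all names and the scoring scheme are illustrative assumptions, not the authors' implementation.

```python
# Simplified sketch of profile model, PSSM, and reward redistribution.
import math
from collections import Counter

def build_pssm(aligned_sequences, n_events, pseudocount=1.0):
    """PSSM: log-odds of observing each event at each aligned position."""
    background = 1.0 / n_events
    pssm = []
    for pos in range(len(aligned_sequences[0])):
        counts = Counter(seq[pos] for seq in aligned_sequences)
        total = len(aligned_sequences) + pseudocount * n_events
        pssm.append([
            math.log(((counts.get(e, 0) + pseudocount) / total) / background)
            for e in range(n_events)
        ])
    return pssm

def consensus(aligned_sequences):
    """Most frequent event per aligned position."""
    return [Counter(seq[pos] for seq in aligned_sequences).most_common(1)[0][0]
            for pos in range(len(aligned_sequences[0]))]

def redistribute_reward(episode_events, episode_return, pssm):
    """Distribute the episode return over steps in proportion to how well
    each step matches the profile (its positive PSSM score); steps that do
    not match any conserved event receive (almost) no credit."""
    scores = [pssm[min(t, len(pssm) - 1)][e] for t, e in enumerate(episode_events)]
    positive = [max(s, 0.0) for s in scores]   # ignore negative evidence
    total = sum(positive) or 1.0
    return [episode_return * s / total for s in positive]

# Usage with toy event sequences (already "aligned" to length 6).
demos = [[0, 2, 2, 5, 1, 3], [0, 2, 4, 5, 1, 3], [0, 2, 2, 5, 1, 3]]
pssm = build_pssm(demos, n_events=6)
print(consensus(demos))                                  # -> [0, 2, 2, 5, 1, 3]
print(redistribute_reward([0, 2, 2, 5, 1, 3], 1.0, pssm))
```

In this toy version the redistributed rewards still sum to the original return, so credit is merely moved toward the conserved, strategy-relevant events rather than added or removed.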
Experimental Results
The paper validates Align-RUDDER's efficacy through experiments on both artificial tasks and the complex Minecraft ObtainDiamond task. The artificial tasks are gridworld variants with delayed rewards and only a handful of demonstrations, where Align-RUDDER consistently outperformed methods such as BC+Q-Learning, DQfD, and SQIL across varying numbers of demonstrations and in stochastic environments. In the dynamic and intricate environment of Minecraft, Align-RUDDER succeeded in mining a diamond, making it one of the first learning-based methods to achieve this without relying on extensive intermediate rewards.
Theoretical and Practical Implications
Align-RUDDER not only demonstrates a significant practical advantage in specific RL tasks but also makes theoretical contributions to the field. It extends the traditional RUDDER framework by incorporating sequence alignment, improving the efficiency of reward redistribution under sparse demonstration conditions. The methodology can catalyze advancements in hierarchical reinforcement learning, particularly for decomposing tasks into manageable sub-tasks through the alignment of relevant events.
Conclusion and Future Directions
Align-RUDDER offers a promising direction for reinforcement learning research, particularly in settings with limited high-reward episodes. By effectively marrying sequence alignment with RL, it addresses existing challenges in sparse and delayed reward configurations. Future research may explore extending the sequence alignment approach to other domains and further refining the methodology to handle diverse RL environments and tasks efficiently. The application of profile models from bioinformatics to RL opens new paradigms for interpretation and generalization in AI systems, potentially enriching hierarchical task understanding and more robust model architectures.