World Model as a Graph: Learning Latent Landmarks for Planning
The paper "World Model as a Graph: Learning Latent Landmarks for Planning" presents a novel approach to integrating world models with planning in reinforcement learning through the use of latent graph structures. The authors propose a conceptualization of world models as graphs, with the introduction of the Greedy Latent Sparsification (GLS) algorithm to efficiently sample and utilize latent embeddings for clustering and planning. This technique is argued to enhance the robustness and efficacy of planning in environments necessitating longer-horizon reasoning.
Greedy Latent Sparsification (GLS)
GLS is at the core of the method developed in this paper. It seeds the clustering process with a greedy sampling strategy that selects latent embeddings maximally distant from one another in latent space, in the spirit of k-means++ initialization, which improves clustering effectiveness in high-dimensional spaces. This sampling mechanism is critical for training the latent clusters used as landmarks for planning at inference time, as it yields a more expressive coverage of the environment's dynamics.
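To make the selection rule concrete, here is a minimal farthest-point-style sketch in NumPy. It assumes Euclidean distances and a random initial pick; the function name and interface are illustrative stand-ins, not the authors' implementation, which operates on learned latent embeddings inside the training loop.

```python
import numpy as np

def greedy_latent_sparsification(embeddings, n_landmarks, rng=None):
    """Greedily select indices of `n_landmarks` embeddings that are far apart.

    Farthest-point-style selection: the first embedding is chosen at random,
    and each subsequent one maximizes the distance to its nearest
    already-selected embedding (an assumption on the exact rule).
    """
    rng = rng or np.random.default_rng()
    selected = [int(rng.integers(len(embeddings)))]
    # Distance from every embedding to its closest selected embedding so far.
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(n_landmarks - 1):
        next_idx = int(np.argmax(min_dist))       # farthest from current set
        selected.append(next_idx)
        new_dist = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)  # update nearest-selected distances
    return np.array(selected)

# Example: pick 50 landmark candidates from 10,000 random 16-D embeddings.
landmark_idx = greedy_latent_sparsification(np.random.randn(10000, 16), 50)
```

The selected embeddings can then serve as well-spread initial centroids for the clustering step, which is where the analogy to k-means++ initialization comes from.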
Graph Search with Soft Relaxations
For planning over the landmark graph, the method adapts the Floyd-Warshall algorithm. The authors replace the hard minimum in each relaxation step with a soft minimum, implemented as a softmax-weighted average, to mitigate the noise inherent in neural estimates of distances: a hard min over noisy estimates tends to latch onto spuriously small values, so the soft relaxation keeps multi-hop distances from being driven down by a few underestimated edges. This yields a more reliable picture of the graph's global structure while still allowing extended distances to be computed.
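The sketch below shows one way such a soft relaxation can be written, replacing the hard min in each Floyd-Warshall update with a temperature-controlled softmin over the two candidate path lengths. The temperature parameter and the exact form of the softmin are assumptions made for illustration; the paper parameterizes its relaxation in its own terms.

```python
import numpy as np

def softmin(a, b, temperature):
    """Softmax-weighted average of two candidate distance matrices (a soft min)."""
    stacked = np.stack([a, b], axis=0)
    weights = np.exp(-stacked / temperature)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * stacked).sum(axis=0)

def soft_floyd_warshall(dist, temperature=1.0):
    """Floyd-Warshall-style relaxation with the hard min replaced by a softmin.

    `dist[i, j]` is a learned (and therefore noisy) distance estimate between
    landmarks i and j. Averaging softly over the direct path and the path
    through k prevents a single underestimated edge from dominating the result
    the way a hard min would.
    """
    d = dist.copy()
    n = d.shape[0]
    for k in range(n):
        via_k = d[:, [k]] + d[[k], :]      # cost of i -> k -> j for all (i, j)
        d = softmin(d, via_k, temperature)  # soft relaxation instead of np.minimum
    return d
```

Lower temperatures recover behavior closer to the hard minimum, while higher temperatures smooth more aggressively over noisy edges; the right trade-off depends on how reliable the learned distance function is.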
Overarching Training Methodology
The paper details the overall training procedure that ties these components together, describing how the latent landmark graph, policy, value function, and distance function are initialized and then iteratively updated. The training schedule alternates episodic data collection with batches of gradient updates, drawing on a shared replay buffer to maintain sample efficiency across training iterations; a skeleton of this loop is sketched below.
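As an illustration of how these pieces can interleave, here is a hypothetical outer-loop skeleton. All of the callables (collect_episode, update_models, refresh_landmarks), the replay-buffer interface, and the schedule constants are stand-ins introduced for this sketch, not the paper's actual API or settings.

```python
def train(num_iterations, collect_episode, replay_buffer, update_models,
          refresh_landmarks, gradient_steps_per_episode=40,
          landmark_refresh_interval=10):
    """Illustrative outer training loop for a latent-landmark planner.

    `collect_episode()` rolls out the current policy and returns transitions,
    `update_models(batch)` takes one gradient step on the policy, value, and
    distance networks, and `refresh_landmarks(buffer)` re-runs the GLS-style
    sampling and clustering to rebuild the latent graph.
    """
    for it in range(num_iterations):
        # 1. Episodic sampling: interact with the environment.
        transitions = collect_episode()
        replay_buffer.extend(transitions)

        # 2. Several gradient updates per collected episode.
        for _ in range(gradient_steps_per_episode):
            batch = replay_buffer.sample()
            update_models(batch)

        # 3. Periodically rebuild the landmark graph from fresh embeddings.
        if it % landmark_refresh_interval == 0:
            refresh_landmarks(replay_buffer)
```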
Practical Observations and Implementation Details
The paper reports practical insights gained from implementing the method across environments such as the Ant-Maze and Fetch tasks. Key hyper-parameters include the ratio of environment steps to gradient steps, gradient-norm clipping values used for training stability, and how GLS is applied for initial and exploratory landmark placement. The reported results indicate improved sample efficiency and robustness compared to baseline methods.
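As a loose illustration of the kind of knobs involved, the snippet below shows a placeholder configuration and a clipped gradient step in PyTorch. The specific values and names are assumptions for illustration, not the settings reported in the paper.

```python
import torch

# Placeholder hyper-parameters (illustrative values, not the paper's settings).
config = {
    "env_steps_per_gradient_step": 2,   # ratio of interaction to updates
    "grad_norm_clip": 10.0,             # gradient-norm clipping for stability
    "num_landmarks": 50,                # size of the latent landmark graph
}

def clipped_gradient_step(loss, optimizer, parameters, clip_value):
    """One gradient update with norm clipping, a common stabilizer in off-policy RL."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(parameters, clip_value)
    optimizer.step()
```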
Implications and Future Directions
The proposed methodology of using world models as graph structures offers a promising avenue for improving planning in complex, high-dimensional reinforcement learning tasks. By effectively representing the environment as a latent graph, this approach has potential applications in improving decision-making in robotics and similar fields where spatial awareness and long-horizon planning are paramount. Future work could explore refining the GLS algorithm's efficiency, tackling real-time applications, and further integrating with other model-based reinforcement learning strategies to expand the utility of latent landmark planning.
In conclusion, the paper offers a valuable contribution to merging world models with planning, leveraging graph-based representations to broaden the applicability of reinforcement learning in complex, long-horizon domains.