Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks (1903.03878v1)

Published 9 Mar 2019 in cs.LG, cs.CV, cs.RO, and stat.ML

Abstract: Many robotic applications require the agent to perform long-horizon tasks in partially observable environments. In such applications, decision making at any step can depend on observations received far in the past. Hence, being able to properly memorize and utilize the long-term history is crucial. In this work, we propose a novel memory-based policy, named Scene Memory Transformer (SMT). The proposed policy embeds and adds each observation to a memory and uses the attention mechanism to exploit spatio-temporal dependencies. This model is generic and can be efficiently trained with reinforcement learning over long episodes. On a range of visual navigation tasks, SMT demonstrates superior performance to existing reactive and memory-based policies by a margin.

Citations (188)

Summary

  • The paper presents a memory-based policy that stores each past observation separately and uses an attention mechanism over this memory for effective long-term navigation.
  • It demonstrates superior performance over RNN and reactive policies across tasks like roaming, coverage, and search in simulated indoor environments.
  • A novel memory factorization technique reduces the self-attention complexity from quadratic to linear, enabling scalable and efficient long-horizon task performance.

Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

The paper presents a memory-based policy architecture termed the Scene Memory Transformer (SMT), designed to enhance the performance of embodied agents in navigation tasks over long horizons in partially observable environments. These tasks often require the agents to make decisions based on observations received far in the past, thus necessitating robust mechanisms for memorizing and utilizing the long-term history of interactions with the environment.

Novelty and Model Description

The SMT distinguishes itself from traditional approaches by employing an attention mechanism, adapted from the Transformer models used in natural language processing, to effectively encode and aggregate past observations. Unlike recurrent neural network (RNN) policies, which amalgamate past observations into a fixed-size state vector, the SMT model maintains each observation separately in a scene memory. This allows the policy to retain all past observations without loss of potentially valuable information, which can be crucial for navigating complex environments.

The SMT comprises two main components: a memory module that embeds each observation into a memory set and a policy network that utilizes attention over the stored memory to formulate actions. The use of attention permits the model to efficiently learn and exploit spatio-temporal dependencies from the stored data without predetermining a fixed layout or structure of the environment, which is often the case with other memory models employing map-like representations.
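To make the two components concrete, here is a minimal PyTorch sketch of the architecture described above. It is a schematic under stated assumptions, not the authors' implementation: the single attention layer, the linear observation encoder, and all dimensions and names (SceneMemory, SMTPolicy) are illustrative stand-ins for the paper's deeper Transformer-style model.

```python
import torch
import torch.nn as nn


class SceneMemory:
    """Grows by one embedding per step; nothing is compressed or overwritten."""

    def __init__(self) -> None:
        self.embeddings: list[torch.Tensor] = []

    def add(self, emb: torch.Tensor) -> None:
        # Detach so stored embeddings do not retain the autograd graph
        # across time steps (a simplification for this sketch).
        self.embeddings.append(emb.detach())

    def as_tensor(self) -> torch.Tensor:
        return torch.stack(self.embeddings)  # (M, d_model)


class SMTPolicy(nn.Module):
    """Embeds the current observation, then attends over the whole memory."""

    def __init__(self, obs_dim: int, d_model: int, num_actions: int) -> None:
        super().__init__()
        self.encoder = nn.Linear(obs_dim, d_model)  # observation embedding
        self.attention = nn.MultiheadAttention(d_model, num_heads=4,
                                               batch_first=True)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, obs: torch.Tensor, memory: SceneMemory) -> torch.Tensor:
        emb = self.encoder(obs)          # (d_model,)
        memory.add(emb)                  # store the new observation
        mem = memory.as_tensor()[None]   # (1, M, d_model)
        query = emb[None, None]          # (1, 1, d_model)
        # Attention selects the stored observations relevant to acting now.
        context, _ = self.attention(query, mem, mem)
        return self.action_head(context[0, 0])  # action logits
```

The key departure from an RNN policy is visible in SceneMemory.add: no past observation is merged into a fixed-size state vector, so no potentially valuable information is discarded.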

Empirical Results and Performance

Empirical validation is carried out over three distinct navigation tasks: roaming, coverage, and search—all hosted within a simulated indoor environment dataset, SUNCG. The SMT policy demonstrates notable superiority over baseline policies, including reactive policies and memory-based policies implemented with RNNs or structured external memories. SMT consistently exhibits better performance in terms of task-specific metrics such as distance covered, area explored, and objects discovered.

An analysis of the model's memory capacity reveals the benefit of SMT's expansive memory, which can efficiently embed hundreds of past observations. This provides a substantial advantage in tasks involving long sequences, where conventional memory mechanisms might falter due to computational constraints or optimization challenges.

Additionally, the paper introduces a memory factorization technique to address the computational cost of self-attention, which is quadratic in the memory size. The method reduces this cost to linear by dynamically segmenting the memory into clusters and attending only to representative cluster 'centers'. This innovation enables the SMT to leverage a high memory capacity without incurring prohibitive computational costs, facilitating its scalability to extended task horizons.
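As a rough illustration of why this yields linear complexity, the sketch below compresses the memory into centers with k-means before attention is applied. The clustering procedure, the fixed number of centers K, and the iteration count are all assumptions for illustration; the paper's exact segmentation method is not reproduced here.

```python
import torch


def factorize_memory(memory: torch.Tensor, num_centers: int = 32,
                     iters: int = 5) -> torch.Tensor:
    """Compress an (M, d) memory into (K, d) cluster centers, K << M."""
    num_entries = memory.shape[0]
    if num_entries <= num_centers:
        return memory
    # Initialize centers from randomly chosen memory entries.
    centers = memory[torch.randperm(num_entries)[:num_centers]].clone()
    for _ in range(iters):
        # Assign every memory entry to its nearest center: O(M * K).
        assignment = torch.cdist(memory, centers).argmin(dim=1)
        # Move each center to the mean of its assigned entries.
        for k in range(num_centers):
            members = memory[assignment == k]
            if len(members) > 0:
                centers[k] = members.mean(dim=0)
    return centers


# Attending over K centers costs O(M * K) per step instead of the O(M^2)
# cost of full self-attention over the memory.
```

Because K is fixed, the per-step cost grows linearly with the memory size M, which is what keeps the large scene memory tractable over long horizons.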

Implications and Future Directions

The introduction of SMT opens avenues for more sophisticated and memory-efficient navigational strategies in embodied AI systems. Its ability to handle temporal and spatial dependencies without explicit map structures could inspire further developments in environments where traditional navigation systems struggle due to unpredictability or incomplete information.

Future research may explore integrating SMT with sensory modalities beyond vision to improve cross-modal reasoning. Real-world applications could also be investigated by testing SMT in real-time environments, with adjustments for handling sensor and actuator noise more robustly.

The advancements presented by the SMT offer not only practical benefits for current navigation technologies but also contribute to a deeper theoretical understanding of memory usage and decision-making in AI agents, supporting the development of more autonomous and intelligent systems.