- The paper introduces a novel decomposition method that separates goal prediction from action generation to improve instruction execution in 3D settings.
- The authors introduce LingUNet, a language-conditioned adaptation of the U-Net architecture, to predict interpretable visual goals from raw observations without handcrafted symbolic representations.
- Evaluations on the new Lani and Chai benchmarks demonstrate improved navigation performance and reveal remaining challenges in executing complex, multi-step tasks.
Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction
The paper presents a novel approach to instruction execution in interactive 3D environments, decomposing the problem into goal prediction and action generation. The model maps raw visual observations to goals using the proposed LingUNet, a language-conditioned image generation network, and then generates the actions required to reach those goals.
Summary of Approach
In contrast to models that learn a direct mapping from inputs to actions, the paper argues for a separation of concerns between identifying the goal and generating the actions needed to reach it. The separation is implemented with LingUNet, an adaptation of the U-Net image-to-image architecture that conditions image generation on linguistic input; a minimal sketch of this idea appears below. Given the predicted visual goal, the model generates actions with a recurrent neural network (RNN), so the system operates without handcrafted intermediate symbolic representations or pretrained parsers. Training relies solely on demonstrations, with no dependence on external resources.
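To make the architecture concrete, here is a minimal, illustrative PyTorch sketch of the LingUNet idea: a U-Net-style network whose filters in the upsampling path are generated from the instruction embedding, producing a per-pixel goal distribution. The layer sizes, two-level depth, and the assumption of a precomputed instruction embedding are simplifications for illustration, not the paper's exact configuration.

```python
# Illustrative sketch of a language-conditioned U-Net (LingUNet-style);
# sizes and depth are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LingUNetSketch(nn.Module):
    def __init__(self, in_channels=3, hidden=32, text_dim=64, levels=2):
        super().__init__()
        self.levels = levels
        self.hidden = hidden
        # Downsampling (convolution) path, as in a standard U-Net encoder.
        self.down = nn.ModuleList(
            [nn.Conv2d(in_channels if i == 0 else hidden, hidden, 3, stride=2, padding=1)
             for i in range(levels)]
        )
        # Each level gets a slice of the instruction embedding, mapped to the
        # weights of a 1x1 convolution applied to that level's feature map.
        self.text_to_kernel = nn.ModuleList(
            [nn.Linear(text_dim // levels, hidden * hidden) for _ in range(levels)]
        )
        # Upsampling (deconvolution) path that merges language-filtered maps.
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(hidden if i == levels - 1 else 2 * hidden,
                                hidden, 4, stride=2, padding=1)
             for i in reversed(range(levels))]
        )
        self.out = nn.Conv2d(hidden, 1, 1)  # per-pixel goal score

    def forward(self, image, text_emb):
        # image: (B, C, H, W); text_emb: (B, text_dim), e.g. from an LSTM.
        feats = []
        x = image
        for conv in self.down:
            x = F.relu(conv(x))
            feats.append(x)

        # Language-conditioned filtering: generate a 1x1 kernel per level
        # from the corresponding slice of the instruction embedding.
        chunks = torch.chunk(text_emb, self.levels, dim=1)
        filtered = []
        for level, (feat, chunk) in enumerate(zip(feats, chunks)):
            b, c, h, w = feat.shape
            kernels = self.text_to_kernel[level](chunk).view(b, c, c, 1, 1)
            # Apply each example's own kernel to its own feature map.
            out = torch.stack(
                [F.conv2d(feat[i:i + 1], kernels[i]) for i in range(b)]
            ).squeeze(1)
            filtered.append(out)

        # Top-down deconvolution path, concatenating the filtered skip maps.
        x = filtered[-1]
        for i, deconv in enumerate(self.up):
            x = F.relu(deconv(x))
            skip_idx = self.levels - 2 - i
            if skip_idx >= 0:
                x = torch.cat([x, filtered[skip_idx]], dim=1)

        logits = self.out(x)                             # (B, 1, H, W)
        return torch.softmax(logits.flatten(1), dim=1)   # distribution over pixels


# Usage: predict a goal-location distribution for a 64x64 observation.
model = LingUNetSketch()
probs = model(torch.randn(2, 3, 64, 64), torch.randn(2, 64))
```

The key design choice this sketch illustrates is that language enters the network as convolution kernels applied to the visual feature maps, rather than being concatenated with them, which is what makes the goal prediction both language-conditioned and visually interpretable.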
Benchmarks and Experiments
The approach is evaluated on two newly introduced benchmarks, Lani and Chai, which serve as testbeds for instruction following at different levels of complexity. Lani tasks an agent with landmark navigation in a 3D environment; Chai, set in a simulated household, extends the task to include object manipulation. Lani comprises 6,000 instruction sequences, averaging 4.7 instructions per sequence, and focuses purely on navigation. Chai is harder: its sequences require spatial and temporal reasoning over multiple intermediate goals and varied action types.
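To illustrate how such benchmarks pair segmented instructions with demonstrations, the hypothetical record structure below shows one possible representation of an instruction sequence; the field names and values are assumptions for illustration, not the released data format.

```python
# Hypothetical record structure for one instruction sequence; field names
# and values are illustrative assumptions, not the benchmarks' data format.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InstructionSegment:
    text: str                           # one instruction in the paragraph
    demonstration: List[str]            # gold action sequence for this segment
    goal_position: Tuple[float, float]  # target location in the environment


@dataclass
class InstructionSequence:
    environment_id: str
    segments: List[InstructionSegment]  # Lani averages about 4.7 per sequence


seq = InstructionSequence(
    environment_id="lani_0001",
    segments=[
        InstructionSegment(
            text="curve around the right side of the well and stop at the flowers",
            demonstration=["FORWARD", "RIGHT", "FORWARD", "STOP"],
            goal_position=(12.4, 3.7),
        )
    ],
)
```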
Experimental analyses show improved performance from the goal-action decomposition, most clearly in the gain in success rate over recent methods on Lani. On Chai, overall results are weaker, underscoring the persistent challenge of multifaceted goals that require intricate action sequences. Even so, the proposed method consistently outperforms comparable approaches.
Numerical Results
Quantitative results reflect the advantages of the decomposition. On Lani, the model achieves a task completion (TC) accuracy of 36.9%, compared to 31% for the strongest baseline, and reduces the stop distance (SD) error to 8.43 (lower is better). On Chai, the model reaches a manipulation accuracy of 39.97%, comparable to the baseline, which is still notable given the difficulty of the manipulation tasks.
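As a rough illustration of the two navigation metrics, the sketch below uses their common definitions: stop distance is the distance between the agent's final position and the goal, and task completion counts an episode as successful when that distance falls below a threshold. The threshold value here is an assumption for illustration, not the benchmarks' official setting.

```python
# Sketch of the SD and TC metrics under common definitions; the success
# threshold is an illustrative assumption.
import math
from typing import List, Tuple

Point = Tuple[float, float]


def stop_distance(final_pos: Point, goal_pos: Point) -> float:
    """Euclidean distance between where the agent stopped and the goal."""
    return math.dist(final_pos, goal_pos)


def task_completion_rate(episodes: List[Tuple[Point, Point]],
                         threshold: float = 5.0) -> float:
    """Fraction of episodes whose stop distance is within the threshold."""
    successes = sum(
        stop_distance(final, goal) <= threshold for final, goal in episodes
    )
    return successes / len(episodes)


# Example: two episodes, one within the (assumed) 5-unit success radius.
episodes = [((1.0, 1.0), (2.0, 2.0)), ((0.0, 0.0), (20.0, 0.0))]
print(task_completion_rate(episodes))  # 0.5
```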
Implications and Future Directions
The implications of this research are significant for AI-driven navigation and manipulation in complex environments. By enabling interpretable visual goal prediction without pre-designed ontologies, the approach offers more flexibility across domains. The separation also allows each stage to be trained with the method that suits it best: goal prediction with supervised learning and action generation with reinforcement learning, which improves execution through exploration; a schematic of this split is sketched below.
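As a rough illustration of that split, the sketch below combines a supervised cross-entropy loss on the predicted goal map with a single-step REINFORCE update for the action policy. The `goal_net`, `policy_net`, and `env.step_and_score` interfaces are hypothetical placeholders, and the one-step policy-gradient update is a simplification of sequence-level training, not the paper's exact procedure.

```python
# Schematic training step: supervised goal prediction + policy-gradient
# action learning. All interfaces here are hypothetical placeholders.
import torch
import torch.nn.functional as F


def training_step(goal_net, policy_net, optimizer, batch, env):
    # 1) Supervised loss on the goal map: cross-entropy against the
    #    annotated goal location (flattened pixel index).
    goal_logits = goal_net(batch["image"], batch["instruction_emb"])  # (B, H*W)
    goal_loss = F.cross_entropy(goal_logits, batch["goal_index"])

    # 2) Policy-gradient loss on action generation: sample actions from a
    #    policy conditioned on the (detached) goal prediction, so the RL
    #    signal does not flow back into the supervised goal network.
    action_logits = policy_net(goal_logits.detach(), batch["state"])  # (B, A)
    dist = torch.distributions.Categorical(logits=action_logits)
    actions = dist.sample()
    reward = env.step_and_score(actions)  # hypothetical helper; (B,) rewards,
                                          # e.g. negative stop distance
    policy_loss = -(dist.log_prob(actions) * reward).mean()

    loss = goal_loss + policy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```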
Future work could extend the method to improve generalizability and robustness, in particular by addressing cascading errors and execution constraints specified within instructions. Further studies might integrate more complex reasoning capabilities directly into action generation models, a step toward more holistic understanding and execution in AI systems.
Overall, this paper contributes significant insights into instruction-following strategies in AI, presenting a promising avenue for enhancing the autonomy and interpretability of agents in highly interactive and variable environments.