Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
146 tokens/sec
GPT-4o
10 tokens/sec
Gemini 2.5 Pro Pro
47 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features (2410.13002v2)

Published 16 Oct 2024 in cs.RO and cs.AI

Abstract: End-to-end learning directly maps sensory inputs to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models often struggle to generalize beyond their training scenarios, limiting adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision LLMs (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning on a small simulated dataset successfully generalize to real-world scenes with diverse novel goals and command formulations.

Summary

We haven't generated a summary for this paper yet.