OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning (2505.11917v1)

Published 17 May 2025 in cs.RO

Abstract: General-purpose robots capable of performing diverse tasks require synergistic reasoning and acting capabilities. However, recent dual-system approaches, which separate high-level reasoning from low-level acting, often suffer from challenges such as limited mutual understanding of capabilities between systems and latency issues. This paper introduces OneTwoVLA, a single unified vision-language-action model that can perform both acting (System One) and reasoning (System Two). Crucially, OneTwoVLA adaptively switches between two modes: explicitly reasoning at critical moments during task execution, and generating actions based on the most recent reasoning at other times. To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable pipeline for synthesizing embodied reasoning-centric vision-language data, used for co-training with robot data. We validate OneTwoVLA's effectiveness through extensive experiments, highlighting its superior performance across four key capabilities: long-horizon task planning, error detection and recovery, natural human-robot interaction, and generalizable visual grounding, enabling the model to perform long-horizon, highly dexterous manipulation tasks such as making hotpot or mixing cocktails.

Summary

Overview of OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

The paper introduces OneTwoVLA, a unified Vision-Language-Action (VLA) model designed to enhance robotic task execution through synergistic reasoning and acting. The approach distinguishes itself by integrating acting and reasoning within a single model, enabling adaptive reasoning during execution. Addressing key limitations of dual-system approaches, namely limited mutual awareness between the two systems and latency, OneTwoVLA autonomously determines when to switch between reasoning and acting modes. The design targets key robotic challenges, including long-horizon task planning, error detection and recovery, natural human-robot interaction, and generalizable visual grounding.

Methodology

OneTwoVLA operates by interleaving reasoning and action within one framework, adapting to external inputs and task demands. Reasoning is triggered at critical junctures, such as the completion of a subtask, the detection of an error, or the need to interact with a human, and the resulting reasoning informs subsequent actions (see the sketch after this list). The methodology includes:

  • Adaptive Reasoning: The model switches to reasoning mode to generate scene descriptions, task plans, historical summaries, and next-step instructions.
  • Action Generation: Actions are produced based on the latest reasoning outputs to facilitate task execution.
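
The adaptive switching can be pictured as a single policy that, at each control step, decides whether the current moment calls for explicit reasoning or for another chunk of actions. The following is a minimal, illustrative sketch of such a loop, not the authors' implementation; the policy interface and its decide_mode, generate_reasoning, and generate_actions methods are hypothetical placeholders.

    # Illustrative inference loop for a unified acting/reasoning VLA policy.
    # All interfaces here (decide_mode, generate_reasoning, generate_actions,
    # and the env API) are hypothetical stand-ins, not the paper's code.

    def run_episode(policy, env, instruction, max_steps=500):
        obs = env.reset()
        latest_reasoning = None  # most recent "System Two" output

        for _ in range(max_steps):
            # The model itself judges whether this moment is critical
            # (subtask boundary, detected error, human input) and therefore
            # warrants explicit reasoning before acting.
            mode = policy.decide_mode(obs, instruction, latest_reasoning)

            if mode == "reason":
                # Explicit reasoning: scene description, task plan,
                # summary of progress, and the next sub-instruction.
                latest_reasoning = policy.generate_reasoning(
                    obs, instruction, latest_reasoning
                )

            # Actions are conditioned on the most recent reasoning, keeping
            # the action loop low-latency between reasoning updates.
            for action in policy.generate_actions(obs, instruction, latest_reasoning):
                obs, done, info = env.step(action)
                if done:
                    return info
        return None

Because reasoning happens only at selected moments and actions reuse the cached reasoning elsewhere, a single model can avoid the latency penalty that separating planning into a second system would impose.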

To strengthen reasoning and generalization, the authors developed a scalable pipeline for synthesizing embodied reasoning-centric vision-language data, which is used to co-train the model alongside robot data. This training scheme reinforces both high-level reasoning and precise low-level action generation; a simplified view of the data mixture is sketched below.
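
As a rough illustration of what such co-training could look like, the sketch below interleaves batches from a robot-demonstration dataset with batches from a synthesized embodied-reasoning vision-language dataset at a fixed ratio. The loader interfaces, the 0.3 mixing ratio, and the loss labels are assumptions made for illustration, not values reported in the paper.

    import random
    from itertools import cycle

    # Hypothetical co-training mixture: the ratio and loss labels below are
    # assumed for illustration; the paper's actual recipe may differ.

    def cotrain_batches(robot_loader, vl_reason_loader, vl_ratio=0.3, steps=10_000):
        robot_iter, vl_iter = cycle(robot_loader), cycle(vl_reason_loader)
        for _ in range(steps):
            if random.random() < vl_ratio:
                # Vision-language sample: supervises the reasoning text only.
                yield next(vl_iter), "reasoning_loss"
            else:
                # Robot sample: supervises reasoning (when annotated)
                # together with the low-level action head.
                yield next(robot_iter), "action_and_reasoning_loss"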

Experimental Validation

Extensive experiments validate OneTwoVLA's effectiveness across four key capabilities:

  • Long-horizon Task Planning: OneTwoVLA significantly outperforms baselines, achieving an average success rate of 87% across tasks that demand complex planning and dynamic adjustments.
  • Error Detection and Recovery: The model identifies and responds to execution errors in real-time, reasoning about and implementing corrective strategies efficiently.
  • Human-Robot Interaction: OneTwoVLA adjusts actions following human intervention and seeks clarification proactively when ambiguities arise. This contrasts sharply with baselines that struggle with context retention.
  • Generalizable Visual Grounding: Incorporating synthetic vision-language data improves visual grounding, allowing the model to handle spatial relationships, object attributes, and semantic features effectively. OneTwoVLA remains robust even with unfamiliar objects, indicating strong generalization driven by the diverse vision-language data.

Implications and Future Directions

The implications of OneTwoVLA's approach are substantial for both practical applications and theoretical advancements in AI:

  • Practical Implications: The unified model's ability to adaptively combine reasoning and acting benefits real-world robotic applications by expanding interaction in human-centric environments, improving safety and efficiency, and enabling continuous operation under dynamic changes.
  • Theoretical Implications: A single model that tightly couples reasoning and acting challenges established dual-system paradigms, inviting reconsideration of cognitive architectures for AI and their application in robotics.

Future research may address current limitations by investigating asynchronous architectures that further parallelize reasoning and acting, improving the reasoning process with reinforcement learning techniques, and exploring how vision-language data from more diverse sources affects model capabilities. Improving reasoning precision could substantially reduce execution errors and broaden the applications of AI-driven robotics.

The development of OneTwoVLA marks a step towards more autonomous and intelligent robotic systems, laying a foundation for future work on unified model architectures and their deployment at scale.
