Bi-VLA: Vision-Language-Action Model

Updated 6 July 2025
  • The Bi-VLA model is a multimodal system that integrates vision, language, and action for dexterous bimanual robotic manipulation.
  • It decouples task planning, perception, and control into distinct modules that translate natural language into executable robotic actions.
  • Empirical results show 100% planning accuracy and an 83.4% overall success rate, demonstrating the system's potential while exposing limitations in visual perception.

The Bi-VLA (Vision-Language-Action) model denotes a class of multimodal systems designed for dexterous robotic manipulation, integrating vision as a basis for scene understanding, language as an interface for semantic task specification and program synthesis, and action modules for precise, real-world robotic control. As formalized in the recent work "Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations" (Gbagbe et al., 9 May 2024), Bi-VLA systems are architected to translate human instructions into executable code, perceive complex environments, map visual data to actionable coordinates, and orchestrate bimanual robotic behaviors in a goal-driven manner.

1. System Architecture

The Bi-VLA architecture comprises three core modules:

  • Language Module: Handles task planning, code generation, and contextual adaptation of human instructions.
  • Vision-Language Module: Performs perceptual verification, object detection, and transforms 2D vision outputs into 3D actionable world coordinates.
  • Action Module: Executes the planned and synthesized code, coordinating real-time control over bimanual manipulators.

Each module is functionally decoupled but tightly integrated via shared interfaces, whereby language prompts initiate workflow, visual perception provides environmental grounding, and physical actions are programmed through API-level abstractions.
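
To make the decoupling concrete, the sketch below defines minimal Python interfaces for the three modules. All class and method names are illustrative assumptions chosen for clarity; they are not taken from the authors' implementation.

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    """An item localized by the Vision-Language Module (illustrative structure)."""
    name: str
    pixel_uv: Tuple[float, float]            # 2D detection in the on-arm camera image
    world_xyz: Tuple[float, float, float]    # 3D position after pixel-to-world mapping


class LanguageModule(ABC):
    @abstractmethod
    def plan(self, instruction: str) -> str:
        """Return a plan wrapped in [start of plan] ... [end of plan] tags."""

    @abstractmethod
    def generate_code(self, plan: str) -> List[str]:
        """Compile the plan into executable API calls, e.g. 'grasp(left, tomato)'."""


class VisionLanguageModule(ABC):
    @abstractmethod
    def verify_and_localize(self, required_items: List[str]) -> List[DetectedObject]:
        """Check item availability and map detections to world coordinates."""


class ActionModule(ABC):
    @abstractmethod
    def execute(self, api_calls: List[str], scene: List[DetectedObject]) -> bool:
        """Run the synthesized calls on the bimanual setup; return overall success."""

In this arrangement the Language Module only emits strings of API calls, the Vision-Language Module only returns grounded object records, and the Action Module is the only component that touches the robot drivers.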

2. Language Understanding and Code Generation

Upon receiving a human instruction (e.g., “Prepare a salad with tomatoes, lettuce, and cheese”), the system's LLM—in this context, Starling‑LM‑7B‑alpha—constructs a high-level semantic plan. This plan is demarcated using structured tags [start of plan] ... [end of plan], ensuring unambiguous parsing for downstream robotic control. The LLM-based Code Generator then compiles the plan into a set of executable Python API calls, which directly correspond to manipulation primitives such as:

open_gripper(manipulator_type)                   # release the selected arm's gripper
move_to_object(manipulator_type, object_name)    # move the selected arm to the named object
grasp(manipulator_type, object_name)             # close the gripper around the object
cut(manipulator_type, object_name)               # cutting motion with the utensil-equipped arm
pour(manipulator_type, object_name)              # tilt a grasped container to pour its contents
put(manipulator_type, location)                  # place the held object at the specified location
toss(manipulator_type, object_name)              # tossing/mixing motion (e.g., salad ingredients)

These actions are invoked sequentially to fulfill complex recipes and accommodate varying task objectives as specified by user intent.
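
As a rough sketch of this step, the snippet below extracts the tagged plan from raw LLM output and splits each call into a primitive name and arguments; the regular expressions and the sample plan are illustrative assumptions rather than the paper's parsing code.

import re

LLM_OUTPUT = """
[start of plan]
move_to_object(left, tomato)
grasp(left, tomato)
cut(right, tomato)
put(left, bowl)
[end of plan]
"""

def extract_plan(llm_output):
    """Pull the action lines out of the [start of plan] ... [end of plan] block."""
    match = re.search(r"\[start of plan\](.*?)\[end of plan\]", llm_output, re.DOTALL)
    if match is None:
        raise ValueError("LLM output did not contain a well-formed plan block")
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]

def parse_call(call):
    """Split 'grasp(left, tomato)' into ('grasp', ['left', 'tomato'])."""
    name, args = re.match(r"(\w+)\((.*)\)", call).groups()
    return name, [a.strip() for a in args.split(",") if a.strip()]

for call in extract_plan(LLM_OUTPUT):
    primitive, args = parse_call(call)
    print(primitive, args)   # in the full system this would invoke the robot API

Keeping the plan in a constrained, tag-delimited format is what allows such a simple parser to bridge free-form language and the fixed primitive API.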

3. Vision-Language Processing and Pixel-to-World Mapping

The Vision-Language Module employs the Qwen-VL model, a large-scale vision-language pre-trained system, to perform multi-object detection, classification, and caption verification using image streams from an on-arm camera. Detected items' pixel coordinates are mapped to physical 3D locations via a two-stage process:

  1. Projective Transformation (Equation 1):

z_c \cdot {}^{p}P = {}^{p}_{c}T \cdot {}^{c}_{w}T \cdot {}^{w}P

where z_c is the camera-space depth, ^pP the homogeneous pixel coordinate, ^p_cT the intrinsic matrix, ^c_wT the extrinsic (world-to-camera) transformation, and ^wP the target's world-frame coordinates.

  2. Undistortion Correction (Equation 2):

x_u = x_d (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x_d y_d + p_2 (r^2 + 2 x_d^2)
y_u = y_d (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y_d^2) + 2 p_2 x_d y_d

where (x_d, y_d) are the distorted pixel coordinates, (x_u, y_u) the undistorted coordinates, k_1, k_2, k_3, p_1, p_2 the Brown–Conrady distortion coefficients, and r = \sqrt{x_d^2 + y_d^2}.

After undistortion, normalized 2D coordinates are mapped to 3D world coordinates using camera pose information and known scene geometry. This mapping is crucial for aligning the robot’s manipulation trajectory with observed ingredient positions.
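
A minimal NumPy sketch of the two-stage mapping is given below. It assumes the target lies on a known table plane z_w = 0, and the intrinsic matrix, distortion coefficients, and camera pose are placeholder values, not the calibration used in the paper.

import numpy as np

# Placeholder calibration values (assumptions, not the paper's calibration).
K = np.array([[615.0,   0.0, 320.0],
              [  0.0, 615.0, 240.0],
              [  0.0,   0.0,   1.0]])                 # intrinsic matrix ^p_cT
k1, k2, k3, p1, p2 = -0.05, 0.01, 0.0, 0.001, 0.001   # Brown-Conrady coefficients
R_cw = np.diag([1.0, -1.0, -1.0])                     # camera-to-world rotation: camera looks straight down
t_cw = np.array([0.0, 0.0, 0.8])                      # camera origin in the world frame, 0.8 m above the table

def undistort(xd, yd):
    """Apply Equation 2 to normalized, distorted image coordinates."""
    r2 = xd**2 + yd**2
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    xu = xd * radial + 2 * p1 * xd * yd + p2 * (r2 + 2 * xd**2)
    yu = yd * radial + p1 * (r2 + 2 * yd**2) + 2 * p2 * xd * yd
    return xu, yu

def pixel_to_world(u, v):
    """Back-project a pixel onto the table plane z_w = 0 (Equation 1 rearranged)."""
    xd, yd = (u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1]   # normalize with the intrinsics
    xu, yu = undistort(xd, yd)
    ray_world = R_cw @ np.array([xu, yu, 1.0])                  # viewing ray expressed in the world frame
    s = -t_cw[2] / ray_world[2]                                 # scale so the point lands on z_w = 0
    return t_cw + s * ray_world

print(pixel_to_world(400.0, 300.0))   # approximate 3D position of a detected ingredient

In the actual system the extrinsics come from calibration of the on-arm camera, and the flat-plane assumption stands in for the known scene geometry mentioned above.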

4. Action Module and Robotic Control

The Action Module orchestrates two collaborative UR3 robotic arms. The code generated by the Language Module is systematically parsed into low-level motion primitives (a minimal dispatch sketch follows the list below). One robotic arm may be equipped with a two-finger Robotiq gripper for grasping, while the other may wield utensils such as knives. The module is responsible for:

  • Synchronizing arm trajectories for tasks such as grasping, cutting, pouring, placing, and tossing.
  • Executing real-time control using the pixel-to-world mappings provided by the Vision-Language Module.
  • Ensuring execution order adheres strictly to the semantic plan and that safety constraints (e.g., collision avoidance) are respected.
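
The sketch below shows one plausible dispatch loop; the ArmController class and its methods are hypothetical stand-ins for the UR3/Robotiq driver layer and are not the paper's control stack.

class ArmController:
    """Hypothetical wrapper around one UR3 arm (and optional Robotiq gripper)."""

    def __init__(self, name, tool):
        self.name, self.tool = name, tool   # e.g. ("left", "gripper"), ("right", "knife")

    def open_gripper(self):
        print(f"[{self.name}] opening gripper")

    def move_to(self, target_xyz):
        print(f"[{self.name}] moving to {target_xyz}")

    def grasp(self, obj):
        print(f"[{self.name}] grasping {obj}")

    def cut(self, obj):
        assert self.tool == "knife", "cutting requires the knife-equipped arm"
        print(f"[{self.name}] cutting {obj}")


ARMS = {"left": ArmController("left", "gripper"),
        "right": ArmController("right", "knife")}

def execute(primitive, arm, *args, scene=None):
    """Route one synthesized API call to the correct arm, in plan order."""
    scene = scene or {}
    controller = ARMS[arm]
    if primitive == "open_gripper":
        controller.open_gripper()
    elif primitive == "move_to_object":
        controller.move_to(scene.get(args[0], (0.0, 0.0, 0.0)))  # world coords from the Vision-Language Module
    elif primitive == "grasp":
        controller.grasp(args[0])
    elif primitive == "cut":
        controller.cut(args[0])
    else:
        raise NotImplementedError(primitive)

# Example: a fragment of a generated plan, executed strictly in plan order.
plan = [("move_to_object", "left", "tomato"),
        ("grasp", "left", "tomato"),
        ("cut", "right", "tomato")]
scene = {"tomato": (0.42, -0.10, 0.03)}   # from pixel-to-world mapping
for primitive, arm, *args in plan:
    execute(primitive, arm, *args, scene=scene)

In this sketch coordination is purely sequential; the real system additionally enforces the safety constraints noted above.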

5. Workflow and Module Interactions

The end-to-end workflow proceeds as follows:

  1. Input Query: User requests (spoken or typed) are interpreted via a Retrieval-Augmented Generation (RAG) system to fetch task context, such as retrieving the correct recipe from a vector database (a retrieval sketch follows this list).
  2. Semantic Planning: The Language Module produces a structured plan and converts it into code.
  3. Perceptual Verification: The Vision-Language Module validates the availability and physical localization of required ingredients.
  4. Execution: The Action Module sequentially carries out the planned operations using bimanual control, ensuring robust execution even for complex manipulation tasks.
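
To illustrate the retrieval step in stage 1, the sketch below ranks stored recipes by cosine similarity to the user query; the bag-of-words embedding, the recipe names, and the in-memory index are toy assumptions standing in for the actual encoder and vector database.

import numpy as np

# Toy stand-ins for the recipe vector database and its embedding model.
RECIPES = {
    "greek salad": "tomato cucumber feta cheese olives",
    "garden salad": "tomato lettuce cheese carrot",
    "caprese salad": "tomato mozzarella basil",
}
VOCAB = sorted({w for text in RECIPES.values() for w in text.split()})

def embed(text):
    """Bag-of-words vector over the recipe vocabulary (placeholder for a learned encoder)."""
    tokens = text.lower().replace(",", " ").split()
    v = np.array([tokens.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

INDEX = {name: embed(text) for name, text in RECIPES.items()}

def retrieve(query):
    """Return the recipe whose embedding has the highest cosine similarity to the query."""
    q = embed(query)
    return max(INDEX, key=lambda name: float(q @ INDEX[name]))

print(retrieve("Prepare a salad with tomato, lettuce, and cheese"))   # -> "garden salad"

The retrieved recipe text then becomes part of the context handed to the Language Module for semantic planning.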

6. Experimental Performance and Adaptability

The Bi-VLA system has been empirically validated in structured experiments:

  • Task Coverage: Evaluations include 100 user-generated requests (balanced across salad types), with the Language Module achieving 100% planning/code accuracy.
  • Visual Detection: Given images containing all required items, ingredient captioning achieves 96.06% accuracy and full-list identification 83.4%; when ingredients are missing, complete-list detection drops to 71.22%.
  • Overall Execution: The combined success rate for end-to-end fulfillment is 83.4%, limited primarily by Vision Module performance.
  • Adaptability: The system supports variable recipes and human-specified preference handling through dynamic plan generation and semantic adaptation.

A summary table of module-wise success rates:

Module             Task                   Success Rate (%)
Language Module    Code generation        100
Vision Module      Ingredient detection   96.06
System Overall     Task execution         83.4

7. Limitations and Future Directions

A primary bottleneck is the visual perception module: misclassification or detection failures in the Vision Module propagate directly to task execution failures. Future work is therefore expected to improve robustness to visual ambiguity, occlusions, varying lighting, and unstructured scenes. Further research directions include:

  • Enhancing adaptation for a broader variety of manipulation tasks beyond food preparation.
  • Improving perception-to-action alignment for visually ambiguous or cluttered environments.
  • Integrating more advanced robot learning modules, potentially incorporating online adaptation or reinforcement learning for continual improvement.

Conclusion

The Bi-VLA model defines a modular, end-to-end paradigm for bimanual dexterous robotic manipulation, unifying language understanding, vision-based scene grounding (including rigorous camera calibration and undistortion), and agile programmatic control of collaborative robots. Its empirically demonstrated capability to convert high-level instructions into robust, physically grounded task execution, together with clear limitations centered on perception accuracy, establishes Bi-VLA as an advanced prototype for interactive, real-world deployment of vision-language-action systems in robotics (Gbagbe et al., 9 May 2024).

References (1)
  1. Gbagbe et al., "Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations," 9 May 2024.