FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Published 26 May 2026 in cs.RO and cs.AI | (2605.27284v1)

Abstract: Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 10,816 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions. Our experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level success: FG-only improves over Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at FG:Raw = 1:2 to 1:1. The best mixed setting reaches 86.8%/82.5% in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation (vs. 49.9 Raw-only). Third, fine-grained supervision improves steerable control: the largest real-world gains appear on pose (+23), color (+18), and approach direction (+18)--factors where goal-level instructions provide no guidance. Overall, fine-grained language should augment goal-level instructions: specifying how to execute alongside what to achieve. Project page: https://finevla.xlang.ai/


Authors (14)

Summary

FineVLA advances language-conditioned robotic manipulation by aligning vision-language-action models through fine-grained instructions.
The framework uses diverse datasets to unify and annotate robot trajectories, enhancing action-instruction alignment and policy steerability.
Empirical results demonstrate optimal policy steerability with mixed instruction types, improving success rates and execution compliance.

FineVLA: Action-Aligned Fine-Grained Instruction for Steerable VLA Policies

Framework Overview

"FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies" (2605.27284) introduces a comprehensive open-source pipeline to enable and rigorously evaluate fine-grained language conditioning in Vision-Language-Action (VLA) models for robotic manipulation. FineVLA builds a closed action-instruction alignment loop linking heterogeneous trajectory unification, human-verified process-level annotation with a ten-dimensional schema, scalable video-language annotation, and steerable policy learning under controlled mixtures of fine-grained and goal-level instructions.

Figure 1: FineVLA architecture connecting fine-grained data generation, process-level annotation, robotic video understanding, scalable VLM-based annotation, and steerable policy training/evaluation across simulation and real-world platforms.

FineVLA-Tool: Data Unification, Clustering, and Annotation

FineVLA-Tool aggregates 972,247 robot trajectories from ten major open datasets and canonicalizes action/state representations across embodiments and temporal conventions. Redundant demonstrations are removed with DTW-based clustering in action space, so that the annotation budget is spent on diverse rather than near-duplicate trajectories. Representative samples are decomposed and annotated in a ten-dimensional schema covering action sequence, actor identity, target object, initial/final object configuration, contact/approach, trajectory/orientation, object interaction, failure/recovery, and body motion. Annotation combines large-model drafting (Qwen3.5-Plus) with human review, producing FineVLA-Data: 47,159 high-density, process-level instruction episodes (average information density $\sim$96.8 words per trajectory, $10.4\times$ that of the original coarse instructions).

Figure 2: FineVLA-Tool pipeline: dataset unification, canonicalization, DTW-based clustering, sampling for diversity, multi-aspect annotation, human verification.
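The redundancy-removal step can be illustrated with a minimal sketch. It assumes a plain dynamic-time-warping distance with Euclidean step cost and a greedy "leader" selection rule; the summary does not specify the exact DTW variant, clustering algorithm, or threshold, so `dtw_distance`, `select_representatives`, and `threshold` are hypothetical names chosen for illustration only.

```python
import math
from typing import List, Sequence

Trajectory = Sequence[Sequence[float]]  # T x D sequence of action vectors


def dtw_distance(a: Trajectory, b: Trajectory) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping, Euclidean step cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost between a[:i] and b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def select_representatives(trajs: List[Trajectory], threshold: float) -> List[int]:
    """Greedy leader clustering (an assumed stand-in for the paper's method):
    keep a trajectory only if its DTW distance to every representative
    kept so far exceeds `threshold`."""
    kept: List[int] = []
    for idx, traj in enumerate(trajs):
        if all(dtw_distance(traj, trajs[k]) > threshold for k in kept):
            kept.append(idx)
    return kept


# Toy 1-D action sequences (made up for illustration).
slow = [[0.0], [0.0], [1.0], [2.0], [2.0]]   # same motion as `fast`, time-stretched
fast = [[0.0], [1.0], [2.0]]
other = [[0.0], [-1.0], [-2.0]]              # a genuinely different motion
print(select_representatives([slow, fast, other], threshold=0.5))  # prints [0, 2]
```

DTW treats `slow` and `fast` as the same motion executed at different speeds, so only one of them is kept. This is the property that makes it a natural fit for demonstrations recorded under different temporal conventions.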

RoboFine-Bench: Fine-Grained Robotic Video Evaluation

RoboFine-Bench is a held-out video benchmark of 500 episodes strictly separated from policy and VLM training. Each trajectory carries step-level annotation; across the benchmark, these annotations are decomposed into 10,816 atomic facts and 1,030 VQA questions aligned with the ten FineVLA dimensions. The evaluation protocol comprises:

VQA track: Probing discriminative understanding on entity/scene grounding, action/motion reasoning, and interaction/state transitions.
Caption track: Assessing generative alignment to ground-truth fine-grained process steps via Consistency, Coverage, and Anti-Hallucination metrics (using LLM-based judging over atomic fact sets).

Figure 3: RoboFine-Bench structure, statistics, and task coverage: video durations, diverse manipulation skills/objects, ground-truth fact distributions; example VQA/caption probes.
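To make the caption track more concrete, here is one plausible way to aggregate per-fact judge verdicts into the three named scores. The summary does not give the metric formulas, so the label sets and the definitions below are assumptions made purely for illustration, and the LLM judge itself is abstracted away as a list of precomputed labels.

```python
from collections import Counter
from typing import Dict, List


def caption_scores(fact_verdicts: List[str], claim_verdicts: List[str]) -> Dict[str, float]:
    """Aggregate judge verdicts into three scores in [0, 1].

    fact_verdicts: one label per ground-truth atomic fact, one of
        "supported" | "contradicted" | "missing"    (hypothetical label set)
    claim_verdicts: one label per claim extracted from the generated caption,
        "grounded" | "hallucinated"                 (hypothetical label set)
    """
    facts = Counter(fact_verdicts)
    claims = Counter(claim_verdicts)
    # Facts the caption actually talks about, correctly or not.
    addressed = facts["supported"] + facts["contradicted"]
    return {
        # Assumed: share of ground-truth facts the caption gets right.
        "coverage": facts["supported"] / len(fact_verdicts) if fact_verdicts else 0.0,
        # Assumed: among addressed facts, share that are not contradicted.
        "consistency": facts["supported"] / addressed if addressed else 0.0,
        # Assumed: share of caption claims backed by some ground-truth fact.
        "anti_hallucination": claims["grounded"] / len(claim_verdicts) if claim_verdicts else 0.0,
    }


scores = caption_scores(
    ["supported", "supported", "contradicted", "missing"],
    ["grounded", "grounded", "grounded", "hallucinated"],
)
print(scores)
```

Keeping the three quantities separate distinguishes a caption that is short but accurate (low coverage, high consistency) from one that is long but invents details (high coverage, low anti-hallucination).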

RoboFine-VLM: Robotics-Specialized Scalable Annotator

RoboFine-VLM, a Qwen3.5-397B-A17B model fine-tuned (SFT) on FineVLA-Data, produces action-aligned step-level captions for unseen robot trajectories, supporting scalable expansion of fine-grained annotation. Unlike generic VLMs, RoboFine-VLM is tuned to produce detailed, execution-relevant descriptions covering critical control factors.

Policy Training: Instruction Mixtures and Steerability

FineVLA-Policy is instantiated on two action-decoding architectures (StarVLA-OFT and StarVLA-GR00T). Across conditions, the visual observations and actions are held identical and only the paired instruction changes: either raw goal-level language or process-level fine-grained language. Policies are trained at several FG:Raw sampling ratios: Raw-only, FG-only, and a range of intermediate mixtures.
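A minimal sketch of how such a mixture could be realized at data-loading time is given below. It assumes the ratio is applied as a per-sample probability of choosing the fine-grained instruction; the summary does not say whether mixing happens per sample, per episode, or when the dataset is built, and `Episode`, `fg_probability`, and `pick_instruction` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    raw_instruction: str             # goal-level language
    fg_instruction: Optional[str]    # fine-grained language, if annotated


def fg_probability(fg_parts: float, raw_parts: float) -> float:
    """Convert an FG:Raw ratio such as 1:2 into P(sample fine-grained)."""
    total = fg_parts + raw_parts
    if fg_parts < 0 or raw_parts < 0 or total <= 0:
        raise ValueError("ratio parts must be non-negative and not both zero")
    return fg_parts / total


def pick_instruction(ep: Episode, p_fg: float, rng: random.Random) -> str:
    """Choose the instruction paired with the (unchanged) observations and actions.
    Falls back to the raw instruction when no fine-grained annotation exists."""
    if ep.fg_instruction is not None and rng.random() < p_fg:
        return ep.fg_instruction
    return ep.raw_instruction


rng = random.Random(0)
ep = Episode("put the cup on the plate",
             "use the left arm to grasp the cup by its handle from the side, "
             "then place it upright on the plate")
p = fg_probability(1, 2)   # FG:Raw = 1:2, so one sample in three is fine-grained on average
batch = [pick_instruction(ep, p, rng) for _ in range(6)]
```

With this convention, Raw-only corresponds to `fg_probability(0, 1)` and FG-only to `fg_probability(1, 0)`, so all experimental conditions share one code path and differ in a single scalar.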

Empirical Results: Benchmark, Simulation, and Real-World

RoboFine-Bench Performance

RoboFine-VLM achieves strong performance on the held-out benchmark: 71.0% overall VQA accuracy (+8.9 points over Gemini-3.1-Pro), with substantial gains in action/motion reasoning (68.4% vs. 58.4%) and generative caption alignment (83.6% overall under the hard setting, the best of all models compared). Scores correlate strongly with human rankings (Pearson/Spearman $\geq 0.97$).

Figure 4: Caption-track model comparison in easy and hard settings; RoboFine-VLM ranks highest in both human and automatic alignment judgments.
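For readers unfamiliar with the two correlation measures quoted above, the following sketch shows how agreement between automatic scores and human rankings is typically computed. It is a generic, dependency-free implementation, not the authors' evaluation code, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed lists."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson rewards scores that track human judgments linearly, while Spearman only asks that the ordering of models agrees. Reporting both guards against a metric that ranks models correctly but on a distorted scale.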

Simulation: RoboTwin Manipulation

In RoboTwin dual-arm simulation, fine-grained (FG)-only supervision yields better success rates than raw-only across datasets and architectures (e.g., AlohaMix-OFT: +6.5/+4.7 on Easy/Hard). However, peak performance appears in mixed FG:Raw settings (1:2 or 1:1), tracing a consistent inverted-U trend. The FG:Raw=1:1 policy reaches 86.8%/82.5% on AlohaMix-OFT Easy/Hard (Raw-only: 71.8%/71.4%). Fine-grained supervision narrows architectural gaps and scales more effectively with larger, diverse training corpora.

Figure 5: RoboTwin mixing-ratio curves: FG and raw instructions are complementary; optimal steerability emerges at 1:2/1:1 mixtures.
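The AlohaMix-OFT numbers quoted above can be combined into a small worked comparison. The FG-only success rates are not stated directly in this summary; the values below are derived from the quoted Raw-only rates and the quoted FG-only gains, under the assumption that both refer to the same setting.

```python
# Success rates (%) on AlohaMix-OFT, Easy/Hard, as quoted in the text.
raw_only = {"easy": 71.8, "hard": 71.4}
mixed_1_1 = {"easy": 86.8, "hard": 82.5}
fg_only_gain = {"easy": 6.5, "hard": 4.7}   # FG-only minus Raw-only, as quoted

# Implied FG-only success rate (derived here, not quoted in the text).
fg_only = {k: round(raw_only[k] + fg_only_gain[k], 1) for k in raw_only}

# Gain of the FG:Raw = 1:1 mixture over each single-source baseline.
mixed_vs_raw = {k: round(mixed_1_1[k] - raw_only[k], 1) for k in raw_only}
mixed_vs_fg = {k: round(mixed_1_1[k] - fg_only[k], 1) for k in raw_only}

print(fg_only)        # {'easy': 78.3, 'hard': 76.1}
print(mixed_vs_raw)   # {'easy': 15.0, 'hard': 11.1}
print(mixed_vs_fg)    # {'easy': 8.5, 'hard': 6.4}
```

On these figures the 1:1 mixture beats both endpoints, which is the pattern the inverted-U description refers to: performance rises from Raw-only toward the mixed regime and falls again toward FG-only.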

Real-World Steerable Control

On the Cobot Magic dual-arm platform, mixed fine-grained/goal-level supervision (FG:Raw=1:1) achieves the highest in-distribution partial score (62.7/100; Raw-only: 49.9). The largest gains fall on execution-sensitive factors that goal-level language leaves unspecified: pose (+23), color (+18), and approach direction (+18). Different language variants produce distinct manipulation strategies in otherwise identical visual contexts, indicating explicit steerability.

Figure 6: Real-world paired control factor evaluation: color, pose, approach, rotation, and arm; fine-grained language reliably modulates execution attributes.

Analysis and Implications

Instruction Mixing and Complementarity

Fine-grained supervision augments rather than replaces goal-level instruction. FG-only training over-specifies, potentially reducing generalization to compact goal-level language, while Raw-only training leaves execution choices implicit. The inverted-U trend indicates that the best-performing, most steerable policies combine both types: goal-level language encodes task semantics, and fine-grained language constrains execution.

Architecture and Scaling Effects

Dense fine-grained annotation reduces reliance on specialized action-decoding frameworks, closing gaps between architectural variants and providing more benefit as dataset diversity increases. This establishes fine-grained supervision as a scalable axis independent of architectural tweaks.

Language-Critical Factor Control

A separate evaluation on language-critical control factors (object pose, color, approach direction, rotation, and arm) quantifies compliance with process-level instructions. It shows that fine-grained annotation directly improves compliance on factors that are entirely absent from goal-level labels.

Benchmark Validity

Caption alignment scores are robust across LLM judges (GPT-5.4, Gemini-3.1-Pro) and closely track human preferences in ranking experiments. VQA and caption tracks jointly validate coverage of the annotation schema and demonstrate that RoboFine-VLM’s alignment is not mere task-prior exploitation.

Figure 7: Human ranking interface: multi-model caption comparison for fine-grained ranking and robust benchmark validation.

Limitations and Future Directions

Remaining failures arise from grounding errors (incorrect factor selection), execution errors (unstable manipulation), and limited compositional generalization (incomplete actor-target binding). RoboFine-VLM still requires partial human verification to maintain annotation quality. Real-world validation is restricted to tabletop dual-arm manipulation with a limited task set. Safety-critical deployment would require integrating feasibility checks into fine-grained instruction following.

Conclusion

FineVLA establishes a new paradigm for steerable VLA policy design grounded in action-instruction alignment, spanning comprehensive open-source fine-grained annotation, scalable robotics-specialized VLM training, rigorous video understanding benchmarks, and controlled policy evaluation. The strong empirical results confirm:

Fine-grained supervision improves both task success and instruction-compliance without sacrificing goal-level completion.
Optimal steerable control requires mixing goal-level and process-level instructions.
Fine-grained annotation directly enables explicit modulation of execution-sensitive factors previously unaddressable.

The release of FineVLA’s tools, benchmark, models, and training code provides a reproducible foundation for research on, and practical deployment of, instruction-conditioned robotic manipulation. Future work should improve compositional generalization, extend validation to broader task and embodiment regimes, and incorporate robust physical safety models for policies that follow fine-grained language.
