ManualVLA: Manipulation-Centered VLA Research

Updated 4 July 2026

ManualVLA is a research framework that couples language-based perception with physically-constrained manipulation, emphasizing safe contact, bimanual coordination, and explicit skill structures.
Architectural patterns include modular compliance layers, structural decomposition by embodiment, and unified generative backbones designed to enhance stability and data efficiency in execution.
ManualVLA research addresses VLAs' limits by integrating human-guided intervention, tactile and force feedback, and specialized skill libraries to improve long-horizon, contact-rich manipulation.

ManualVLA, used here as an editor's term for the surveyed literature, denotes a manipulation-centered branch of vision-language-action (VLA) research in which language grounding is coupled to physically constrained execution, explicit skill structure, embodiment specialization, or human collaboration rather than treated as a single monolithic policy problem. The surveyed papers do not define one canonical ManualVLA architecture; instead, they describe a family of systems that extend VLAs toward safe contact-rich execution, data-efficient bimanual coordination, long-horizon assembly, and instruction robustness under realistic deployment constraints (Zhang et al., 21 Jan 2026, Im et al., 7 Nov 2025, Ma et al., 1 Jul 2026, Song et al., 12 Jun 2025).

1. Scope and conceptual definition

In this literature, “manual” should not be read narrowly as anthropomorphic hand dexterity. A more faithful synthesis is that ManualVLA concerns manipulation regimes where semantic competence alone is insufficient, because success depends on how motion is executed, whether interaction should be rejected, which tool or arm should be active, or when a human or higher-level planner should intervene. This includes sparse expert action insertion, explicit tool invocation, skill libraries, compliance adaptation, and embodiment changes such as suction or bimanual composition (Xiang et al., 6 Mar 2025, Lei et al., 13 May 2026, Zhou et al., 26 Nov 2025).

A common through-line is that current VLAs are strong at semantic interpretation and action generation, but weak whenever execution must remain stable across contact, uncertainty, long horizons, or defective instructions. This suggests that ManualVLA is less a single model family than a design philosophy: preserve language-conditioned generality, but add mechanisms that make execution physically and operationally credible.

2. Architectural patterns

A dominant architectural pattern is modularization around the action interface. CompliantVLA-adaptor leaves the original VLA policy intact and inserts a compliance layer between VLA output and robot execution. The baseline policy remains

$\pi_\text{VLA} : \mathcal{S} \times \mathcal{T} \rightarrow \mathcal{A},$

while the adaptor adds a VLM-informed mapping

$\text{VLM}(\mathcal{S} \times \mathcal{T} \times \mathcal{F}) \rightarrow (\mathcal{K}, \mathcal{D}),$

so that desired end-effector motion is executed through context-aware variable impedance rather than rigid position tracking (Zhang et al., 21 Jan 2026).
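For orientation, a common textbook form of Cartesian impedance control is shown below. It is not taken from the paper and may differ from the adaptor's exact control law; the symbols $x_d$, $x$, and $F_\text{cmd}$ are assumptions introduced here, with $K \in \mathcal{K}$ and $D \in \mathcal{D}$ standing for the stiffness and damping selected by the VLM:

$F_\text{cmd} = K\,(x_d - x) + D\,(\dot{x}_d - \dot{x}),$

where $x_d$ is the end-effector target produced by $\pi_\text{VLA}$ and $x$ is the measured end-effector state. Under this form, lowering the stiffness along a contact direction means the same position error produces less force, which is the practical difference from rigid position tracking.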

A second pattern is structural decomposition by embodiment or control mode. TwinVLA composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA, sharing the vision encoder and DiT action head while fully replicating the VLM and adding joint attention across the two branches; the resulting model is 1.3B parameters, compared to 1.2B for RDT-1B (Im et al., 7 Nov 2025). DAM-VLA introduces an action router that selects between an arm-movement model and a gripper-manipulation model, using a reasoning latent for routing and a cognition latent for generation; its action is represented as

$\boldsymbol{a}_{t} = [\delta \boldsymbol{x}, \delta \boldsymbol{\theta}, s^{grip}],$

with separate phase-specific visual conditioning for gross motion and local manipulation (Peng et al., 1 Mar 2026).
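The routing idea can be illustrated with a minimal sketch. Everything below is hypothetical: the linear router, the function names, the stub models, and the assumption that $\delta \boldsymbol{x}$ and $\delta \boldsymbol{\theta}$ are each three-dimensional (giving a 7-D action) are illustrative choices, not DAM-VLA's implementation.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
ActionModel = Callable[[Vector], List[float]]  # cognition latent -> 7-D action


def router_logit(reasoning_latent: Vector, weights: Vector, bias: float = 0.0) -> float:
    """Hypothetical linear router: a positive logit selects the gripper model."""
    return sum(w * r for w, r in zip(weights, reasoning_latent)) + bias


def routed_action(
    reasoning_latent: Vector,
    cognition_latent: Vector,
    weights: Vector,
    arm_model: ActionModel,
    gripper_model: ActionModel,
) -> Tuple[str, List[float]]:
    """Choose one action model from the reasoning latent, then generate from the cognition latent."""
    use_gripper = router_logit(reasoning_latent, weights) > 0.0
    name, model = ("gripper", gripper_model) if use_gripper else ("arm", arm_model)
    action = model(cognition_latent)
    if len(action) != 7:  # assumed layout: [dx, dy, dz, droll, dpitch, dyaw, s_grip]
        raise ValueError("expected a 7-D action [delta_x(3), delta_theta(3), s_grip]")
    return name, action


# Stub models standing in for the two learned action models (illustrative only).
def arm_stub(z: Vector) -> List[float]:
    return [0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # translate; gripper state left at 1.0


def gripper_stub(z: Vector) -> List[float]:
    return [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # hold pose; gripper state set to 0.0
```

Keeping the routing input (reasoning latent) separate from the generation input (cognition latent) mirrors the split described above: the decision of which model acts does not have to share features with the model that produces the motion.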

A third pattern is hierarchical language-to-latent mediation. RationalVLA inserts a high-level MLLM in front of a low-level 3D Diffuser Actor, using the special tokens <ACT> and <REJ> to either emit a latent control embedding or reject the instruction entirely. The low-level policy then acts as

$a = \pi_{\theta}(o, p, z),$

with $z$ derived from the high-level language-vision judgment rather than from raw text alone (Song et al., 12 Jun 2025).
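A minimal sketch of this reject-or-act gate follows. The `dispatch` function, the token-list representation of the high-level output, and the stub policy signature are assumptions made for illustration; only the two token names and the argument order $(o, p, z)$ come from the description above.

```python
from typing import Callable, List, Optional, Sequence

ACT_TOKEN = "<ACT>"
REJ_TOKEN = "<REJ>"

# Low-level policy a = pi_theta(o, p, z); here a plain callable for illustration.
Policy = Callable[[Sequence[float], Sequence[float], Sequence[float]], List[float]]


def dispatch(
    tokens: Sequence[str],
    z: Optional[Sequence[float]],
    o: Sequence[float],
    p: Sequence[float],
    policy: Policy,
) -> Optional[List[float]]:
    """Return an action when the high-level model emits <ACT>, or None on <REJ>."""
    if REJ_TOKEN in tokens:
        return None  # instruction rejected: the low-level policy is never called
    if ACT_TOKEN in tokens:
        if z is None:
            raise ValueError("<ACT> requires a latent control embedding z")
        return policy(o, p, z)
    raise ValueError("high-level output contained neither <ACT> nor <REJ>")
```

In this sketch rejection takes precedence over action, so an output containing both tokens is treated as a refusal; that ordering is a conservative design choice made here, not a documented behavior of RationalVLA.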

A fourth pattern is a large unified generative backbone in place of explicit modular heads. iFlyBot-VLA combines a Qwen2.5-VL (3B) backbone with a Flow-Matching Diffusion Transformer, jointly supervising the VLM with latent action tokens and structured discrete action tokens; on LIBERO it reports 93.8% average success, compared with 86.0% for $\pi_0$ and 76.5% for OpenVLA (Zhang et al., 1 Nov 2025). MMaDA-VLA pushes this further by embedding language, images, and continuous robot controls into one discrete token space and jointly generating a future goal observation and an action chunk through native discrete diffusion, reaching 98.0% average success on LIBERO and 4.78 average length on CALVIN (Liu et al., 26 Mar 2026). These systems suggest that ManualVLA can also take the form of a large unified generative model, provided that long-horizon consistency and action grounding are explicitly addressed.

3. Contact mechanics, compliance, and embodiment

A central ManualVLA theme is that manipulation quality depends on contact mechanics, not only on semantic correctness. CompliantVLA-adaptor states this most explicitly: existing VLA systems such as RDT, $\pi_0$, and OpenVLA-OFT are strong at semantic interpretation and action generation but are still mostly executed through position or trajectory control, which becomes unsafe in insertion, constrained motion, drawer operation, or pushing against resistance. Its adaptor uses four contact phases (Free-motion, Approaching, Contact, and Retreat) together with anisotropic translational impedance and real-time force regulation at 1000 Hz, under a 30 N safety threshold. Across all tasks, the average success rate increases from 9.86% to 17.29% (Zhang et al., 21 Jan 2026).
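The combination of phase-dependent anisotropic gains and a force limit can be sketched as follows. The four phase names and the 30 N threshold come from the text; the gain values, the spring-damper law, and the choice to rescale (rather than halt) when the limit is exceeded are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

FORCE_LIMIT_N = 30.0  # safety threshold quoted in the text


@dataclass(frozen=True)
class Gains:
    stiffness: Tuple[float, float, float]  # N/m along x, y, z
    damping: Tuple[float, float, float]    # N*s/m along x, y, z


# Hypothetical per-phase gains; z is made softest during contact (anisotropic).
PHASE_GAINS: Dict[str, Gains] = {
    "free_motion": Gains((800.0, 800.0, 800.0), (60.0, 60.0, 60.0)),
    "approaching": Gains((500.0, 500.0, 300.0), (50.0, 50.0, 40.0)),
    "contact": Gains((400.0, 400.0, 100.0), (40.0, 40.0, 30.0)),
    "retreat": Gains((600.0, 600.0, 600.0), (50.0, 50.0, 50.0)),
}


def commanded_force(phase: str, pos_err: Sequence[float], vel_err: Sequence[float]) -> List[float]:
    """Spring-damper force per axis, rescaled so its norm never exceeds the limit."""
    g = PHASE_GAINS[phase]
    force = [k * e + d * v for k, e, d, v in zip(g.stiffness, pos_err, g.damping, vel_err)]
    norm = sum(f * f for f in force) ** 0.5
    if norm > FORCE_LIMIT_N:
        scale = FORCE_LIMIT_N / norm  # assumed behavior: saturate rather than stop
        force = [f * scale for f in force]
    return force
```

With these illustrative gains, a 1 cm error along z during the contact phase yields about 1 N, whereas a large free-motion error is saturated at the 30 N limit instead of being tracked rigidly.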

Another route is embodiment redesign. VacuumVLA argues that many VLA failures are failures of the end effector rather than of semantic reasoning, and replaces the default parallel gripper assumption with a unified suction-plus-gripper tool. The system supports gripping only, suction only, and combined use across task sequences. On four long-horizon tasks chosen to be infeasible for conventional two-finger grippers, the gripping-only DexVLA baseline is 0.0% on all tasks, whereas VacuumVLA with DexVLA reaches 73.3%, 80.0%, 53.3%, and 33.3%, and VacuumVLA with $\pi_0$ reaches 53.3%, 66.7%, 60.0%, and 53.3% (Zhou et al., 26 Nov 2025). This is strong evidence that ManualVLA should treat embodiment as a first-class variable.

SELF-VLA reaches a similar conclusion from the opposite direction: rather than changing the end effector, it changes the control structure. The framework combines a VLA-planner, a manually structured skill library, and a VLA-corrector. The CPU extraction skill contains 23 waypoints, and the RAM removal skill contains 8 waypoints; the most contact-sensitive part of the task is executed through explicit skills rather than end-to-end VLA action generation (Liu et al., 10 Mar 2026). This suggests that manual structure inside the controller can substitute for, or complement, richer end-effector hardware.

Mag-VLA generalizes the same logic to a highly unusual setting—bimanual magnetic microrobot manipulation—where the robot does not directly contact the manipulated object. It uses a LoRA-adapted Qwen2.5-VL-7B backbone, a motion-aware phase classifier, and a phase-conditioned ACT decoder, achieving a 90% approach success rate and transport success rates of 80%, 70%, and 50% as task difficulty increases (Wang et al., 27 May 2026). The broader implication is that ManualVLA is not restricted to rigid grasping; it also includes indirect, non-contact, or tool-mediated manipulation whenever execution depends on specialized low-level structure.

4. Bimanuality, memory, and long-horizon composition

ManualVLA research frequently treats long-horizon manipulation as a composition problem rather than a one-policy problem. TwinVLA shows this in the bimanual regime: it reuses single-arm pretraining, adds joint attention for cross-arm communication, and fine-tunes on as few as 50 demonstrations per task, outperforming a comparably sized monolithic RDT-1B and approaching $\pi_0$ without bimanual pretraining (Im et al., 7 Nov 2025). Its results support the idea that many bimanual tasks can be factorized into two arm-specialized branches plus structured communication.

FurnitureVLA addresses the same issue in real-scale assembly. It decomposes IKEA furniture assembly into semantically grounded subtasks, augments each action with a continuous within-subtask progress scalar,

$\tilde a_t = [a_t^\top, p_t]^\top,$

and uses predicted progress to trigger subtask transitions. The system improves average simulation success from 48% to 80% across three furniture types, obtains an additional 21% gain from design-factor tuning, and reports only a 16% drop on the hardest real task (Ma et al., 1 Jul 2026). Here ManualVLA appears as a combination of stage conditioning, progress estimation, and carefully engineered perception-control interfaces.
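A minimal sketch of progress-triggered subtask switching, assuming the progress scalar $p_t$ is the last element of $\tilde a_t$ as in the equation above. The `SubtaskTracker` class, the 0.95 threshold, and the subtask names in the usage note are hypothetical and are not taken from FurnitureVLA.

```python
from typing import List, Sequence, Tuple


def split_augmented_action(a_tilde: Sequence[float]) -> Tuple[List[float], float]:
    """Split [a_t, p_t] into the motor action a_t and the progress scalar p_t."""
    if len(a_tilde) < 2:
        raise ValueError("augmented action needs at least one action dim plus progress")
    return list(a_tilde[:-1]), float(a_tilde[-1])


class SubtaskTracker:
    """Advance to the next subtask once predicted progress crosses a threshold."""

    def __init__(self, subtasks: Sequence[str], threshold: float = 0.95) -> None:
        if not subtasks:
            raise ValueError("at least one subtask is required")
        self.subtasks = list(subtasks)
        self.threshold = threshold  # hypothetical value, not from the paper
        self.index = 0

    @property
    def current(self) -> str:
        return self.subtasks[self.index]

    def update(self, progress: float) -> str:
        """Return the subtask to condition on for the next step."""
        if progress >= self.threshold and self.index < len(self.subtasks) - 1:
            self.index += 1
        return self.current
```

For example, with `SubtaskTracker(["attach_leg", "flip_table"])`, calling `update(0.5)` keeps the policy on `attach_leg`, while `update(0.97)` switches conditioning to `flip_table`. A real system would likely need hysteresis or smoothing so that a single noisy progress estimate does not trigger a premature transition.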

A more agentic formulation appears in VLAs-as-Tools. Instead of extending one VLA to cover both global planning and local execution, the paper defines an interface

$\mathcal{I}=(\mathcal{C},\mathcal{R}),$

where the planner emits an invocation consisting of a tool-family label and a scene-grounded local instruction, and the tool returns progress feedback for event-triggered replanning. Tool-Aligned Post-Training then trains the VLA on invocation-aligned segments rather than only on full-task rollouts. The reported gains are 4.8 points on LIBERO-Long and 23.1 points on RoboTwin (Lei et al., 13 May 2026). The implication is that ManualVLA often benefits from making the VLA callable as a tool rather than deploying it as the sole long-horizon agent.
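The planner-tool loop can be sketched as below. This is an illustrative reading of the interface in which $\mathcal{C}$ is treated as the space of invocations and $\mathcal{R}$ as the space of returned feedback; the class names, the `stalled` flag, and the replanning policy (replace the remaining queue) are assumptions, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Invocation:
    tool_family: str   # which family of VLA skill to call
    instruction: str   # scene-grounded local instruction


@dataclass(frozen=True)
class Feedback:
    progress: float    # tool-reported progress in [0, 1]
    stalled: bool      # event that triggers replanning (hypothetical signal)


Tool = Callable[[Invocation], Feedback]
Replanner = Callable[[Invocation, Feedback], Sequence[Invocation]]


def run_plan(
    plan: Sequence[Invocation], tool: Tool, replan: Replanner, max_calls: int = 10
) -> List[Tuple[Invocation, Feedback]]:
    """Execute invocations in order; on a stall event, replace the remaining plan."""
    queue = list(plan)
    log: List[Tuple[Invocation, Feedback]] = []
    while queue and len(log) < max_calls:
        invocation = queue.pop(0)
        feedback = tool(invocation)
        log.append((invocation, feedback))
        if feedback.stalled:
            queue = list(replan(invocation, feedback))
    return log
```

The planner is consulted only when the tool reports an event, not at every control step, which is what makes the replanning event-triggered in this sketch. The `max_calls` cap is a safeguard added here so that a tool that always stalls cannot loop forever.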

EchoVLA extends the same argument to mobile manipulation by adding scene memory and episodic memory, reaching 0.52 success rate on manipulation/navigation and 0.31 on mobile manipulation, exceeding the baseline it is compared against by 0.08 and 0.11, respectively (Lin et al., 22 Nov 2025). This suggests that ManualVLA must often preserve state across viewpoint changes and task phases, not merely across short action histories.

5. Human interfaces, instruction robustness, and manual guidance

ManualVLA is also defined by the ways humans enter the control loop. The most direct example is the VLA model-expert collaboration framework, where a VLA executes for a set number of steps and an expert then executes one step, with the expert actions later reused as training data. At a 4:1 VLA/expert ratio, average human-executed steps drop from 100.12 to 17.78, a reduction of 82.24%, while collaborative learning further improves Octo from 0.692 to 0.730 in pure-VLA success and from 0.716 to 0.852 in collaborative success (Xiang et al., 6 Mar 2025). This is a sparse-intervention ManualVLA rather than a full shared-control blend.
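The alternation schedule and the reported reduction can be checked with a short sketch. The function names are hypothetical, and the strictly periodic schedule (a fixed number of VLA steps followed by one expert step) is an assumption about how the ratio is applied.

```python
from typing import List


def controller_for_step(step: int, vla_steps_per_expert_step: int) -> str:
    """Return which controller acts at a 0-indexed step under a periodic k:1 schedule."""
    cycle = vla_steps_per_expert_step + 1
    return "expert" if step % cycle == vla_steps_per_expert_step else "vla"


def schedule(num_steps: int, vla_steps_per_expert_step: int) -> List[str]:
    """Controller assignment for the first num_steps steps."""
    return [controller_for_step(s, vla_steps_per_expert_step) for s in range(num_steps)]


def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# At a 4:1 ratio the expert acts on every fifth step.
print(schedule(10, 4))
# Human-executed steps quoted in the text: 100.12 -> 17.78.
print(round(reduction_percent(100.12, 17.78), 2))  # 82.24
```

The second print reproduces the 82.24% figure from the two step counts quoted above, since $(100.12 - 17.78) / 100.12 \approx 0.8224$.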

VLAS expands the interface modality from text to raw speech. It integrates a Whisper encoder into a LLaVA-style manipulation policy and adds Voice RAG so that speaker identity can retrieve personalized knowledge. On CALVIN with speech instructions, VLAS reports length 3.70, compared with 3.13 for VLA + ASR and 3.41 for RoboFlamingo + ASR; on the customization benchmark it reaches 86.5% with Voice RAG and 16.0% without it (Zhao et al., 19 Feb 2025). The key lesson is that ManualVLA may need to preserve non-textual instruction information, not merely transcribe it.

RationalVLA addresses a different human-facing problem: instructions are not always valid. The RAMA benchmark contains over 14,000 samples, including defective instructions across six dimensions (visual, physical, semantic, motion, safety, and out-of-context), and RationalVLA uses a dual-system architecture to either reject or execute. The paper reports a 14.5% higher success rate and an improvement of 0.94 in average task length over baselines while maintaining competitive performance on standard manipulation tasks (Song et al., 12 Jun 2025). In a ManualVLA setting, this is the rationality layer that prevents physically harmful obedience.

Two adjacent systems illustrate how far this interface logic can extend beyond tabletop manipulation. DroneVLA combines natural-language object retrieval, Grounding DINO, dynamic A* planning, and MediaPipe-based human-centric handover; its VLA component predicts only binary gripper actions and was validated in Unity rather than in the real flight loop, while the integrated real-world system reports 0.164 m max error, 0.070 m mean Euclidean error, and 0.084 m RMSE for localization and navigation (Mehboob et al., 20 Jan 2026). UAV-VLA moves even higher in the stack, compiling text and satellite imagery into aerial mission plans; it is 6.5× faster than a human operator, but its trajectories are 21.6% longer on average and its best geometric agreement metric is 34.22 m mean KNN RMSE (Sautenkov et al., 9 Jan 2025). These systems are not central ManualVLA exemplars, but they show that language-conditioned action generation can operate above the direct control layer as well.

6. Limits, misconceptions, and open questions

A recurring misconception is that ManualVLA denotes a single new end-to-end foundation model. The literature instead contains adaptors, wrappers, tool interfaces, dual systems, and skill libraries. CompliantVLA-adaptor is explicitly a plug-in compliance and safety adaptor rather than a new foundation model (Zhang et al., 21 Jan 2026). SELF-VLA relies on explicit waypoint skills for contact-rich disassembly rather than on unconstrained end-to-end VLA generation (Liu et al., 10 Mar 2026). VLAs-as-Tools shows that planner wrapping alone can hurt unless the VLA is post-trained to behave as a faithful tool (Lei et al., 13 May 2026). ManualVLA, in practice, is therefore often a hybrid engineered stack.

A second misconception is that ManualVLA is synonymous with dexterous hand manipulation. Much of the strongest evidence still comes from arm-level end-effector control, parallel grippers, suction attachments, or translational impedance adaptation rather than from multi-finger tactile dexterity (Zhou et al., 26 Nov 2025, Ma et al., 1 Jul 2026, Zhang et al., 21 Jan 2026). This suggests that current ManualVLA research is broader than hand dexterity and narrower than fully general manual skill.

The limitations are correspondingly clear. CompliantVLA-adaptor reports that real-world zero-shot performance remains limited and that only the simplified task “keep pushing the red box straight ahead” is successfully completed (Zhang et al., 21 Jan 2026). TwinVLA still struggles on hard RoboTwin tasks and can forget single-arm skills after bimanual fine-tuning (Im et al., 7 Nov 2025). VacuumVLA lacks explicit sensing of true versus false suction, and its one-motor, one-valve, dual-cup design can leak when only one cup seals well (Zhou et al., 26 Nov 2025). FurnitureVLA depends on manual subtask decomposition and primitive boundaries, with 100 teleoperated demonstrations for the real-world IVAR task (Ma et al., 1 Jul 2026). DAM-VLA, despite strong gains, still routes only between two action models and remains a 7-DoF end-effector plus binary gripper framework rather than a full dexterous-hand architecture (Peng et al., 1 Mar 2026).

The open problems are therefore consistent across the literature: the next stage of ManualVLA will likely require tighter integration of tactile or force feedback, richer rejection and recovery logic, better real-robot scaling, more faithful planner-tool interfaces, and action representations that preserve both semantic flexibility and contact-level precision. Unified diffusion backbones and dual-level action representations already point in that direction, but the surveyed systems show that physical interaction competence is still being assembled from modular priors rather than solved by scale alone (Liu et al., 26 Mar 2026, Zhang et al., 1 Nov 2025).