VIPA-VLA: 3D Visual-Physical Alignment

Updated 4 July 2026

VIPA-VLA is a vision-language-action framework that pretrains using a dual-encoder design to align 2D semantic features with 3D spatial structures.
It leverages human video data and dense point cloud calibration to bridge the gap between 2D perception and 3D physical robot manipulation.
The two-stage pretraining approach enhances spatial reasoning and motion prediction, yielding improved performance in simulation and real-robot tasks.

VIPA-VLA is a vision-language-action (VLA) framework instantiated within the paradigm of spatial-aware VLA pretraining through visual-physical alignment from human videos. It addresses a central limitation of many VLA systems: the use of 2D visual inputs to generate actions in 3D physical environments, which creates a gap between perception and action grounding. The framework augments pretrained vision-language models with explicit 3D spatial features, aligns semantic visual tokens with physical-space representations during pretraining, and then adapts the resulting model to downstream robot manipulation tasks in simulation and on real robots (Feng et al., 15 Dec 2025).

1. Problem setting and conceptual scope

VIPA-VLA is designed for the VLA regime in which a model must map visual observations and language instructions to actions, but do so with stronger 2D-to-3D grounding than is typical in purely image-based policies. The motivating claim is that most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, and that this induces a significant mismatch between what is perceived and what must be executed physically. The proposed remedy is a pretraining paradigm that performs explicit alignment between visual space and physical space before robot policy learning, using large-scale human demonstration videos as supervision for both 3D visual reasoning and 3D action understanding (Feng et al., 15 Dec 2025).

Within that paradigm, VIPA-VLA is the concrete model instantiation. Its defining property is a dual-encoder design: one encoder preserves high-level semantic content from an off-the-shelf vision-language backbone, while the other injects 3D-aware structure derived from dense point-cloud estimation. This makes the framework neither a purely geometric policy nor a standard end-to-end VLM finetuning recipe. Instead, it is a pretraining-centered approach that seeks to endow a VLA model with spatial priors before downstream robot adaptation.

A common source of confusion is nomenclature. VIPA-VLA should be distinguished from the unrelated privacy-oriented work "VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models," which concerns adversarial concealment of region-of-interest content in vision-language models rather than robotic spatial grounding (Meftah et al., 11 Jul 2025).

2. Dual-encoder architecture and fusion mechanism

The architecture consists of a frozen semantic vision encoder, a pretrained 3D vision encoder, and a cross-attention fusion layer. The semantic branch is an off-the-shelf vision-language model backbone such as InternVL3.5-2B, which produces a sequence of 2D semantic tokens $V_{\mathrm{sem}} \in \mathbb{R}^{N_v \times d_v}$. The spatial branch is a pretrained Cut3R network that estimates dense point clouds $P$ and yields 3D spatial tokens $V_{\mathrm{spa}} \in \mathbb{R}^{N_s \times d_s}$. The explicit intent is to augment semantic visual representations with 3D-aware features rather than replace them (Feng et al., 15 Dec 2025).

Fusion is performed by projecting the two token streams into a common attention space and allowing semantic tokens to query 3D tokens:

$Q = V_{\mathrm{sem}} W_Q,\qquad K = V_{\mathrm{spa}} W_K,\qquad V = V_{\mathrm{spa}} W_V,$

$F_{\mathrm{spa}} = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V.$

The fused representation is then injected back into the semantic stream via

$V_f = V_{\mathrm{sem}} + \alpha \cdot F_{\mathrm{spa}},$

with $\alpha$ a learnable scalar initialized to $0.5$, followed by LayerNorm and dropout.
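The fusion equations can be traced in a few lines of NumPy. This is a minimal single-head sketch, not the authors' implementation: the function names, the toy dimensions, and the random weights are hypothetical, dropout is omitted (as at inference time), LayerNorm is applied without learnable affine parameters, and $W_V$ is assumed to project to $d_v$ so that the residual addition is well defined.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature dimension (no affine parameters).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def fuse(v_sem, v_spa, w_q, w_k, w_v, alpha=0.5):
    """Semantic tokens query spatial tokens; the result is added residually."""
    q = v_sem @ w_q                                   # (N_v, d_k)
    k = v_spa @ w_k                                   # (N_s, d_k)
    v = v_spa @ w_v                                   # (N_s, d_v), assumed shape
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N_v, N_s)
    f_spa = attn @ v                                  # (N_v, d_v)
    v_f = v_sem + alpha * f_spa                       # residual injection
    return layer_norm(v_f), attn

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
n_v, n_s, d_v, d_s, d_k = 6, 10, 8, 12, 8
v_sem = rng.normal(size=(n_v, d_v))
v_spa = rng.normal(size=(n_s, d_s))
w_q = rng.normal(size=(d_v, d_k))
w_k = rng.normal(size=(d_s, d_k))
w_v = rng.normal(size=(d_s, d_v))
v_f, attn = fuse(v_sem, v_spa, w_q, w_k, w_v)
```

The output keeps the shape of the semantic stream, which is what lets the fused tokens be passed to the language model in place of the original visual tokens.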

The language side is extended to represent motion explicitly. The LLM token embedding space is enlarged by $K^3$ discrete motion tokens, obtained by discretizing positions along $x$, $y$, and $z$ into $K$ bins per axis. A 3D waypoint is quantized to its corresponding discrete bin indices. This design ties visual grounding to a tokenized action vocabulary, permitting the same model family to answer spatial questions during pretraining and to predict future motion sequences during action pretraining.
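One way to read a vocabulary of $K^3$ motion tokens is as a joint index over a $K \times K \times K$ grid. The sketch below assumes uniform bins over a fixed workspace and a row-major joint index; the bounds, the value of `K`, and the function names are hypothetical and are not taken from the paper.

```python
def quantize_axis(value, lo, hi, k):
    """Map a coordinate in [lo, hi] to one of k uniform bins, clamping outliers."""
    ratio = (value - lo) / (hi - lo)
    return min(k - 1, max(0, int(ratio * k)))

def waypoint_to_token(waypoint, bounds, k):
    """Combine per-axis bin indices into a single token id in [0, k**3)."""
    ix, iy, iz = (
        quantize_axis(c, lo, hi, k) for c, (lo, hi) in zip(waypoint, bounds)
    )
    return (ix * k + iy) * k + iz

def token_to_bins(token, k):
    """Invert waypoint_to_token back to per-axis bin indices."""
    ix, rem = divmod(token, k * k)
    iy, iz = divmod(rem, k)
    return ix, iy, iz

K = 4  # hypothetical bin count per axis
BOUNDS = [(-0.5, 0.5), (-0.5, 0.5), (0.0, 1.0)]  # hypothetical workspace in metres
token = waypoint_to_token((0.1, -0.3, 0.6), BOUNDS, K)
bins = token_to_bins(token, K)
```

With these toy settings the waypoint falls in bins (2, 0, 2) and maps to token 34. A larger `K` gives finer spatial resolution at the cost of a cubically larger vocabulary.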

Architecturally, the model is lightweight in the sense that the main semantic encoder remains frozen during the first stage, and the core alignment burden is concentrated in the fusion layer. A plausible implication is that the framework treats 3D grounding as an interface problem between pretrained semantic representations and spatial evidence, rather than as a requirement to retrain an entire multimodal backbone from scratch.

3. Human-video supervision and visual-physical alignment data

The supervision pipeline is built from human manipulation videos drawn from ARCTIC, HOI4D, FPHA, H2O, OAKInk2, TACO, Dex-YCB, EgoDex, Taste-Rob, and related sources. Raw hand joint annotations are fit to the MANO model to obtain 3D hand-joint sets. Dense point clouds $P$ are estimated per frame with Cut3R, while 2D object proposals are generated by Gemini-2.5-Flash together with GroundingDINO to produce object bounding boxes (Feng et al., 15 Dec 2025).

A crucial step is scale calibration between the estimated point cloud and the hand annotations. The absolute hand-joint depths are compared with the corresponding depths read from the estimated point cloud $P$, and a scale factor is computed from the two sets of depths. The point cloud is then rescaled by this factor so that hands and objects lie in a unified real-world coordinate frame. Hand joints are projected using the camera intrinsics and extrinsics and filtered by visibility.
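The calibration step can be sketched as follows. The estimator used here, the median of per-joint depth ratios, is an assumption chosen for robustness to outliers and is not confirmed by the text above. The function names and toy values are hypothetical, and points are assumed to be already expressed in the camera frame, so extrinsics are omitted.

```python
import numpy as np

def estimate_scale(hand_depths, cloud_depths):
    """Median ratio of metric hand-joint depth to estimated depth (assumed estimator)."""
    hand = np.asarray(hand_depths, dtype=float)
    cloud = np.asarray(cloud_depths, dtype=float)
    valid = cloud > 0  # ignore joints with no usable depth estimate
    return float(np.median(hand[valid] / cloud[valid]))

def project_points(points_cam, intrinsics):
    """Pinhole projection of camera-frame 3D points to pixel coordinates."""
    uvw = points_cam @ intrinsics.T
    return uvw[:, :2] / uvw[:, 2:3]

def visible_mask(pixels, depths, width, height):
    """Keep points in front of the camera that land inside the image."""
    u, v = pixels[:, 0], pixels[:, 1]
    return (depths > 0) & (u >= 0) & (u < width) & (v >= 0) & (v < height)

# Toy data: the estimated cloud is half the metric scale.
scale = estimate_scale([0.50, 0.62, 0.40], [0.25, 0.31, 0.20])
cloud = np.array([[0.0, 0.0, 0.5], [0.05, -0.1, 1.0], [2.0, 0.0, 0.5]]) * scale
intrinsics = np.array([[500.0, 0.0, 320.0],
                       [0.0, 500.0, 240.0],
                       [0.0, 0.0, 1.0]])
pixels = project_points(cloud, intrinsics)
mask = visible_mask(pixels, cloud[:, 2], width=640, height=480)
```

In this toy case the scale is 2.0, and the third point projects outside a 640x480 image and is filtered out.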

The dataset is split into two supervisory channels. The first comprises visual question-answer pairs covering four categories: spatial relation, task completion, hand movement, and camera movement. For directional annotation, an offset vector between two 3D positions is normalized, and directions are discretized into labels such as right/left, up/down, and forward/backward according to component thresholds.

The second channel comprises instruction-motion pairs, produced by extracting wrist trajectories from MANO over time and quantizing them into motion tokens. Text instructions and interaction labels are created via Gemini-2.5-Flash/Pro for three task types: instructional motion generation, contextual motion prediction, and motion translation.
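The directional labeling rule can be illustrated with a small function. The axis convention (+x right, +y up, +z forward), the threshold value, and the function name are assumptions made for this sketch; the paper's actual thresholds and camera convention may differ.

```python
import math

# Assumed convention: +x is right, +y is up, +z is forward.
AXIS_LABELS = [("right", "left"), ("up", "down"), ("forward", "backward")]

def direction_labels(offset, threshold=0.5):
    """Return a label for each axis whose normalized component passes the threshold."""
    norm = math.sqrt(sum(c * c for c in offset))
    if norm == 0:
        return []  # no displacement, so no direction
    unit = [c / norm for c in offset]
    labels = []
    for component, (positive, negative) in zip(unit, AXIS_LABELS):
        if component >= threshold:
            labels.append(positive)
        elif component <= -threshold:
            labels.append(negative)
    return labels

labels = direction_labels((0.3, 0.0, -0.4))
```

Here the offset normalizes to roughly (0.6, 0.0, -0.8), so the function returns `["right", "backward"]`. Normalizing first makes the labels depend on direction only, not on how far the hand or object moved.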

This construction is notable because it does not require robot-collected 3D annotations at pretraining time. Instead, it uses human video as a source of supervision for visual-physical alignment. That choice suggests a transfer hypothesis: if a model learns to align image evidence with 3D hand-object structure before robot finetuning, downstream robotic policies may inherit stronger spatial grounding.

4. Two-stage pretraining and downstream robotics adaptation

The pretraining pipeline has two tightly coupled stages. In Stage 1, 3D-visual pretraining aligns 2D semantic features with 3D spatial features. The semantic encoder, 3D encoder, and LLM are frozen, and only the fusion layer is trained. The model receives several consecutive frames and a natural-language question about spatial relations, computes the fused features $V_f$, and predicts an answer distribution. The loss is a cross-entropy objective over VQA targets (Feng et al., 15 Dec 2025).

In Stage 2, 3D-action pretraining teaches motion priors. Both vision encoders remain frozen, while the fusion layer and LLM are trained jointly after extending the LLM vocabulary with motion tokens. The input is a single frame and an instruction, and the model autoregressively predicts the future motion token sequence. The Stage 2 objective is defined over this predicted sequence, and the total pretraining loss combines the objectives of the two stages.
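For autoregressive token prediction, a common objective is the mean next-token cross-entropy. The sketch below assumes that form purely for illustration; the exact objective and any weighting between the two stages should be taken from the paper. The function names and toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_nll(step_logits, target_tokens):
    """Mean negative log-likelihood of the targets, one logit vector per step."""
    total = 0.0
    for logits, target in zip(step_logits, target_tokens):
        total -= log_softmax(logits)[target]
    return total / len(target_tokens)

# An uninformed model over a 4-token vocabulary versus a confident, correct one.
uniform_loss = sequence_nll([[0.0] * 4, [0.0] * 4], [1, 3])
confident_loss = sequence_nll([[0.0, 9.0, 0.0, 0.0], [0.0, 0.0, 0.0, 9.0]], [1, 3])
```

The uniform case gives a loss of log 4, the entropy of a four-way guess, while the confident case is close to zero.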

After pretraining, the model is adapted to robotics with a flow-matching objective. Given a ground-truth action chunk and the robot state, one samples a flow-matching time and Gaussian noise, forms a noised version of the action chunk from them, concatenates the action-head inputs, conditions a DiT on features from the VLM, and minimizes the flow-matching loss.
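A flow-matching training step of this kind can be sketched in NumPy. The convention assumed here is a linear path from noise (at time 0) to data (at time 1) with a constant velocity target. Other sign and time conventions exist, and the text above does not confirm which one VIPA-VLA uses. The predictor is a stand-in for the DiT action head, and all names are hypothetical.

```python
import numpy as np

def flow_matching_pair(actions, rng):
    """Sample a time and noise, then build the noised chunk and its velocity target."""
    tau = rng.uniform(0.0, 1.0)
    noise = rng.normal(size=actions.shape)
    noised = tau * actions + (1.0 - tau) * noise   # assumed linear interpolation
    target_velocity = actions - noise              # d(noised)/d(tau) on this path
    return tau, noised, target_velocity

def flow_matching_loss(predict_velocity, actions, state, context, rng):
    """Mean squared error between predicted and target velocity."""
    tau, noised, target = flow_matching_pair(actions, rng)
    pred = predict_velocity(noised, state, context, tau)
    return float(np.mean((pred - target) ** 2))

def zero_policy(noised, state, context, tau):
    # Stand-in for the DiT head: always predicts zero velocity.
    return np.zeros_like(noised)

rng = np.random.default_rng(0)
actions = rng.normal(size=(16, 7))   # toy chunk: 16 steps, 7 action dimensions
state = np.zeros(7)                  # toy proprioceptive state
context = np.zeros(32)               # toy VLM conditioning features
loss = flow_matching_loss(zero_policy, actions, state, context, rng)
tau, noised, target = flow_matching_pair(actions, rng)
```

Under this convention, following the target velocity from the noised sample for the remaining time `1 - tau` lands exactly on the ground-truth chunk, which is what the learned head approximates when actions are generated by integration at inference time.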

Functionally, Stage 1 teaches the model to answer spatial questions from fused 2D and 3D evidence, whereas Stage 2 teaches it to express future motion in a discrete 3D action vocabulary. The downstream flow-matching phase then converts those priors into robot control.
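The freezing schedule of the two pretraining stages can be written as a small lookup table. The module and stage names are hypothetical labels for the components described above. The downstream flow-matching phase is left out because the text does not state which modules are trainable there.

```python
MODULES = ("semantic_encoder", "spatial_encoder", "fusion_layer", "llm")

# Modules that receive gradient updates in each pretraining stage.
TRAINABLE = {
    "stage1_3d_visual": {"fusion_layer"},
    "stage2_3d_action": {"fusion_layer", "llm"},
}

def trainable_flags(stage):
    """Return a per-module flag: True if trained in this stage, False if frozen."""
    return {name: name in TRAINABLE[stage] for name in MODULES}

stage1 = trainable_flags("stage1_3d_visual")
stage2 = trainable_flags("stage2_3d_action")
```

In a deep-learning framework, these flags would typically drive which parameter groups are passed to the optimizer.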

5. Quantitative performance across reasoning, simulation, and real robots

The reported evaluations cover unseen VQA pairs for 3D-spatial reasoning, LIBERO manipulation, RoboCasa, and real-robot tasks. The most direct comparisons are summarized below.

Setting	Comparator(s) from the paper	Reported for VIPA-VLA
3D-spatial reasoning	InternVL3.5 baseline; +3D-visual pretraining only	distance error, direction score
LIBERO average	SpatialVLA; GR00T N1.5	average success rate
RoboCasa	GR00T (Doors/Drawers average)	Doors/Drawers improvement, overall average
Real robot	InternVL3.5	Put-3-Obj, Wipe-Board, Water-Plant (subtask/whole-task)

On unseen VQA pairs, the progression from InternVL3.5 to a 3D-visual-pretrained variant and then to full VIPA-VLA shows steady improvement in both distance estimation and directional reasoning, with the full model reaching the lowest distance error and the highest direction score of the three. This is the clearest direct evidence that the alignment machinery improves 3D-spatial reasoning rather than merely downstream control (Feng et al., 15 Dec 2025).

On LIBERO, evaluated over a fixed number of trials per suite in the single-view setting, the paper reports VIPA-VLA success rates on the Spatial, Object, Goal, and Long suites together with their average, and compares that average against those of SpatialVLA and GR00T N1.5. On RoboCasa, evaluated over multiple tasks with repeated trials each, the paper reports an increase in the Doors/Drawers average over GR00T as well as an overall average.

On real-robot evaluation with Franka+Inspire, results are reported as subtask/whole-task scores for Put-3-Obj, Wipe-Board, and Water-Plant, with InternVL3.5 as the comparison baseline. Across these settings, the performance pattern is consistent with the stated claim that spatially aware pretraining improves grounding between 2D vision and 3D action.

6. Ablations, failure modes, and relation to adjacent VLA work

The ablation studies separate the effect of the dual-encoder design from the effect of spatial-aware pretraining. On LIBERO, full VIPA-VLA attains the highest average. Removing 3D-visual pretraining lowers it, removing the dual encoder lowers it, and removing both components lowers it as well. The paper further states that Stage 1 yields large spatial gains, whereas Stage 2 further improves action-level grounding, with qualitative visualizations indicating smoother trajectories (Feng et al., 15 Dec 2025).

The reported failure modes are also specific. VIPA-VLA errors are described as fine-grained, such as slight grasp offsets, while baseline failures are described as gross 2D-3D mislocalization. This distinction matters because it indicates not merely better aggregate success rates but a shift in the error regime from coarse grounding failures toward smaller execution inaccuracies.

In the broader VLA literature, VIPA-VLA belongs to a family of approaches that make grounding more explicit, but it does so at pretraining time rather than only at inference time. VP-VLA decouples high-level reasoning and low-level execution through a structured visual prompting interface, using a "System 2 Planner" and a "System 1 Controller," and reports gains on Robocasa-GR1-Tabletop and SimplerEnv (Wang et al., 23 Mar 2026). ReViP addresses a different failure mode, namely state-dominant bias and false completion, by using an external VLM as a task-stage observer and Vision-Proprioception Feature-wise Linear Modulation, and reports improvements on its False-Completion Benchmark Suite, LIBERO, RoboTwin 2.0, and real-world evaluation (Li et al., 23 Jan 2026). This suggests a broader research trend toward explicit interfaces for grounding, whether the interface is 3D visual-physical alignment, structured visual prompts, or modulation of vision-proprioception coupling.

VIPA-VLA is therefore best understood not as a generic VLA policy, and not as a privacy or adversarial method despite the similarity in acronyms, but as a spatial-aware pretraining framework for improving 2D-to-3D grounding in robot learning. Its central claim is that explicit visual-physical alignment from human videos can provide a useful intermediate substrate between pretrained vision-language semantics and downstream robotic action.