Spatial-Aware VLA Pretraining

Updated 4 July 2026

Spatial-Aware VLA Pretraining is a class of methods that embeds explicit 3D geometry into models originally pretrained on 2D image-text pairs to improve robotic manipulation precision.
It employs diverse strategies such as 3D position encoding, pretrained geometry experts, structured spatial tokenization, and correspondence-based supervision to bridge the 2D-3D gap.
Reported results on benchmarks such as LIBERO indicate improved spatial reasoning and cross-embodiment transfer, and they challenge the misconception that spatial awareness amounts to simple depth concatenation.

Spatial-aware VLA pretraining denotes a class of training strategies that attempt to close the mismatch between predominantly $2$D visual-language pretraining and $3$D physical action execution in robotic manipulation. Across recent work, the common objective is to endow a vision-language-action model—or a precursor VLM, MLLM, geometry branch, or latent-action model—with spatial priors before or during policy learning, so that language grounding, perception, and control are expressed in a representation that is more faithful to scene geometry, object relations, embodiment constraints, and temporal evolution (Qu et al., 27 Jan 2025, Zhang et al., 27 Jun 2025, Feng et al., 15 Dec 2025, Yuan et al., 15 Oct 2025). This suggests that “spatial awareness” in VLA pretraining is not a single mechanism but a family of interventions spanning egocentric $3$D positional encoding, depth-aware experts, multi-view correspondence learning, geometry-aligned latent actions, structured spatial reasoning, and $4$D spatiotemporal interfaces.

1. Problem formulation and motivating gap

The motivating diagnosis is recurrent across the literature. Standard VLA models inherit strong semantic and language-grounding priors from pretrained VLMs, but these priors are usually learned from RGB image-text pairs and therefore provide limited geometric grounding for precise manipulation. SpatialVLA states that standard VLA models treat the robot’s world as a $2$D image plus discretized action bins and therefore lack an explicit $3$D workspace representation, which impairs cross-embodiment generalization, precise spatial reasoning, and transfer to new robots (Qu et al., 27 Jan 2025). DepthVLA similarly argues that existing VLAs rely on extensive action-data pretraining to ground VLMs in $3$D space, yet remain weak at fine geometric reasoning (Yuan et al., 15 Oct 2025). VIPA-VLA frames the issue as a direct $2$D $\rightarrow$ $3$D gap: most VLAs consume only $2$D images or video frames but must output $3$D physical actions, which leads to poor grounding between perception and action (Feng et al., 15 Dec 2025).

Several works refine this diagnosis into more specific failure modes. 4D-VLA identifies “coordinate system chaos,” arising when the robot body is partially or entirely out of view, and “state chaos,” arising when a single frame lacks temporal context needed to disambiguate action direction or latent task state (Zhang et al., 27 Jun 2025). SA-VLA describes a related problem at adaptation time: RL fine-tuning can erode the spatial inductive bias that was implicitly present in a pretrained flow-matching policy, producing brittle, phase-inconsistent trajectories under viewpoint shift or clutter (Pan et al., 31 Jan 2026). GASP broadens the critique beyond robotics, arguing that VLMs often fail at genuine $3$D spatial understanding and that fine-tuning only on $3$D VQA data may encourage memorization of dataset-specific biases rather than geometric competence (Yeh et al., 28 May 2026).

A plausible implication is that spatial-aware pretraining is best understood as an attempt to reduce ambiguity in the conditional action distribution by supplying geometric regularities earlier in the learning pipeline, rather than expecting those regularities to emerge solely from downstream imitation or reinforcement signals.

2. Spatial representations and geometry injection mechanisms

A central design axis is how geometric information is represented and injected. One family of methods augments RGB features with explicit $3$D position. SpatialVLA introduces Ego3D Position Encoding: depth $D$ is used to back-project each pixel into an egocentric $3$D point field $P$, a sinusoidal positional encoding $\gamma(P)$ is mapped through a small MLP, and the result is fused with SigLIP features $X$ as

$$O_{3d} = X + \mathrm{MLP}(\gamma(P))$$

so that each patch carries both appearance and camera-frame $3$D context (Qu et al., 27 Jan 2025). 4D-VLA performs a related operation in world coordinates by back-projecting each feature cell according to

$3$0

then adding a learnable $3$D positional embedding $3$2 before projection into spatial tokens (Zhang et al., 27 Jun 2025).
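The position-encoding pipeline described above can be illustrated with a minimal NumPy sketch. It is not either paper's implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the patch-resolution depth map, the octave-spaced frequencies, the randomly initialized two-layer MLP, and the additive fusion are all assumptions chosen for illustration. The helper `to_world` shows how a camera-frame point field could be moved into world coordinates under an assumed camera-to-world rotation and translation.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) camera-frame point field (pinhole model)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def to_world(points, rotation, translation):
    """Apply an assumed camera-to-world rigid transform to every point."""
    return points @ rotation.T + translation

def sinusoidal(points, num_freqs=4):
    """Encode each coordinate with sin/cos at octave-spaced frequencies."""
    freqs = 2.0 ** np.arange(num_freqs)               # (F,)
    angles = points[..., None] * freqs                # (H, W, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*points.shape[:-1], -1)        # (H, W, 3 * 2F)

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with a ReLU nonlinearity."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

rng = np.random.default_rng(0)
H, W, d = 4, 4, 8                                     # hypothetical patch grid and feature width
depth = rng.uniform(0.5, 2.0, size=(H, W))            # stand-in for estimated depth per patch
feats = rng.normal(size=(H, W, d))                    # stand-in for 2D patch features

points = backproject(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
enc = sinusoidal(points)
w1, b1 = rng.normal(size=(enc.shape[-1], 16)) * 0.1, np.zeros(16)
w2, b2 = rng.normal(size=(16, d)) * 0.1, np.zeros(d)

# Additive fusion (assumed): each patch keeps its appearance feature plus a 3D position embedding.
fused = feats + mlp(enc, w1, b1, w2, b2)
```

Because the fusion is additive, the token count and feature width are unchanged, so a downstream transformer would see the same sequence shape as in the RGB-only case.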

A second family uses pretrained geometry experts rather than directly encoding positions. DepthVLA introduces a pretrained monocular depth expert $3$3 alongside a VLM expert $3$4 and an action expert $3$5 inside a mixture-of-transformers architecture with fully shared self-attention and expert-specific feed-forward blocks. The block-wise attention mask preserves semantic and geometric priors by allowing VLM tokens to attend only to VLM tokens, depth tokens only to depth tokens, and action tokens to all streams (Yuan et al., 15 Oct 2025). Evo-0 injects geometry from a frozen Visual Geometry Grounded Transformer (VGGT) into a frozen PaliGemma-based VLA through a single cross-attention fuser, thereby retaining the RGB-only deployment interface while borrowing geometry representations from a visual geometry foundation model (Lin et al., 1 Jul 2025). GEAR-VLA adopts a closely related but semantically conservative strategy: a frozen $2$D ViT preserves the original VLM-aligned pathway, while a trainable VGGT-based $3$D backbone is connected through a zero-initialized fusion projection so that geometry cues are added without destabilizing pretrained semantics (Zhang et al., 7 Jun 2026).
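The block-wise attention pattern attributed to DepthVLA above can be written down directly as a boolean mask. The sketch below is a minimal illustration under assumptions: the token ordering (VLM, then depth, then action) and the function name `block_mask` are hypothetical, and the mask ignores any causal structure within a stream.

```python
import numpy as np

def block_mask(n_vlm, n_depth, n_action):
    """Return M where M[i, j] is True when query token i may attend to key token j.

    Assumed token order: VLM tokens, then depth tokens, then action tokens.
    """
    n = n_vlm + n_depth + n_action
    mask = np.zeros((n, n), dtype=bool)
    vlm = slice(0, n_vlm)
    dep = slice(n_vlm, n_vlm + n_depth)
    act = slice(n_vlm + n_depth, n)
    mask[vlm, vlm] = True      # VLM tokens attend only to VLM tokens
    mask[dep, dep] = True      # depth tokens attend only to depth tokens
    mask[act, :] = True        # action tokens attend to all streams
    return mask

mask = block_mask(n_vlm=3, n_depth=2, n_action=2)
```

In a shared self-attention layer, such a mask would typically be applied by setting disallowed logits to a large negative value before the softmax, so information flows from the semantic and depth streams into the action stream but not between the two pretrained streams.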

A third family compresses scene geometry into more structured spatial tokens. GST-VLA replaces scalar depth augmentation with a Gaussian Spatial Tokenizer that converts frozen dense depth and frozen semantic patch features into $3$8 anisotropic $3$D Gaussian primitives, each parameterized by a residual mean $4$0, log-scale covariance $4$1, and learned opacity $4$2 (Sarowar et al., 10 Mar 2026). The paper emphasizes that covariance encodes local surface orientation and opacity encodes geometric confidence, both unavailable from scalar depth alone. GeoAlign takes a different route: it post-trains an RGB geometry branch with robot-domain RGB-D supervision, discards the depth head after post-training, and retains a full image-space grid of Geometry-Enhanced Post-Trained tokens that are later queried by the robot’s proprioceptive state (Chen et al., 2 Jun 2026).

A fourth family focuses on internal transformer geometry rather than explicit depth tokens. GASP attaches a small correspondence head to every transformer layer and supervises it with multi-view point correspondences and depth consistency, thereby restructuring the model’s query-key space toward view-invariant, depth-aware matching without adding inference-time modules (Yeh et al., 28 May 2026). GAP-MLLM likewise argues that geometric priors are often present but inactive; it activates them through a geometry-aligned pretraining stage and a multi-level progressive fusion module with token-level gating (Zhang et al., 17 Mar 2026).

These representational choices embody different assumptions. Explicit $3$D position encoding privileges metric locality; pretrained geometry experts privilege transferable priors; structured tokens privilege compactness and inspectability; correspondence-based supervision privileges layerwise invariance. The literature does not converge on a single substrate, but it consistently rejects purely texture-driven $2$D patch features as sufficient for manipulation-grade spatial grounding.

3. Pretraining supervision and optimization objectives

Spatial-aware VLA pretraining is distinguished not only by representation choice but also by supervision choice. DepthVLA retains explicit depth prediction during policy training. Its action policy maps observation, language, and proprioception to an action chunk,

$4$5

and is trained with a joint objective

$4$6

where $4$7 is the flow-matching loss for continuous actions and $4$8 is the scale-invariant log loss used for depth estimation (Yuan et al., 15 Oct 2025). The important claim is that depth supervision is not treated as a one-time pretraining stage only; it is retained during policy training to preserve spatial reasoning.
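To make the two loss terms concrete, the following sketch implements a scale-invariant log depth loss in its commonly used form and a flow-matching loss over a linear noise-to-action path, then sums them. The variance weight `lam`, the linear interpolation path, the sampling of `t`, and the relative weight `depth_weight` are assumptions for illustration and are not taken from the DepthVLA paper.

```python
import numpy as np

def silog_loss(pred_depth, true_depth, lam=1.0, eps=1e-8):
    """Scale-invariant log loss: mean(d^2) - lam * mean(d)^2 with d = log(pred) - log(true)."""
    d = np.log(pred_depth + eps) - np.log(true_depth + eps)
    return np.mean(d ** 2) - lam * np.mean(d) ** 2

def flow_matching_loss(velocity_fn, actions, rng):
    """Regress the velocity of a straight path from Gaussian noise (t=0) to the action (t=1)."""
    noise = rng.normal(size=actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1))
    x_t = (1.0 - t) * noise + t * actions      # point on the interpolation path
    target = actions - noise                   # constant velocity along that path
    return np.mean((velocity_fn(x_t, t) - target) ** 2)

def joint_loss(l_action, l_depth, depth_weight=1.0):
    """Weighted sum of the action and depth terms (weight is a hypothetical hyperparameter)."""
    return l_action + depth_weight * l_depth

rng = np.random.default_rng(0)
true_depth = rng.uniform(0.5, 2.0, size=(8, 8))
scaled_pred = 2.0 * true_depth                               # correct up to a global scale
noisy_pred = true_depth * rng.uniform(0.8, 1.2, size=(8, 8)) # per-pixel relative errors

actions = rng.normal(size=(16, 7))                           # hypothetical batch of 7-D actions
zero_policy = lambda x_t, t: np.zeros_like(x_t)              # placeholder velocity predictor

l_depth = silog_loss(noisy_pred, true_depth)
l_action = flow_matching_loss(zero_policy, actions, rng)
total = joint_loss(l_action, l_depth, depth_weight=0.5)
```

With `lam=1.0` the depth term is the variance of the log error, so a prediction that is wrong only by a global scale factor incurs essentially zero loss, which is the property that gives the loss its name.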

GASP uses a different form of geometry supervision. It combines a contrastive correspondence loss $4$9 with a depth-consistency loss $2$0, together with the model’s original language modeling loss,

$2$1

with $2$2 and $2$3 in the reported implementation (Yeh et al., 28 May 2026). The conceptual shift is that geometry is learned from fundamental perceptual structure, namely cross-view correspondences and depth disambiguation, rather than from downstream $3$D VQA pairs.
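A contrastive correspondence objective of this kind can be sketched with an InfoNCE-style loss over matched points from two views. The specific InfoNCE form, the temperature, and the weighted-sum combination in `total_loss` are assumptions made for illustration; the weights `w_corr` and `w_depth` are hypothetical placeholders, not the values reported by the GASP authors.

```python
import numpy as np

def correspondence_loss(feat_a, feat_b, temperature=0.07):
    """InfoNCE over matched points: row i of feat_a corresponds to row i of feat_b."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # true matches lie on the diagonal

def total_loss(l_lm, l_corr, l_depth, w_corr, w_depth):
    """Language modeling loss plus weighted geometry terms (weights are hypothetical)."""
    return l_lm + w_corr * l_corr + w_depth * l_depth

rng = np.random.default_rng(0)
view_a = rng.normal(size=(16, 32))                       # features of 16 points in view A
view_b = view_a + 0.05 * rng.normal(size=(16, 32))       # same points seen from view B

loss_matched = correspondence_loss(view_a, view_b)
loss_shuffled = correspondence_loss(view_a, view_b[::-1])  # deliberately wrong pairing
```

The loss is low when features of the same physical point agree across views and high when the pairing is scrambled, which is the pressure toward view-invariant matching that the paragraph describes.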

VIPA-VLA separates supervision into two pretraining stages. Stage $1$ trains a cross-attention fusion layer on approximately $2$6K $3$D visual VQA pairs derived from human videos, using only standard cross-entropy on answers, while all other weights remain frozen. Stage $2$ extends the LLM vocabulary with motion tokens and trains autoregressively on approximately $2$9M video-instruction-motion examples derived from human wrist trajectories discretized into $3$0 bins per axis (Feng et al., 15 Dec 2025). This is explicit visual-physical alignment: first align semantics with $3$D visual structure, then align fused perception with tokenized $3$D motion.
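The motion-tokenization step can be illustrated with uniform per-axis binning and an offset into an extended vocabulary. This is a sketch under assumptions: the value range, the bin count `num_bins=8`, the base vocabulary size, and the layout of the new token ids are hypothetical and are not the settings used by VIPA-VLA.

```python
import numpy as np

def tokenize_motion(deltas, low, high, num_bins):
    """Map continuous per-axis values to integer bin indices in [0, num_bins - 1]."""
    clipped = np.clip(deltas, low, high)
    scaled = (clipped - low) / (high - low)                  # normalized to [0, 1]
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)

def detokenize_motion(tokens, low, high, num_bins):
    """Recover the bin-center value for each bin index."""
    return low + (tokens + 0.5) * (high - low) / num_bins

def to_vocab_ids(tokens, base_vocab_size, num_bins):
    """Place axis k's bins after the original vocabulary: id = base + k * num_bins + bin."""
    axis_offsets = np.arange(tokens.shape[-1]) * num_bins
    return base_vocab_size + axis_offsets + tokens

low, high, num_bins = -0.1, 0.1, 8                           # hypothetical range (meters) and bin count
deltas = np.array([[-0.10, 0.00, 0.10],
                   [0.03, -0.07, 0.25]])                     # last value is out of range and gets clipped
tokens = tokenize_motion(deltas, low, high, num_bins)
recon = detokenize_motion(tokens, low, high, num_bins)
vocab_ids = to_vocab_ids(tokens, base_vocab_size=32000, num_bins=num_bins)
```

Each axis receives its own block of new token ids, so an autoregressive model can emit a wrist displacement as a short sequence of discrete symbols alongside ordinary text tokens.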

Latent-action methods use another supervisory route. UniLACT learns modality-specific RGB and depth latent actions and a unified latent $3$3 through inverse-dynamics encoders, codebooks, and forward-dynamics decoders, then uses those unified latents as pseudo-labels for autoregressive VLA pretraining from RGB and language alone (Govind et al., 23 Feb 2026). SSM-VLA pretrains Farsighted-LAM with geometry-aware spatial encoding and multi-scale temporal modeling, using RGB reconstruction, depth loss, and future-keyframe prediction, and then inserts the learned latent structure into a three-stage VLA pipeline with VisualCoT, latent inference, and flow-matching action generation (Cai et al., 30 Sep 2025).

ST-VLA broadens the pretraining target from $3$D geometry to $4$D spatiotemporal structure. ST-VLM is fine-tuned on ST-Human with prompts for $3$6D grounding, trajectory prediction, depth offsets, spatial QA, and $3$7D planning, using cross-entropy for coordinate quantization, $3$8 losses on continuous depth offsets, QA classification loss, and trajectory RMSE (Wu et al., 14 Mar 2026). This suggests that spatial-aware pretraining increasingly treats temporal coherence and replanning as first-class supervision rather than as downstream control phenomena.

4. Architectural patterns for bridging semantics, geometry, and action

Despite varied details, several recurrent architectural motifs have emerged. One is dual-stream or multi-expert fusion. DepthVLA’s mixture-of-transformers, VIPA-VLA’s dual-encoder design, Evo-0’s frozen VLM plus frozen VGGT connected by cross-attention, and GEAR-VLA’s frozen $2$D pathway plus trainable $3$D pathway all exemplify the view that semantic grounding and geometry should be complementary but not destructively entangled (Yuan et al., 15 Oct 2025, Feng et al., 15 Dec 2025, Lin et al., 1 Jul 2025, Zhang et al., 7 Jun 2026). DepthVLA’s ablation, in which allowing VLM↔Depth cross-attention reduced Simpler performance to $3$1 from $3$2, is often read as evidence that naive fusion can corrupt geometry transfer rather than improve it (Yuan et al., 15 Oct 2025).
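The idea of adding geometry without disturbing pretrained semantics can be illustrated with a zero-initialized fusion projection, as mentioned for GEAR-VLA. The sketch below assumes token-aligned semantic and geometry features and a simple additive residual; the dimensions and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_sem, d_geo = 6, 16, 12                 # hypothetical sizes

semantic = rng.normal(size=(num_tokens, d_sem))      # output of the frozen semantic pathway
geometry = rng.normal(size=(num_tokens, d_geo))      # output of the trainable geometry pathway

def fuse(semantic, geometry, proj):
    """Add projected geometry features to the semantic features (assumed additive residual)."""
    return semantic + geometry @ proj

# At initialization the projection is all zeros, so the fused tokens equal the semantic tokens
# and the pretrained model's behavior is unchanged.
proj_init = np.zeros((d_geo, d_sem))
fused_at_init = fuse(semantic, geometry, proj_init)

# After some training the projection is nonzero; a small random perturbation stands in for
# a gradient update here.
proj_trained = proj_init + 0.01 * rng.normal(size=proj_init.shape)
fused_later = fuse(semantic, geometry, proj_trained)
```

Starting from an exact identity on the semantic pathway means geometry can only enter as fast as the optimizer finds it useful, which is one way to avoid the destructive entanglement that naive fusion risks.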

A second motif is query-based spatial selection. GeoAlign does not average-pool the geometry grid; instead, the robot’s proprioceptive state generates $3$3 geometry queries that cross-attend to the GEP feature map, producing compact, phase-dependent geometry tokens for action prediction (Chen et al., 2 Jun 2026). GST-VLA similarly uses learned pooling queries to concentrate a fixed token budget on geometrically salient regions (Sarowar et al., 10 Mar 2026). These designs reject the assumption that all spatial regions are equally action-relevant.
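Query-based spatial selection can be sketched as single-head cross-attention in which the queries are generated from the robot state. The state dimension, number of queries, feature width, grid size, and the linear query generator `w_query` are all hypothetical choices for illustration, not GeoAlign's or GST-VLA's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def state_conditioned_queries(state, w_query, num_queries, dim):
    """Generate a set of queries from the proprioceptive state with one linear map."""
    return (state @ w_query).reshape(num_queries, dim)

def cross_attend(queries, grid_tokens):
    """Each query pools the geometry grid with its own attention weights."""
    scores = queries @ grid_tokens.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)                # (num_queries, num_grid_tokens)
    return weights @ grid_tokens, weights

rng = np.random.default_rng(0)
state_dim, num_queries, dim = 7, 4, 8                 # hypothetical sizes
grid_tokens = rng.normal(size=(16 * 16, dim))         # flattened image-space geometry grid
w_query = rng.normal(size=(state_dim, num_queries * dim)) * 0.5

state = rng.normal(size=state_dim)                    # e.g. joint angles and gripper opening
queries = state_conditioned_queries(state, w_query, num_queries, dim)
geometry_tokens, weights = cross_attend(queries, grid_tokens)
```

Unlike average pooling, the attention weights change with the state, so different phases of a task can draw on different regions of the same geometry grid while the output stays at a fixed, small number of tokens.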

A third motif is explicit intermediate reasoning. GraphCoT-VLA, according to its abstract, adds a structured Chain-of-Thought reasoning module integrating high-level task understanding and planning, failed task feedback, and low-level imaginative reasoning about future object positions and robot actions, together with a real-time updatable $3$D Pose-Object graph (Huang et al., 11 Aug 2025). GST-VLA operationalizes this trend with a supervised Depth-Aware Chain-of-Thought composed of four structured thoughts: $3$D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse $3$6 waypoints (Sarowar et al., 10 Mar 2026). SSM-VLA introduces a VisualCoT stage that predicts the next state before latent planning and action generation (Cai et al., 30 Sep 2025). ST-VLA externalizes high-level spatiotemporal reasoning into a VLM that outputs normalized $2$D trajectories, relative depth offsets, and sub-instructions, which are then lifted into a unified $3$D–$4$D representation for a lower-level controller (Wu et al., 14 Mar 2026).

A fourth motif is staged training. GeoAlign separates geometry-enhanced post-training from policy learning (Chen et al., 2 Jun 2026). GST-VLA uses a three-stage curriculum: pretrain GST and action expert with depth and flow losses, activate LoRA and DA-CoT supervision, then jointly fine-tune all non-frozen parameters (Sarowar et al., 10 Mar 2026). Green-VLA adopts a five-stage curriculum from foundational VLMs through multimodal grounding, multi-embodiment robotics pretraining, embodiment-specific adaptation, and RL policy alignment (Apanasevich et al., 31 Jan 2026). The repeated use of curricula suggests that spatial competence is often easier to preserve when introduced progressively rather than through a single monolithic optimization.

5. Action representations, embodiment transfer, and spatiotemporal grounding

Spatial-aware pretraining does not stop at perception. Many methods modify the action side so that geometry-aware representations remain usable across robots, tasks, and timescales. SpatialVLA’s Adaptive Action Grids discretize translation, rotation, and gripper state into three tokens per step, with Gaussian-equal-probability bins derived from the empirical action distribution and re-discretizable for new robots through embedding interpolation (Qu et al., 27 Jan 2025). The stated purpose is to learn generalizable and transferable spatial action knowledge for cross-robot control. GEAR-VLA pursues embodiment invariance more explicitly through relative end-effector $2$0 actions, embodiment-aware states, embodiment-invariant actions, and lightweight state projectors that confine robot differences to the low-level interface (Zhang et al., 7 Jun 2026). Green-VLA formulates a unified $2$1-dimensional semantic action layout with embodiment masks and mapping functions $2$2 and $2$3, so that joint targets, Cartesian deltas, gripper commands, and mobile-base velocities share one semantic space across humanoids, mobile manipulators, and fixed-base arms (Apanasevich et al., 31 Jan 2026).
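The notion of Gaussian-equal-probability bins can be illustrated in one dimension: fit a Gaussian to the empirical action distribution and place bin edges at its equally spaced quantiles. This is a simplified sketch, not SpatialVLA's implementation; the one-dimensional setting, the bin count, and the synthetic data are assumptions, and the embedding interpolation used for new robots is not shown.

```python
from statistics import NormalDist

import numpy as np

def equal_probability_edges(samples, num_bins):
    """Inner bin edges at the k/num_bins quantiles of a Gaussian fitted to the samples."""
    dist = NormalDist(mu=float(np.mean(samples)), sigma=float(np.std(samples)))
    return np.array([dist.inv_cdf(k / num_bins) for k in range(1, num_bins)])

def to_bins(values, inner_edges):
    """Assign each value a bin index in [0, len(inner_edges)]."""
    return np.searchsorted(inner_edges, values)

rng = np.random.default_rng(0)
num_bins = 8                                                   # hypothetical bin count
actions = rng.normal(loc=0.02, scale=0.05, size=20000)         # synthetic 1-D action deltas

edges = equal_probability_edges(actions, num_bins)
tokens = to_bins(actions, edges)
counts = np.bincount(tokens, minlength=num_bins)

# A robot with a different action scale gets its own edges from the same procedure.
new_robot_actions = rng.normal(loc=0.0, scale=0.2, size=5000)
new_edges = equal_probability_edges(new_robot_actions, num_bins)
```

Bins are narrow where actions are frequent and wide in the tails, so each token is used roughly equally often, and recomputing the edges for a new robot keeps that property even when its action statistics differ.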

Temporal grounding is equally prominent. 4D-VLA introduces sequential RGB-D input, memory bank sampling, and relative temporal embeddings so that “where” and “when” are jointly represented (Zhang et al., 27 Jun 2025). ST-VLA replaces brittle $2$D intermediate representations with a unified $3$D–$4$D interface $2$7, where $2$8 is an end-effector path and $2$9 is a smooth spatial mask built around a spatial tube $\rightarrow 3$ 0 (Wu et al., 14 Mar 2026). SSM-VLA addresses temporal fragility in latent action models by encoding multiple future keyframes jointly and reconstructing both RGB and depth, thereby encouraging latent actions to capture “what moves” and “where in space” over longer horizons (Cai et al., 30 Sep 2025).

This combination of spatial and action redesign indicates that pretraining is increasingly concerned with the geometry of control interfaces themselves, not only with augmenting perception. A plausible implication is that spatially aware VLA pretraining becomes more effective when the model’s output vocabulary or latent control space already factors motion in a geometrically meaningful manner.

6. Empirical trends, misconceptions, and unresolved issues

The empirical record shows consistent gains on geometry-sensitive tasks, but the mechanisms of improvement are not identical. SpatialVLA reports $\rightarrow 3$ 1 average success on LIBERO, with its strongest suite result on LIBERO-Spatial at $\rightarrow 3$ 2, and attributes substantial gains to Ego3D and Adaptive Action Grids (Qu et al., 27 Jan 2025). 4D-VLA reports $\rightarrow 3$ 3 average on LIBERO versus $\rightarrow 3$ 4 for OpenVLA, $\rightarrow 3$ 5 on unseen-view MV-Bench versus $\rightarrow 3$ 6 for OpenVLA, and $\rightarrow 3$ 7 on real-world tasks with its full pretrain+coord+hist configuration (Zhang et al., 27 Jun 2025). DepthVLA reports $\rightarrow 3$ 8 average success on LIBERO, $\rightarrow 3$ 9 on Simpler WidowX, and $2$0 versus $2$1 progress in real-robot tasks, with only $2$2 ms latency over $2$3 (Yuan et al., 15 Oct 2025). GeoAlign reports $2$4 on LIBERO, $2$5 across three SimplerEnv-Fractal tasks, and $2$6 on eight geometry-critical real-world ALOHA tasks (Chen et al., 2 Jun 2026). GST-VLA reports $2$7 on LIBERO and $2$8 on SimplerEnv, with ablations isolating contributions from Gaussian parameterization, attention pooling, CoT thoughts, and curriculum stages (Sarowar et al., 10 Mar 2026).

Several common misconceptions are challenged by these results. First, spatial awareness is not equivalent to merely concatenating a depth map. DepthVLA explicitly argues that it “does not merely concatenate a depth map to an image embedding,” but instead weaves geometric representations into every layer of a shared-attention MoT (Yuan et al., 15 Oct 2025). GST-VLA similarly argues that scalar depth lacks orientation and confidence information (Sarowar et al., 10 Mar 2026). Second, stronger spatial reasoning does not necessarily require explicit depth at inference time. Evo-0 uses frozen VGGT geometry with pure RGB input at deployment (Lin et al., 1 Jul 2025); GeoAlign is RGB-only at test time after geometry post-training (Chen et al., 2 Jun 2026); UniLACT transfers depth-aware latent priors into an RGB-language policy (Govind et al., 23 Feb 2026). Third, more direct geometry input is not always better. DepthVLA reports that feeding ground-truth depth instead of predicting it underperformed on LIBERO, $2$9 versus $3$00, and the authors attribute this to modality-competition effects (Yuan et al., 15 Oct 2025). Fourth, fine-tuning on spatial downstream tasks alone may not produce robust internal geometry. GASP shows that standard VLM internal correspondence matching accuracy can remain below $3$01, whereas pretraining on view-invariant correspondence and depth consistency raises peak layer-wise correspondence above $3$02 and improves downstream spatial benchmarks without any $3$D VQA training (Yeh et al., 28 May 2026).

Limitations remain pronounced. Monocular depth still inherits failure modes on reflective, transparent, and texture-less surfaces and on very fine edges, as noted by DepthVLA (Yuan et al., 15 Oct 2025). GeoAlign’s improvements depend on the quality of robot-domain RGB-D supervision during post-training (Chen et al., 2 Jun 2026). SA-VLA notes that Reach-Place-Leave reward shaping may not directly extend to deformable objects or fine-grained fingertip manipulation (Pan et al., 31 Jan 2026). Evo-0 depends on synchronized multi-view input and on the pretraining domain of VGGT (Lin et al., 1 Jul 2025). More generally, the field continues to trade off inference simplicity, annotation cost, geometric fidelity, and embodiment coverage.

Taken together, the literature indicates that spatial-aware VLA pretraining is moving from ad hoc depth augmentation toward a broader program of geometric induction. That program includes explicit visual-physical alignment from human videos, layerwise correspondence supervision, structured spatial thoughts, geometry-aware latent actions, state-conditioned geometry querying, and embodiment-canonical action interfaces. The unifying claim is not that any single spatial representation is universally optimal, but that VLA performance in open-world manipulation improves when geometric structure is embedded into the model before policy optimization must infer it implicitly from sparse task supervision.