Multi-Viewpoint Latent Action Model (MVP-LAM)

Updated 16 May 2026

The paper introduces a multi-view training mechanism that disentangles viewpoint-specific features to produce compact, action-centric latent codes.
It employs synchronized multi-view encoding, spatio-temporal transformers, and VQ-VAE quantization to align representations across perspectives.
Empirical results show that MVP-LAM enhances transferability, generalization, and robustness in robotic tasks compared to scene-centric approaches.

The Multi-Viewpoint Latent Action Model (MVP-LAM) encompasses a family of approaches in which action representations are learned by leveraging synchronized multi-view observational data in order to disentangle viewpoint-specific artifacts from agent-intrinsic action dynamics. MVP-LAM methods underpin state-of-the-art robot learning by producing action-centric latent codes from diverse visual sequences, directly enhancing transfer, generalization, and downstream Vision-Language-Action (VLA) policy performance (Lee et al., 3 Feb 2026, Xiao et al., 12 May 2026, Jeong et al., 6 Jan 2026). In MVP-LAM, a latent action is defined as a compact, usually discrete, code that summarizes the underlying agent's action in a view-invariant manner, typically by jointly encoding visual transitions from multiple spatial perspectives.

1. Definition and Theoretical Foundations

MVP-LAM defines a mapping from sequences of high-dimensional, time-synchronized visual inputs $I^\nu_t$ (from multiple viewpoints $\nu$) to discrete or continuous latent action representations $z_t$ that are maximally informative about the underlying ground-truth action $A_t$. The core insight is that, by forcing a latent code extracted from one viewpoint to predict, reconstruct, or align with future observations from another viewpoint, the learned codes prioritize agent-centric dynamics over viewpoint-specific features. This reduces sensitivity to nuisance variables such as camera angle, occlusion, and background.

Formally, let $o_t = f(I_t) \in \mathbb{R}^d$ be a visual feature embedding (e.g., from a frozen DINOv2 model), and let $E_\theta$ denote a spatio-temporal encoder mapping paired features $(o_t, o_{t+H})$ to an intermediate embedding $e_t$. A (vector-quantized) discrete codebook is used to obtain $z_t = \text{Quantize}(e_t) \in \{1,\ldots,K\}^L \times \mathbb{R}^{d_\text{VQ}}$, where $L$ is the code length, $K$ is the dictionary size, and $d_\text{VQ}$ is the code dimension. In MVP-LAM, the overall pipeline encourages $z_t$ to be maximally informative (in the sense of the mutual information $I(z_t; A_t)$) about $A_t$ across multiple views, while remaining minimally contingent on any particular viewpoint (Lee et al., 3 Feb 2026).
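The informativeness criterion can be made concrete with a small numerical sketch. The code below uses a plug-in (histogram) estimator of mutual information between discrete codes and discretized actions. This is a simpler stand-in chosen for illustration, not one of the estimators used in the cited work, and the function name `plugin_mi_bits` is hypothetical.

```python
import numpy as np

def plugin_mi_bits(codes, actions):
    """Plug-in estimate of I(codes; actions) in bits for integer labels.

    Assumption: both inputs are already discrete (e.g. code indices and
    binned actions). Biased for small samples; illustrative only.
    """
    codes = np.asarray(codes).ravel()
    actions = np.asarray(actions).ravel()
    # Re-index labels to 0..n-1 so they can address a count table.
    _, ci = np.unique(codes, return_inverse=True)
    _, ai = np.unique(actions, return_inverse=True)
    joint = np.zeros((ci.max() + 1, ai.max() + 1))
    np.add.at(joint, (ci, ai), 1.0)          # co-occurrence counts
    joint /= joint.sum()                      # empirical joint distribution
    # Product of the two marginals, broadcast to the joint's shape.
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))

# Codes that determine a binary action carry 1 bit; unrelated codes carry 0.
mi_informative = plugin_mi_bits([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = plugin_mi_bits([0, 0, 1, 1], [0, 1, 0, 1])
```

A latent that tracks the action scores high under this measure, while one that tracks only nuisance factors such as camera angle scores near zero.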

2. Core Methodological Components

MVP-LAM methods typically comprise the following interacting components:

Multi-view visual feature encoding: Frame features are extracted per camera via a frozen backbone (e.g., DINOv2, CNNs).
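Spatio-temporal encoding: For each transition $(o_t, o_{t+H})$, a Transformer-based encoder generates a transition embedding $e_t$.
Latent code quantization: A VQ-VAE style quantizer maps $e_t$ into discrete codes, imposing quantization and commitment losses:

$\mathcal{L}_\text{quant} = \lVert \mathrm{sg}[e_t] - z_t \rVert_2^2, \qquad \mathcal{L}_\text{commit} = \beta \, \lVert e_t - \mathrm{sg}[z_t] \rVert_2^2$

with $\mathrm{sg}[\cdot]$ the stop-gradient operator and $\beta$ the commitment weight (Lee et al., 3 Feb 2026).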

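Cross-viewpoint reconstruction: Given features $(o^\nu_t, o^{\nu'}_t)$ from two views, the model is trained such that the latent action inferred from view $\nu$ can reconstruct the future feature in view $\nu'$ using a shared decoder $D_\phi$:

$\mathcal{L}_\text{self} = \sum_{\nu} \lVert D_\phi(o^\nu_t, z^\nu_t) - o^\nu_{t+H} \rVert_2^2$

$\mathcal{L}_\text{cross} = \sum_{\nu \neq \nu'} \lVert D_\phi(o^{\nu'}_t, z^{\nu}_t) - o^{\nu'}_{t+H} \rVert_2^2$

The full objective is:

$\mathcal{L} = \mathcal{L}_\text{self} + \mathcal{L}_\text{cross} + \mathcal{L}_\text{quant} + \mathcal{L}_\text{commit}$

where the last two terms are the quantization and commitment losses.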

Mutual information estimation: The informativeness and action-centricity of the learned latents is measured using estimators such as KSG, MINE, and the Barber-Agakov bound.
Action Manifold Learning (AML) module (Xiao et al., 12 May 2026): An alternative implementation predicts actions directly on a low-dimensional manifold, leveraging multi-view diffusion-generated latent priors and a geometry-guided gated transformer for 3D-aware fusion.
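The quantization, commitment, and cross-viewpoint reconstruction terms listed above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "encoder" is just the feature difference, the shared "decoder" adds the latent back to the current feature, the terms are summed with unit weights, and `beta=0.25` is the value from the original VQ-VAE paper rather than a value confirmed for MVP-LAM.

```python
import numpy as np

def vq_losses(e, z_q, beta=0.25):
    """Forward values of the quantization and commitment terms.

    Both equal the squared distance between encoder output e and its
    quantized code z_q. They differ only in gradient flow: the quantization
    term updates the codebook (e held fixed), the commitment term updates
    the encoder (z_q held fixed). beta is an assumed weight.
    """
    sq = float(np.mean(np.sum((e - z_q) ** 2, axis=-1)))
    return sq, beta * sq

def encode(o_t, o_next):
    # Toy stand-in for encoder plus quantizer: latent = feature change.
    return o_next - o_t

def decode(o_t, z):
    # Toy shared decoder: same function for every view, no camera-pose input.
    return o_t + z

def recon_losses(o_t, o_next):
    """Self- and cross-view reconstruction errors.

    o_t, o_next: dicts mapping a view name to a feature vector of shape (d,).
    """
    z = {v: encode(o_t[v], o_next[v]) for v in o_t}
    self_loss, cross_loss = 0.0, 0.0
    for src in o_t:          # view that supplies the latent action
        for tgt in o_t:      # view whose future feature is predicted
            err = float(np.sum((decode(o_t[tgt], z[src]) - o_next[tgt]) ** 2))
            if src == tgt:
                self_loss += err
            else:
                cross_loss += err
    return self_loss, cross_loss

# Two views of one transition; view "b" sees a larger feature change.
o_t = {"a": np.array([1.0, 0.0, 0.0]), "b": np.array([5.0, 5.0, 5.0])}
o_next = {"a": np.array([1.0, 1.0, 0.0]), "b": np.array([5.0, 7.0, 5.0])}
self_loss, cross_loss = recon_losses(o_t, o_next)

quant, commit = vq_losses(np.array([[1.0, 0.0], [0.0, 2.0]]), np.zeros((2, 2)))
total = self_loss + cross_loss + quant + commit
```

In this toy case each view reconstructs itself perfectly, yet the cross-view term is nonzero because the latent absorbed a view-dependent change. That residual is the training signal that pushes latents toward view-invariant content.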

3. Multi-Viewpoint Cross-View Training and Variants

The principal distinguishing feature of MVP-LAM is its cross-viewpoint training strategy. Unlike prior scene-centric multi-view representation learning, MVP-LAM aligns or reconstructs action-relevant transitions across different camera perspectives. This approach includes:

Cross-view action transfer: Training such that a latent action inferred from one camera enables accurate prediction, decoding, or synthesis of future frames from another camera (Lee et al., 3 Feb 2026).
Action-guided contrastive losses: Latents are aligned across views experiencing the same ground-truth actions using weighted InfoNCE (Jeong et al., 6 Jan 2026).
Geometry-guided fusion: Instead of only transferring visual transitions, geometry modules leverage synthesized multi-view latent priors (via diffusion), facilitating depth disambiguation and occlusion robustness (Xiao et al., 12 May 2026).
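The action-guided contrastive variant can be sketched as a soft-target InfoNCE loss. The weighting scheme below is an assumption, not taken from the cited work: each anchor's target distribution over candidates is a softmax of negative squared action distance, so pairs with similar ground-truth actions are treated as partial positives. The names `weighted_info_nce`, `tau`, and `tau_a` are hypothetical.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def weighted_info_nce(z_a, z_b, actions, tau=0.1, tau_a=1.0):
    """Soft-target InfoNCE between latents from two views.

    z_a, z_b: (N, d) latents of the same N transitions seen from two cameras.
    actions:  (N, k) ground-truth actions used only to build target weights.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                      # cosine similarity / temperature
    d2 = ((actions[:, None, :] - actions[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(log_softmax(-d2 / tau_a, axis=1))    # rows sum to 1
    # Cross-entropy between action-derived targets and latent similarities.
    return float(-(w * log_softmax(logits, axis=1)).sum(axis=1).mean())

actions = 3.0 * np.eye(4)     # four clearly distinct actions
z_view_a = np.eye(4)
loss_aligned = weighted_info_nce(z_view_a, np.eye(4), actions)
loss_shuffled = weighted_info_nce(z_view_a, np.roll(np.eye(4), 1, axis=0), actions)
```

When latents from the two views agree for the same action the loss is near zero, and it grows sharply when the pairing is shuffled.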

The table summarizes methodological variants:

Method	Latent Type	Cross-View Supervision	Notable Components
MVP-LAM (Lee et al., 3 Feb 2026)	Discrete	Cross-viewpoint feature reconstruction	VQ-VAE, joint decoder
VILA (Jeong et al., 6 Jan 2026)	Continuous	Action-guided latent alignment	IDM/FDM, InfoNCE, structural alignment
MVP-LAM/AML (Xiao et al., 12 May 2026)	Manifold	Geometry-guided fusion, AML	G³T, VAE-DiT, AML loss

4. Network Architectures and Implementation

MVP-LAM instantiations combine robust vision backbones with spatio-temporal sequence modeling:

Visual encoders: DINOv2 for frame embedding (output dimension 768), CNN-MLP for alternative implementations.
Temporal encoders: 12-block Transformers process concatenated spatio-temporal patches to produce embedding $e_t$.
Latent codebooks: Discrete VQ-VAE with $K$ entries, $L$ tokens per transition, and code dimension $d_\text{VQ}$.
Shared decoders: Decoders operate autoregressively in feature space, without explicit camera-pose conditioning, enforcing invariance.
AML/G³T backbone: Geometry-guided Gated Transformers (G³T) align and fuse monocular and multi-view latent tokens, gate occlusion noise, and refine 3D geometry consistency; action manifold decoders (DiT-style) directly sample action chunks on low-dimensional manifolds (Xiao et al., 12 May 2026).
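The codebook lookup described in the list above can be sketched as a nearest-neighbour search. The sizes used here (`K=16`, `L=4`, `d_vq=8`) are hypothetical placeholders, not values from the cited work, and indices are 0-based rather than 1-based.

```python
import numpy as np

def quantize(e, codebook):
    """Map each of the L token embeddings in e (L, d) to its nearest code.

    Returns (indices of shape (L,), quantized vectors of shape (L, d)).
    """
    # Squared Euclidean distance from every token to every code: (L, K).
    d2 = ((e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

def capacity_bits(K, L):
    """Upper bound on information in L tokens drawn from K entries."""
    return L * np.log2(K)

rng = np.random.default_rng(0)
K, L, d_vq = 16, 4, 8                       # hypothetical sizes
codebook = rng.normal(size=(K, d_vq))
# Encoder outputs lying close to codes 3, 3, 7 and 0.
e_t = codebook[[3, 3, 7, 0]] + 0.01 * rng.normal(size=(L, d_vq))
idx, z_t = quantize(e_t, codebook)
bits = capacity_bits(K, L)
```

The capacity bound makes the compactness of a discrete latent action explicit: a short sequence of small-vocabulary tokens can carry only a few bits, which limits how much viewpoint detail the code can store.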

All components are shared across camera views, with no explicit pose signal provided during token generation or decoding. In practice, training integrates time-synchronized multi-view robot/human sequences, large-scale frozen backbone features, and batch-wise optimizer updates (AdamW, LR=1e-4, weight decay=1e-2).
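For readers unfamiliar with the optimizer named above, a single AdamW update can be written out by hand. This is a minimal sketch using the stated learning rate and weight decay. The moment coefficients `b1`, `b2` and `eps` are common defaults assumed here, not values given in the source.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-4, weight_decay=1e-2,
               b1=0.9, b2=0.999, eps=1e-8):
    """One AdamW update for parameters p with gradient g at step t (t >= 1)."""
    p = p * (1.0 - lr * weight_decay)        # decoupled weight decay
    m = b1 * m + (1.0 - b1) * g              # first-moment estimate
    v = b2 * v + (1.0 - b2) * g * g          # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)              # bias correction
    v_hat = v / (1.0 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

p0 = np.array([1.0])
p1, m1, v1 = adamw_step(p0, g=np.array([1.0]),
                        m=np.zeros(1), v=np.zeros(1), t=1)
```

The decay is applied directly to the weights rather than added to the gradient, which is what distinguishes AdamW from Adam with L2 regularization.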

5. Evaluation Metrics and Empirical Results

MVP-LAM's effectiveness is assessed by measuring the information carried by latents about ground-truth actions, as well as downstream policy performance and robustness. Key metrics and results include:

Mutual Information ($I(z_t; A_t)$):
- KSG estimator: MVP-LAM achieves 1.10 bits on Bridge V2 versus 0.67 (UniVLA), 0.50 (LAPA), and 0.46 (Moto), with all methods having 14 bits (Lee et al., 3 Feb 2026).
- BA and MINE estimators concur in ranking.
Linear Probing and OOD Generalization:
- Linear probe NMSE on held-out tasks and out-of-distribution (OOD) suites (LIBERO-Long, SIMPLER) is minimized by MVP-LAM, indicating strong action prediction fidelity.
Downstream VLA Pretraining and Policy Performance:
- Pretraining a large VLM (e.g., Prismatic-7B) with MVP-LAM pseudo-labels yields superior manipulation success (SIMPLER: 60.4% vs 39.6% [UniVLA]; LIBERO-Long: 90.8% vs 79.4% [UniVLA]) (Lee et al., 3 Feb 2026).
- On LIBERO-Plus (perturbed), MVP-LAM maintains 85.7% average success with only 12.9% degradation, outperforming alternative approaches by 7–16% (Xiao et al., 12 May 2026).
- Real-robot evaluation demonstrates high task completion compared to OpenVLA-OFT and other baselines (Xiao et al., 12 May 2026).
Ablation and Robustness: Absence of cross-view losses or human data sharply lowers MI and task performance, confirming the necessity of multi-view, action-centric objectives. Zero-shot perturbations by novel view synthesis minimally degrade MVP-LAM latent consistency (measured by DINOv2-MSE), whereas LAPA/Moto exhibit larger drops.
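The linear-probing metric mentioned above can be sketched as follows. The normalization is an assumption: NMSE is taken here as test mean squared error divided by the mean per-dimension variance of the test actions, so a probe that only predicts the mean scores about 1. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def linear_probe_nmse(z_train, a_train, z_test, a_test):
    """Fit a least-squares linear map from latents to actions, report NMSE."""
    # Append a constant column so the probe has a bias term.
    X_train = np.hstack([z_train, np.ones((len(z_train), 1))])
    X_test = np.hstack([z_test, np.ones((len(z_test), 1))])
    W, *_ = np.linalg.lstsq(X_train, a_train, rcond=None)
    mse = np.mean((X_test @ W - a_test) ** 2)
    return float(mse / np.mean(np.var(a_test, axis=0)))

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 4))                 # synthetic latents
a = z @ rng.normal(size=(4, 2)) + 0.5         # actions linear in the latents
nmse_informative = linear_probe_nmse(z[:100], a[:100], z[100:], a[100:])

z_noise = rng.normal(size=(200, 4))           # latents unrelated to the actions
nmse_uninformative = linear_probe_nmse(z_noise[:100], a[:100],
                                       z_noise[100:], a[100:])
```

A low NMSE indicates that action information is linearly decodable from the latent, which is the property the probing experiments are designed to test.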

View-Invariant Latent Action (VILA) (Jeong et al., 6 Jan 2026) is a notable MVP-LAM instance. In VILA:

Latent actions are extracted from view-specific embeddings using an inverse dynamics model (IDM).
View-invariance is enforced by aligning latents according to ground-truth action similarity, via weighted InfoNCE and structural alignment.
Latent policies are trained to predict future action latents from the current frame alone, decoupling perception and control.
Experimental results show state-of-the-art performance on unseen-view and unseen-task generalization, with 75%–95% relative performance retention in both simulation and real-robot settings.

MVP-LAM may be further extended via probabilistic filtering on explicit latent states (with transition/proposal inference), conditioning on camera pose, or broadening alignment to include other nuisance factors (illumination, object appearance).

7. Significance and Future Directions

MVP-LAM provides a principled architecture for extracting robot-usable, compact, action-centric representations from diverse multi-view data, in particular by leveraging unlabelled human videos to generalize beyond robot-embodiment datasets. The cross-viewpoint mechanism specifically equips downstream VLA models with superior viewpoint, occlusion, and embodiment transferability. The general MVP-LAM paradigm can be instantiated via cross-view feature reconstruction (Lee et al., 3 Feb 2026), action-guided alignment (Jeong et al., 6 Jan 2026), or geometry-guided action manifold learning (Xiao et al., 12 May 2026), offering flexibility as well as empirical superiority over classical scene-centric or single-view learning approaches.

A plausible implication is that as multi-view capture and view synthesis technologies mature, the MVP-LAM framework will continue to improve robotic policy generalization and facilitate data-efficient real-world adaptation. Additionally, extensions incorporating explicit, probabilistic latent state transition models or joint end-to-end policy learning are suggested as promising research avenues (Jeong et al., 6 Jan 2026).