
Multi-Task Visuo-Tactile World Models

Updated 3 July 2026
  • Multi-Task VT-WMs integrate high-frequency tactile and global visual inputs to create predictive representations for contact-rich robotic manipulation.
  • They use modality-specific encoders with late fusion and GRU or transformer-based dynamics models to maintain spatial and temporal consistency.
  • Empirical evidence shows VT-WMs achieving higher planning success, improved physical fidelity, and better generalization than vision-only counterparts.

Multi-Task Visuo-Tactile World Models (VT-WM) are a class of deep sequence models that integrate visual and tactile sensory modalities to learn predictive representations of contact-rich dynamics in robotic manipulation. By capturing spatial and temporal structure across multiple tasks, and fusing high-frequency touch with global vision, they enable physically faithful forward imagination and robust planning in environments where visual observations alone are insufficient. VT-WMs have been shown empirically to outperform vision-only counterparts, substantially improving metrics such as physical fidelity in imagination, planning success rates, and generalization across diverse manipulation tasks and sensor modalities (Zhang et al., 11 Jun 2026, Higuera et al., 5 Feb 2026, Zheng et al., 19 Mar 2026).

1. Sensory Fusion: Representational Principles and Architectures

Multi-task VT-WMs employ encoders for both visual and tactile data streams. At each timestep $t$, a visual observation $x^v_t$ (e.g., RGB image, depth, or point cloud) is mapped through a parameterized encoder $\phi_v$ to a latent $z^v_t \in \mathbb{R}^d$, while a tactile observation $x^t_t$ (e.g., tactile RGB, depth, or force field) is mapped via $\phi_t$ to $z^t_t \in \mathbb{R}^d$ (Zhang et al., 11 Jun 2026). Architectures include:

Fusion strategies predominantly use late concatenation of $z^v_t$ and $z^t_t$ into $z_t \in \mathbb{R}^{2d}$, followed by a unified GRU or transformer-based latent dynamics predictor. Early fusion variants such as FiLM, gating, and cross-attention were evaluated but did not outperform late concatenation, particularly when ensuring modality compatibility (e.g., TacFF with point clouds) (Zhang et al., 11 Jun 2026).
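The late-fusion pattern described above can be illustrated with a small, dependency-free sketch. It is not the architecture from any of the cited papers: the class name `LateFusionGRU`, the random initialization, the linear readout, and the way the action is appended to the fused latent are all assumptions made for illustration. The only parts taken from the text are the concatenation of two $d$-dimensional latents into a $2d$-dimensional vector and the use of a GRU-style recurrent update.

```python
import math
import random


def init_linear(rng, n_out, n_in, scale=0.1):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Apply a dense layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class LateFusionGRU:
    """Hypothetical toy model: concatenate z^v_t and z^t_t, then one GRU step."""

    def __init__(self, d, action_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        in_dim = 2 * d + action_dim + hidden_dim  # [z_t, a_t, h_{t-1}]
        self.update = init_linear(rng, hidden_dim, in_dim)
        self.reset = init_linear(rng, hidden_dim, in_dim)
        self.cand = init_linear(rng, hidden_dim, in_dim)
        self.readout = init_linear(rng, 2 * d, hidden_dim)

    def step(self, z_v, z_t, action, h):
        z = z_v + z_t  # late fusion: list concatenation, length 2d
        x = z + action  # assumed form of action conditioning
        u = [sigmoid(v) for v in linear(*self.update, x + h)]  # update gate
        r = [sigmoid(v) for v in linear(*self.reset, x + h)]  # reset gate
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(v) for v in linear(*self.cand, x + gated_h)]  # candidate state
        h_next = [(1 - ui) * hi + ui * ci for ui, hi, ci in zip(u, h, c)]
        z_next_pred = linear(*self.readout, h_next)  # predicted fused latent
        return z_next_pred, h_next


model = LateFusionGRU(d=4, action_dim=2, hidden_dim=8)
z_pred, h = model.step([0.1] * 4, [0.2] * 4, [0.0, 1.0], [0.0] * 8)
```

Because fusion happens only at the concatenation step, either encoder can be swapped (for example, an image encoder for a point-cloud encoder) without touching the dynamics module, which may be one reason this design is a convenient baseline.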

2. Multi-Task Data Regimes and Training Methodologies

Multi-task VT-WMs are trained end-to-end over large, heterogeneous robotic datasets, drawing minibatches uniformly from a pool of contact-rich manipulation demonstrations spanning many object types and task structures (Zheng et al., 19 Mar 2026). Key examples include:

Dataset                            | Tasks | Trajectories     | Sensors
ContactWorld                       | 12    | ≈200/demo, 7-24k | Wrist, front, point cloud; TacRGB, TacDepth, FF
OmniViTac/OmniVTA                  | 86    | 21,879           | Wrist, 3rd-person RGB-D; 4 tactile sensor types
VT-WM (Higuera et al., 5 Feb 2026) | 8     | 124              | Wrist RGB, tactile images (vision-based)

Uniform sampling across tasks and timesteps during stochastic gradient descent encourages shared latent dynamics and perceptual representations across insertion, disassembly, screwing, exploratory interaction, cutting, adjustment, and assembly, as well as generalization to unseen objects (Zhang et al., 11 Jun 2026, Zheng et al., 19 Mar 2026).
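One way to read "uniform sampling across tasks and timesteps" is sketched below: pick a task uniformly, then a trajectory within that task, then a start timestep. This is an assumed interpretation, not the cited papers' data loader; sampling uniformly over the pooled demonstrations instead would weight tasks by how many demonstrations they contain. The function name `sample_minibatch`, the `window` parameter, and the toy dataset are hypothetical.

```python
import random


def sample_minibatch(dataset, batch_size, window, rng):
    """Sample fixed-length windows, uniform over tasks, trajectories, and start steps.

    dataset maps a task name to a list of trajectories (each a list of steps).
    """
    # Keep only trajectories long enough to hold one full window.
    eligible = {
        task: [traj for traj in trajs if len(traj) >= window]
        for task, trajs in dataset.items()
    }
    eligible = {task: trajs for task, trajs in eligible.items() if trajs}
    if not eligible:
        raise ValueError("no trajectory is long enough for the requested window")

    tasks = sorted(eligible)
    batch = []
    for _ in range(batch_size):
        task = rng.choice(tasks)  # uniform over tasks
        traj = rng.choice(eligible[task])  # uniform over that task's trajectories
        start = rng.randrange(len(traj) - window + 1)  # uniform over start steps
        batch.append((task, start, traj[start:start + window]))
    return batch


# Toy data: integers stand in for (observation, action) steps.
toy_data = {
    "insertion": [list(range(10)), list(range(6))],
    "cutting": [list(range(8))],
}
batch = sample_minibatch(toy_data, batch_size=6, window=3, rng=random.Random(0))
```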

3. Latent Dynamics, Losses, and Regularization

All VT-WM variants use action-conditioned latent dynamics models for temporal prediction. The dynamics module is typically a GRU (Zhang et al., 11 Jun 2026), a multi-layer autoregressive transformer (Higuera et al., 5 Feb 2026), or a generative diffusion transformer (Zheng et al., 19 Mar 2026), in each case with explicit treatment of spatial and temporal continuity.

  • Predictive dynamics: the next latent is predicted from the current latent and the action
  • JEPA-style dynamics loss on the predicted latents (Zhang et al., 11 Jun 2026)
  • Latent L1 losses: e.g., separate visual and tactile terms, each summing the L1 distance between predicted and ground-truth latents for that modality (Higuera et al., 5 Feb 2026)
  • Optional reconstruction loss, but most variants omit pixel-level image reconstruction in favor of regularized latent objectives
  • Regularizers: VICReg variance-covariance on visual latents, temporal-similarity, inverse-dynamics, and, for diffusion models, dynamic- and amplitude-aware weighted losses (Zhang et al., 11 Jun 2026, Zheng et al., 19 Mar 2026)
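As a notational aid, the action-conditioned dynamics and the per-modality latent L1 objective listed above could be written as follows. This is a generic sketch built from the symbols defined in Section 1, not the exact formulation of any cited paper: the predictor $f_\theta$, the action $a_t$, the hatted predictions, and the unweighted sum over modalities are all assumptions.

```latex
% Assumed generic form: fused latent, action-conditioned one-step prediction
z_t = \big[\, z^v_t \,;\, z^t_t \,\big] \in \mathbb{R}^{2d},
\qquad
\hat{z}_{t+1} = f_\theta(z_t, a_t)

% Assumed per-modality latent L1 objective (visual term plus tactile term)
\mathcal{L}_{\mathrm{L1}}
  = \sum_{t} \Big( \big\lVert \hat{z}^v_{t+1} - z^v_{t+1} \big\rVert_1
                 + \big\lVert \hat{z}^t_{t+1} - z^t_{t+1} \big\rVert_1 \Big)
```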

Ablation studies demonstrate the necessity of (1) explicit spatial structure (e.g., point cloud and TacFF over image-based latents), (2) temporally smooth embeddings, and (3) careful modality pairing and regularization—e.g., vision-only regularization combined with modality-appropriate fusion (Zhang et al., 11 Jun 2026).

4. Multimodal Compatibility and Representation Analysis

Performance and planning robustness in VT-WM hinge on the compatibility of visual and tactile encodings. Structured point cloud representations, when fused with force-field tactile (TacFF) features that share a grid/geometry structure, exhibit superior cross-modal alignment, yielding average planning success rates up to 36.1% compared to 20.7–22.0% for image-based inputs (Zhang et al., 11 Jun 2026). Conversely, image-like tactile signals (TacRGB/Depth) align better with image views but degrade compatibility with point-cloud models.

Regularization further interacts with modality: vision-only variance-covariance (VICReg) regularization on visual latents and no regularization on tactile latents ("vision-only reg") outperforms joint regularization (Zhang et al., 11 Jun 2026). Early fusion mechanisms (FiLM, cross-attention) did not systematically outperform the default concatenation.
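The "vision-only reg" setting can be sketched by applying the variance and covariance terms of VICReg to a batch of visual latents while leaving the tactile latents untouched. The hinge target `gamma`, the stabilizer `eps`, and the two weights are placeholder values chosen for illustration, and the function names are hypothetical; the source does not state which coefficients were used.

```python
import math


def vicreg_var_cov(latents, gamma=1.0, eps=1e-4):
    """Variance and covariance terms of VICReg for a batch of latent vectors.

    latents is a list of n vectors of length d, with n >= 2.
    """
    n, d = len(latents), len(latents[0])
    means = [sum(row[j] for row in latents) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in latents]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Variance term: hinge that pushes each dimension's std above gamma.
    var_loss = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d
    # Covariance term: penalize squared off-diagonal covariance entries.
    cov_loss = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d
    return var_loss, cov_loss


def vision_only_regularizer(z_visual, z_tactile, var_weight=1.0, cov_weight=1.0):
    """Regularize visual latents only; z_tactile is deliberately ignored."""
    var_loss, cov_loss = vicreg_var_cov(z_visual)
    return var_weight * var_loss + cov_weight * cov_loss


collapsed = [[0.5, 0.5]] * 4  # every sample identical: maximal variance penalty
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]  # decorrelated, high variance
penalty_collapsed = vision_only_regularizer(collapsed, z_tactile=spread)
penalty_spread = vision_only_regularizer(spread, z_tactile=collapsed)
```

In this toy case the collapsed visual batch is penalized and the well-spread batch is not, regardless of what the tactile latents look like.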

5. Planning, Control, and Long-Horizon Robustness

VT-WMs are used for goal-conditioned planning and control, typically employing receding-horizon Model Predictive Control (MPC) with the Cross-Entropy Method (CEM), optimized in latent space (Zhang et al., 11 Jun 2026, Higuera et al., 5 Feb 2026):

  1. Encode current and goal observations to latents.
  2. Sample action sequences and roll out latent transitions over the planning horizon.
  3. Score each sequence by latent goal proximity; update the action distribution via the cross-entropy method.
  4. Execute the first action of the best sequence; replan.
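The four steps above can be sketched as a minimal latent-space CEM planner. Several choices here are assumptions rather than details from the cited work: the cost is the squared distance between the final predicted latent and the goal latent, the sampling distribution is a diagonal Gaussian, and `toy_dynamics` is a stand-in for a learned world model. All function names and default hyperparameters are hypothetical.

```python
import random
import statistics


def rollout(z0, actions, dynamics):
    """Apply the dynamics model along an action sequence; return the final latent."""
    z = z0
    for a in actions:
        z = dynamics(z, a)
    return z


def goal_cost(z, z_goal):
    """Assumed scoring rule: squared Euclidean distance to the goal latent."""
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))


def cem_plan(z0, z_goal, dynamics, horizon, action_dim,
             n_samples=64, n_elite=8, n_iters=5, seed=0):
    rng = random.Random(seed)
    mean = [[0.0] * action_dim for _ in range(horizon)]
    std = [[1.0] * action_dim for _ in range(horizon)]
    best_seq, best_cost = None, float("inf")
    for _ in range(n_iters):
        # Step 2: sample action sequences from the current Gaussian.
        samples = [
            [[rng.gauss(mean[t][k], std[t][k]) for k in range(action_dim)]
             for t in range(horizon)]
            for _ in range(n_samples)
        ]
        # Step 3: score each sequence by latent goal proximity.
        scored = sorted(
            ((goal_cost(rollout(z0, s, dynamics), z_goal), s) for s in samples),
            key=lambda pair: pair[0],
        )
        if scored[0][0] < best_cost:
            best_cost, best_seq = scored[0]
        elites = [s for _, s in scored[:n_elite]]
        # Refit the sampling distribution to the elite sequences.
        for t in range(horizon):
            for k in range(action_dim):
                vals = [e[t][k] for e in elites]
                mean[t][k] = statistics.fmean(vals)
                std[t][k] = statistics.pstdev(vals) + 1e-6
    return best_seq, best_cost


def toy_dynamics(z, a):
    """Stand-in for a learned latent dynamics model: additive update."""
    return [zi + ai for zi, ai in zip(z, a)]


plan, cost = cem_plan([0.0], [3.0], toy_dynamics, horizon=3, action_dim=1)
first_action = plan[0]  # Step 4: execute this action, then replan from the new latent
```

In a receding-horizon loop, `cem_plan` would be called again after each executed action with the newly encoded observation as `z0`.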

OmniVTA augments this with a 60 Hz Reflexive Latent Tactile Controller (RLTC), closing the loop and correcting for prediction/model drift between anticipated and observed tactile signals (Zheng et al., 19 Mar 2026).

Long-horizon performance: All models degrade as the planning horizon increases, but explicit structure and tactile inputs substantially slow performance loss. For example, in ContactWorld, point clouds degrade more gracefully (success 52.1% → 16.0% as the horizon increases from 12 to 48), while wrist images degrade faster (41.4% → 8.2%) (Zhang et al., 11 Jun 2026). The addition of TacFF to point clouds further reduces multistep rollout error growth.
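A quick calculation on the reported figures shows in what sense the point-cloud degradation is more graceful. The retained fraction below is simply long-horizon success divided by short-horizon success; it is computed here from the quoted numbers and is not a metric reported in the source.

```latex
% Fraction of short-horizon success retained at the long horizon
\text{point cloud: } \frac{16.0}{52.1} \approx 0.31,
\qquad
\text{wrist image: } \frac{8.2}{41.4} \approx 0.20

% Absolute drop in percentage points
52.1 - 16.0 = 36.1, \qquad 41.4 - 8.2 = 33.2
```

Point clouds therefore retain roughly 31% of their short-horizon success versus roughly 20% for wrist images, even though their absolute drop in percentage points is slightly larger.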

6. Physical Fidelity, Generalization, and Adaptation

VT-WMs explicitly address failures of vision-only models in maintaining object permanence and respecting contact-induced physical constraints. Physical fidelity is quantified via normalized Fréchet distance between predicted and actual object trajectories, achieving a 33% improvement in object permanence and 29% in causal compliance over vision-only baselines (Higuera et al., 5 Feb 2026). Grounding in tactile signals translates into higher zero-shot success: in real-robot tasks, VT-WM outperforms baselines by 10–35% success (e.g., up to 85% vs 50% in a two-stage reach & push task) (Higuera et al., 5 Feb 2026). OmniVTA demonstrates 80% average task success across six manipulation tasks in highly diverse, object-rich settings; ablative removal of tactile-closed-loop components drops success to 66% (open loop) and lower with vision-only baselines (Zheng et al., 19 Mar 2026).
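To make the trajectory metric concrete, the sketch below computes a discrete Fréchet distance between a predicted and an actual trajectory using the standard dynamic-programming recurrence. The source does not specify how the distance is normalized or whether the continuous or discrete variant is used; dividing by the arc length of the actual trajectory is an assumption made only for this example, and the function names are hypothetical.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two point sequences (dynamic programming)."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Best of advancing along p, along q, or along both.
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[-1][-1]


def path_length(p):
    """Total arc length of a polyline."""
    return sum(math.dist(a, b) for a, b in zip(p, p[1:]))


def normalized_frechet(pred, actual):
    """Assumed normalization: divide by the arc length of the actual trajectory."""
    length = path_length(actual)
    if length == 0:
        raise ValueError("actual trajectory has zero length")
    return discrete_frechet(pred, actual) / length


actual = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]  # same path, offset by one unit
score = normalized_frechet(pred, actual)  # offset of 1 over a path of length 2
```

Lower values indicate that the imagined object trajectory stays closer to the observed one along its whole length, not just at the endpoints.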

Generalization and adaptation: Training on mixtures of tasks and object classes enables VT-WM to generalize to unseen objects, unseen tool variants, and novel manipulation settings (e.g., cutting with a new knife, unseen object heights with 58% success compared to 38% for reactive dynamical policies) (Zheng et al., 19 Mar 2026). Fine-tuning with few demonstrations enables adaptation to novel tasks, yielding 77% zero-shot success after 5k gradient steps on small demonstration pools (Higuera et al., 5 Feb 2026).

7. Key Takeaways, Limitations, and Open Challenges

Key findings for multi-task VT-WM design (Zhang et al., 11 Jun 2026, Higuera et al., 5 Feb 2026, Zheng et al., 19 Mar 2026):

  1. Preserve explicit spatial structure and temporal continuity in both modalities (point cloud + TacFF).
  2. Cross-modal encoding compatibility is more critical than scale or encoder family alone.
  3. Use simple late fusion, regularize vision only, and avoid over-regularizing tactile latents.
  4. Reflexive high-frequency correction and anticipatory tactile reasoning are essential for long-horizon robustness and contact uncertainty.
  5. Multi-task world modeling enables rapid adaptation and transfers contact-dynamics knowledge across previously unseen tasks and domains.

Limitations include reliance on vision-based tactile sensors, open-loop planning costs (except where reflexive correction is incorporated), and limited explicit object-centric decoding. Open areas include scaling to further sensor modalities (e.g., force-torque, audio), hierarchically modeling very long horizons, and real-world continual learning for robustness to sensor wear and dynamic change (Zheng et al., 19 Mar 2026).
