Cross-Modal Visuo-Tactile Object Perception

Published 2 Apr 2026 in cs.RO and cs.LG | (2604.02108v1)

Abstract: Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical properties estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust and physically consistent cross-modal integration for robotic multi-sensory perception.

Authors (9)

Summary

The paper introduces a cross-modal latent filtering approach that integrates vision and tactile data for unsupervised, robust object property inference.
It employs dynamic Bayesian filtering and structured latent spaces to improve inference speed and accuracy compared to static fusion methods.
Experimental results with dual-arm robots and synthetic objects demonstrate resilience to noise and effective adaptive cross-modal integration.

Motivation and Biological Foundations

Robust perception in robotics, especially during contact-rich manipulation with non-rigid objects, necessitates the estimation of latent intrinsic (e.g., mass, stiffness, friction) and extrinsic (e.g., shape, size, visual texture) physical properties. Human sensorimotor systems integrate vision and touch through dynamic Bayesian inference, exploiting statistical regularities and context-sensitive cross-modal priors. Existing robotic frameworks predominantly favor static fusion or unidirectional transfer, lacking dynamic uncertainty modeling and causal latent state-space structure. The paper proposes a Cross-Modal Latent Filter (CMLF) inspired by human multi-sensory processing, designed for unsupervised learning of structured latent spaces from raw visuo-tactile streams.

Figure 1: The CMLF framework leverages cross-modal Bayesian integration, enabling bidirectional priors across vision and touch for unsupervised object property estimation.

Framework Architecture: CMLF Design

CMLF employs a deep state-space model that partitions latent variables into directly and indirectly observable factors, following the control-theoretic notion of observability. Vision and touch each have their own structured latent space; the two are coupled through shared dynamics models, which is what allows cross-modal priors to flow in both directions. Visual representations encode extrinsic properties, while tactile representations capture intrinsic attributes that are inaccessible to vision.

Figure 2: CMLF architecture with bidirectional cross-modal connectivity; cross-modal priors are selectively exchanged for robust causal inference.

Sequential Bayesian filtering, enhanced by hierarchical object-centric priors, supports continuous updating of beliefs as new sensor data arrive. The modalities remain autonomous, but prior information from one can dynamically constrain inference in the other via uncertainty-weighted coupling. Cross-modal connections are activated only after the unimodal latent structures have stabilized, mirroring developmental neurobiological principles.
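
The uncertainty-weighted coupling and delayed activation described above can be illustrated with a minimal sketch. It assumes scalar Gaussian beliefs over a single property that stays roughly constant over time. The names `Gaussian`, `fuse`, `filter_step`, and `gate_open`, the plateau test, and all numeric values are hypothetical and not taken from the paper, whose model is a learned deep state-space model rather than this closed-form filter.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    """1-D Gaussian belief over one latent property (e.g. stiffness)."""
    mean: float
    var: float


def fuse(a: Gaussian, b: Gaussian) -> Gaussian:
    """Precision-weighted product of two Gaussian beliefs.

    The more certain belief (smaller variance) pulls the result harder.
    """
    precision = 1.0 / a.var + 1.0 / b.var
    mean = (a.mean / a.var + b.mean / b.var) / precision
    return Gaussian(mean, 1.0 / precision)


def filter_step(belief: Gaussian, obs: float, obs_var: float,
                process_var: float = 0.01) -> Gaussian:
    """One predict/update cycle of a scalar Bayesian filter."""
    # Predict: the property is assumed static, so only uncertainty grows.
    predicted = Gaussian(belief.mean, belief.var + process_var)
    # Update: weigh the new observation by its noise variance.
    return fuse(predicted, Gaussian(obs, obs_var))


def gate_open(losses, window: int = 3, tol: float = 1e-2) -> bool:
    """Hypothetical stabilization test: open the cross-modal gate once the
    unimodal loss has varied by less than `tol` over the last
    `window + 1` recorded values."""
    if len(losses) < window + 1:
        return False
    recent = losses[-(window + 1):]
    return max(recent) - min(recent) < tol


# Assumed numbers for illustration only.
unimodal_losses = [0.90, 0.41, 0.203, 0.201, 0.200, 0.200]
vision_prior = Gaussian(mean=2.0, var=0.5)   # prior on a tactile property from vision
alone = Gaussian(mean=0.0, var=10.0)         # touch-only belief, initially vague
coupled = Gaussian(mean=0.0, var=10.0)

if gate_open(unimodal_losses):
    # Inject the cross-modal prior once, to avoid counting it repeatedly.
    coupled = fuse(coupled, vision_prior)

for z in [2.4, 1.9, 2.2]:                    # tactile observations
    alone = filter_step(alone, z, obs_var=1.0)
    coupled = filter_step(coupled, z, obs_var=1.0)

print(f"touch only: mean={alone.mean:.3f} var={alone.var:.3f}")
print(f"with prior: mean={coupled.mean:.3f} var={coupled.var:.3f}")
```

In this toy setting the coupled belief starts from a tighter prior, so its variance stays below the touch-only variance at every step. The same mechanism would mislead the filter if the visual prior were wrong, which is the situation the surprise-object experiments probe.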

Experimental Setup and Dataset

A dual-arm robotic platform, instrumented with stereo vision and tactile sensors, interacts with systematically designed synthetic non-rigid objects. Intrinsic and extrinsic properties are controllable and associated through explicit causal correlations. The dataset comprises visuo-tactile interaction streams for 75 objects and multiple interaction primitives (palpation, grasp, lift, rotate), capturing both object geometry and contact dynamics.

Figure 3: Experimental pipeline for dataset generation and visuo-tactile exploration of configurable objects.

Inference Efficiency and Latent Space Analysis

CMLF was benchmarked against sequential variational autoencoder baselines and ablations. The discriminative structure of the latent space was assessed with two probes: logistic regression for classification, and kernel ridge regression (KRR) for measuring alignment with ground-truth properties.
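
As a rough illustration of the KRR probe, the sketch below fits kernel ridge regression from latent codes to a ground-truth property and reports held-out error. The RBF kernel, the hyperparameters `lam` and `gamma`, the helper names, and the toy data are all assumptions; the source does not specify them. The solver is written in plain Python to keep the example self-contained; in practice a library implementation would be the natural choice.

```python
import math


def rbf(x, y, gamma=5.0):
    """RBF kernel between two latent vectors (kernel choice is an assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def krr_fit(X, y, lam=1e-3, gamma=5.0):
    """Return dual weights alpha = (K + lam * I)^-1 y."""
    n = len(X)
    K = [[rbf(X[i], X[j], gamma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)


def krr_predict(X_train, alpha, x, gamma=5.0):
    """Predict the property for one latent vector x."""
    return sum(a * rbf(xi, x, gamma) for a, xi in zip(alpha, X_train))


# Toy stand-in for latent codes: one latent dimension that maps
# nonlinearly (here, quadratically) onto a physical property.
latents = [[i / 20.0] for i in range(21)]
prop = [z[0] ** 2 for z in latents]

X_train, y_train = latents[0::2], prop[0::2]   # even indices for fitting
X_test, y_test = latents[1::2], prop[1::2]     # odd indices held out

alpha = krr_fit(X_train, y_train)
preds = [krr_predict(X_train, alpha, x) for x in X_test]
rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, y_test)) / len(y_test))
print(f"held-out RMSE: {rmse:.4f}")
```

A low held-out error from such a probe indicates that the property is recoverable from the latent code by a smooth function, which is the sense in which the latent space is said to align with ground truth.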

Figure 4: Classification and regression metrics reveal CMLF's superior inference efficiency for intrinsic and extrinsic properties; cross-modal priors accelerate convergence and reduce error.

Results demonstrate significant improvement in inference speed and accuracy for intrinsic properties when cross-modal priors from vision are available. Extrinsic inference is dominated by vision early, but tactile priors enhance robustness under ambiguity. The hierarchical latent space structure affords interpretable alignment of latent variables with physical object parameters.

Robustness to Noise and Corruption

Perturbation studies with additive Gaussian noise and random observation dropout show that CMLF is resilient to sensory degradation, in line with inverse effectiveness in multisensory biology. Cross-modal pathways act as latent backups: tactile priors improve extrinsic inference under visual corruption, and vice versa.
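
The two perturbation types named above can be sketched as a single corruption function. This is a minimal sketch under stated assumptions: the function name `corrupt`, the representation of a stream as a list of per-timestep feature lists, and the use of `None` to mark a dropped observation are all hypothetical, and the noise levels and dropout rates used in the paper are not given in this summary.

```python
import random


def corrupt(stream, noise_std=0.0, dropout_p=0.0, seed=0):
    """Return a corrupted copy of `stream`.

    stream     : list of per-timestep feature lists (floats)
    noise_std  : standard deviation of additive zero-mean Gaussian noise
    dropout_p  : probability that a whole timestep is dropped
    Dropped timesteps are returned as None so that a downstream filter
    can skip its measurement update and rely on its prediction instead.
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    corrupted = []
    for frame in stream:
        if rng.random() < dropout_p:
            corrupted.append(None)
        else:
            corrupted.append([v + rng.gauss(0.0, noise_std) for v in frame])
    return corrupted


# Hypothetical tactile stream: 5 timesteps, 3 features each.
clean = [[0.1 * t, 0.2 * t, 0.3 * t] for t in range(5)]
noisy = corrupt(clean, noise_std=0.05, dropout_p=0.3, seed=42)
for t, frame in enumerate(noisy):
    print(t, "dropped" if frame is None else [round(v, 3) for v in frame])
```

Applying the same function independently to the visual and tactile streams makes it possible to degrade one modality while leaving the other intact, which is the condition under which cross-modal priors are reported to help most.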

Figure 5: Visualization of different noise and corruption levels applied to visuo-tactile streams for robustness evaluation.

Quantitative metrics confirm that cross-modal coupling yields consistently lower prediction error across all property dimensions under noisy and incomplete input conditions compared to unimodal or joint-space baselines.

By delaying activation of cross-modal priors until unimodal latent stabilization, the model achieves improved generalization and faster convergence, mirroring developmental alignment processes in associative cortex. When exposed to "surprise" objects with inverted extrinsic-intrinsic correlations, CMLF exhibits Bayesian-like perceptual bias analogous to human multi-sensory illusions (e.g., size-weight).

Figure 6: CMLF's perceptual behavior under surprise objects demonstrates biological-like Bayesian bias and gradual correction; delayed cross-modal activation improves learning.

The model's sequential Bayesian process gradually updates biased priors toward ground truth as more evidence accrues. Compared with humans, however, CMLF adapts less rapidly online when cross-modal illusions occur, which highlights a key limitation and a direction for neuro-inspired algorithmic advances.
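
The gradual correction of a biased prior can be made concrete with the standard conjugate Gaussian case. This is a simplified scalar illustration, not the paper's learned model: it assumes a static property $\theta$, a cross-modal prior with mean $\mu_0$ and variance $\sigma_0^2$, and $n$ independent tactile observations $z_i$ with noise variance $\sigma^2$. All symbols are introduced here for illustration.

```latex
\begin{align}
\theta &\sim \mathcal{N}(\mu_0, \sigma_0^2), \qquad
z_i \mid \theta \sim \mathcal{N}(\theta, \sigma^2), \quad i = 1, \dots, n, \\
\mu_n &= \frac{\sigma^2 \mu_0 + n \sigma_0^2 \bar{z}}{\sigma^2 + n \sigma_0^2},
\qquad \bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i, \\
\mu_n - \bar{z} &= \frac{\sigma^2}{\sigma^2 + n \sigma_0^2} \, (\mu_0 - \bar{z}).
\end{align}
```

The last line shows that the residual bias toward the prior shrinks by the factor $1 / (1 + n \sigma_0^2 / \sigma^2)$. A confident but wrong prior (small $\sigma_0^2$) therefore needs many observations to be overridden, which matches the qualitative picture of an illusion-like bias that is corrected only gradually.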

Practical and Theoretical Implications

CMLF offers a principled pathway for uncertainty-aware, context-sensitive perceptual inference in contact-rich robotics. The structured latent partitioning and probabilistic cross-modal coupling improve both efficiency and robustness, providing actionable priors for anticipatory control, haptic exploration, and manipulation policy learning. The parallels to cortical associative motifs suggest that computational principles from neuroscience can significantly inform robotic perceptual architecture. Notably, the framework highlights the need for adaptive gating mechanisms to prevent maladaptive transfer when cross-modal correlations break down, and for causal reasoning to enable selective integration only when a common cause exists.

Conclusion

The cross-modal latent filtering approach advances unsupervised inference of object properties by formalizing causal, object-centric latent spaces and Bayesian cross-modal integration. Empirical results validate improved efficiency, robustness, and biological plausibility compared to static fusion baselines. The developmental and perceptual parallels invite further exploration of online adaptive priors and causal gating, potentially bridging the gap between machine perception and human sensorimotor intelligence.
