OmniViTac: 360° Visuo-Tactile Sensing

Updated 3 July 2026

OmniViTac is a framework and dataset that fuses 360° visual and tactile perception for contact-rich robotic manipulation.
It integrates biomimetic sensor design with synchronized multimodal data streams to enable precise closed-loop control across diverse tasks.
It pairs learned representations (a TactileVAE encoder and conditional GANs) with a large-scale, annotated collection of manipulation trajectories for benchmarking.

OmniViTac refers to a framework and dataset dedicated to the fusion of omnidirectional visual and tactile sensing for contact-rich robotic manipulation. It covers two things: the design of sensor hardware, drawing on principles established by ViTacTip and related tactile vision modalities, and the curation of a large-scale benchmark dataset for learning and evaluating closed-loop visuo-tactile control policies. OmniViTac targets greater scale, task diversity, and multimodal coverage than prior contact-centric robotics datasets (Zheng et al., 19 Mar 2026; Fan et al., 2024).

1. Mechanical and Sensing Architecture

OmniViTac sensor design synthesizes physical and algorithmic advances aimed at seamless 360° visuo-tactile perception. The proposed architecture is extrapolated from ViTacTip principles, employing a geodesic dome of transparent elastomer (Agilus30 Clear, Shore A ≈ 30, optical transmission >90%), which encloses a polyhedral core. Distributed across the dome's facets are marker-bearing biomimetic tips arrayed in a hexagonal lattice, each with a black disc fiducial for deformation tracking. Embedded behind each facet is a miniature camera module (e.g., ELP USBFHD06H-L180, 1920×1080 px, ~120° FoV) with dedicated LED ring illumination (Fan et al., 2024).

During operation, external lighting and internal LED arrays are alternated to allow for both "see-through-skin" visual imaging and high-contrast tactile imaging. The system supports direct object texture imaging through the skin as well as fine spatial registration of local contact via marker displacement.

Key mechanical parameters:

Skin thickness: 1.0–1.5 mm for robustness/transparency trade-off.
Dome curvature: radius ≈ 15 mm.
Biomimetic tip: 0.5 mm shaft, 3 mm height, 0.8 mm marker, ~2 mm pitch.
Camera arrangement: ≥6 per dome for omnidirectional coverage.
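As a rough sanity check on these parameters, the sketch below estimates how many markers a dome of this size could carry. It assumes a smooth hemisphere rather than a faceted geodesic dome and ignores facet edges and camera cut-outs; the function name is hypothetical and the result is only an order-of-magnitude figure, not a number reported in the source.

```python
import math

def hex_lattice_marker_count(radius_mm: float, pitch_mm: float) -> int:
    """Estimate the marker count on a hemispherical dome (assumed shape)
    tiled at the density of a planar hexagonal lattice."""
    dome_area = 2.0 * math.pi * radius_mm ** 2        # hemisphere surface area, mm^2
    cell_area = (math.sqrt(3) / 2.0) * pitch_mm ** 2  # area per lattice point, mm^2
    return round(dome_area / cell_area)

# Radius and pitch taken from the parameter list above.
count = hex_lattice_marker_count(radius_mm=15.0, pitch_mm=2.0)
print(count)
```

Under these assumptions the estimate is on the order of a few hundred markers, shared among the six or more cameras.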

2. Dataset Composition and Structure

The OmniViTac dataset comprises 21,879 robot- and human-demonstrated manipulation trajectories, spanning 86 distinct tasks and 126 unique objects. The protocol systematically covers six physics-grounded interaction patterns:

Assembly
Cutting
Adjustment (in-hand reorientation)
Peeling
Wiping
Grasping

Each task is instantiated across five semantic scenarios (Kitchen, Fruit Shop, Industrial, Chemistry Lab, Office). Per-pattern trajectory counts range from 1,000 (Assembly, Cutting) to 9,200 (Grasping). Data streams include:

Vision: synchronized RGB-D at 30 Hz
Tactile: 2D marker displacement fields at up to 60 Hz; per-frame H×W×3 arrays
Proprioception/actions: joint states (up to 60 Hz), gripper aperture, 3D end-effector deltas

Timestamp alignment ensures ≤10 ms synchronization error. Data is validated by periodic human review and automatic checks throughout collection (Zheng et al., 19 Mar 2026).
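One way to picture the alignment step is nearest-timestamp matching between the 30 Hz vision stream and the 60 Hz tactile stream, with the 10 ms bound used as an acceptance tolerance. The sketch below is an illustration under that assumption; the function name, the synthetic timestamps, and the 2 ms offset are hypothetical and do not describe the dataset's actual tooling.

```python
import numpy as np

def align_nearest(ref_ts, other_ts, tol_s=0.010):
    """For each reference timestamp, return the index of the nearest entry in
    other_ts (sorted, at least two entries) and a mask of matches within tol_s."""
    ref_ts = np.asarray(ref_ts, dtype=float)
    other_ts = np.asarray(other_ts, dtype=float)
    right = np.searchsorted(other_ts, ref_ts)          # first index with other_ts >= ref
    right = np.clip(right, 1, len(other_ts) - 1)
    left = right - 1
    pick_right = (other_ts[right] - ref_ts) < (ref_ts - other_ts[left])
    idx = np.where(pick_right, right, left)
    err = np.abs(other_ts[idx] - ref_ts)
    return idx, err <= tol_s

# Synthetic one-second example: tactile clock offset by 2 ms (made-up value).
vision_ts = np.arange(30) / 30.0
tactile_ts = np.arange(60) / 60.0 + 0.002
idx, valid = align_nearest(vision_ts, tactile_ts)
```

Frames whose mask entry is False would be dropped or flagged rather than paired with a stale tactile sample.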

3. Sensing Modalities and Self-Supervised Encoding

OmniViTac leverages tightly integrated vision and tactile modalities:

Visual input: wide-field, see-through-skin imaging for distal texture and geometry
Tactile input: marker displacement tracking to recover 3D surface deformation and shear/normal force cues

Self-supervised tactile encoding employs a causal 3D-convolutional variational autoencoder (TactileVAE), which compresses spatiotemporal marker displacements $X\in\mathbb{R}^{T\times H\times W\times3}$ into latent tensors $\mathbf{z}_t$. Reconstruction uses an implicit neural representation, $\mathbf{d}(\mathbf{x}) = \mathcal{D}_\theta(\gamma(\mathbf{x}), \Phi(\mathbf{z}_t, \mathbf{x}))$, where $\mathbf{x}$ is a query location, $\gamma$ is a positional encoding, $\Phi$ conditions the decoder on the latent $\mathbf{z}_t$ at $\mathbf{x}$, and $\mathcal{D}_\theta$ is an MLP. Training minimizes a VAE loss that combines an $\ell_2$ reconstruction term with KL-divergence regularization (Zheng et al., 19 Mar 2026).
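The two ingredients named here, a positional encoding $\gamma$ and a reconstruction-plus-KL objective, can be sketched in a few lines of NumPy. This is a minimal illustration, not the TactileVAE implementation: the Fourier-feature form of $\gamma$, the number of frequencies, the diagonal-Gaussian posterior, and the weight `beta` are all assumptions.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Assumed Fourier-feature gamma(x): sin and cos of 2^k * pi * x for k < num_freqs."""
    x = np.asarray(x, dtype=float)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi      # shape (F,)
    angles = x[..., None] * freqs                      # shape (..., D, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*x.shape[:-1], -1)            # shape (..., D * 2F)

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Mean squared reconstruction error plus beta-weighted KL divergence between
    a diagonal Gaussian N(mu, exp(logvar)) and the standard normal prior."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon + beta * kl

coords = np.zeros((5, 2))                              # five 2D query locations
encoded = positional_encoding(coords)                  # shape (5, 16)
```

A perfect reconstruction with a posterior equal to the prior gives zero loss, which is a quick check that both terms are wired correctly.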

A GAN-based approach (conditional Pix2Pix) is used for disentangling the vision and tactile streams and for switching between the two modalities. Generators (U-Net) and discriminators (PatchGAN) are trained with adversarial and $L_1$ losses, producing marker-free or marker-enhanced output as required (Fan et al., 2024).
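For reference, the standard Pix2Pix objective that combines these two losses is written out below. Here $x$ is the input image, $y$ the target image, $z$ the generator's noise input, and $\lambda$ the weight on the $L_1$ term; the source does not state the value of $\lambda$ or any sensor-specific changes to this objective, so this is the generic formulation only.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{cGAN}}(G, D) &= \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big], \\
\mathcal{L}_{L_1}(G) &= \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big], \\
G^{*} &= \arg\min_{G} \max_{D} \; \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda\, \mathcal{L}_{L_1}(G).
\end{aligned}
```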

4. Interaction Patterns, Task Definitions, and Annotation

Six core manipulation patterns are defined:

Assembly: Multi-directional tight-tolerance insertion, demanding precise force/position coupling.
Cutting: Progressive normal force with detection of cut completion by force drop.
Adjustment: Force-torque control for in-hand object reorientation and slip correction.
Peeling: Coupled normal/shear to strip surface layers.
Wiping: Planar shear and pressure modulation to counter frictional drag.
Grasping: Control of normal/shear for securing and manipulating articulated or deformable objects.

Contact force is not measured directly but proxied by marker displacement magnitude $\|\mathbf{d}(\mathbf{x})\|_2$, while shear evolution encodes frictional state. Effective Contact Ratio and pattern-level contact area metrics quantify the extent and quality of contact across trajectories. Tasks are further annotated by temporal-difference and amplitude-based dynamic weights during modeling (Zheng et al., 19 Mar 2026).
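The displacement-magnitude proxy and the force-drop cue used for cut completion can be illustrated together. The sketch below averages $\|\mathbf{d}(\mathbf{x})\|_2$ over the marker grid and flags the first frame that falls below a fraction of the running peak. The function names, the averaging choice, and the 0.5 drop ratio are hypothetical; the source does not specify how the drop is detected.

```python
import numpy as np

def contact_proxy(disp):
    """Mean marker displacement magnitude per frame; disp has shape (T, H, W, 3)."""
    return np.linalg.norm(disp, axis=-1).mean(axis=(1, 2))

def first_force_drop(proxy, drop_ratio=0.5):
    """Index of the first frame whose proxy is below drop_ratio times the
    running peak, or None if no such frame exists."""
    running_peak = np.maximum.accumulate(proxy)
    below = (proxy < drop_ratio * running_peak) & (running_peak > 0)
    hits = np.flatnonzero(below)
    return int(hits[0]) if hits.size else None

# Synthetic trajectory: load ramps up, then releases (made-up values).
levels = np.array([0.0, 1.0, 2.0, 3.0, 1.0, 0.5])
disp = np.zeros((6, 2, 2, 3))
disp[..., 2] = levels[:, None, None]                   # displacement along z only
proxy = contact_proxy(disp)
drop_frame = first_force_drop(proxy)
```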

5. Data Collection, Preprocessing, and Quality Assurance

Acquisition employs both a 7-DoF UFACTORY xArm-7 (parallel-jaw gripper with interchangeable modules) and a handheld TacUMI device. Demonstrations are cross-embodiment, using isomorphic end-effectors. Recording is managed via foot pedal control, regular interface resets, and human-in-the-loop curation (random review of trajectories and full offline validation). Preprocessing includes trimming static frames, timestamp alignment, and clip segmentation.
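Trimming static frames could look like the following sketch, which drops leading and trailing frames whose change from the previous frame stays below a threshold. The function name and the threshold `motion_eps` are hypothetical, and the actual pipeline may define "static" differently (for example from joint states rather than tactile frames).

```python
import numpy as np

def trim_static(frames, motion_eps=1e-3):
    """Remove leading and trailing frames that show no change above motion_eps."""
    frames = np.asarray(frames, dtype=float)
    flat = frames.reshape(len(frames), -1)
    step = np.abs(np.diff(flat, axis=0)).max(axis=1)   # max change per transition, length T-1
    moving = np.flatnonzero(step > motion_eps)
    if moving.size == 0:
        return frames[:0]                              # nothing moved: empty clip
    start, stop = moving[0], moving[-1] + 2            # keep both frames of each moving transition
    return frames[start:stop]

signal = np.array([0.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0])
trimmed = trim_static(signal)
```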

An envisioned OmniViTac sensor would leverage multi-sided transparent skin and a distributed camera array, with inter-facet calibration via shared markers to maintain unified spatial registration (Fan et al., 2024).

6. Benchmarks, Evaluation Protocols, and Performance

OmniViTac supports benchmarking across tactile reconstruction ($L_2$/cosine similarity), multi-horizon tactile prediction, latent diffusion loss, and dynamic-aware loss. Policy learning is evaluated via action diffusion loss and a 60 Hz reflexive recovery controller loss, $\mathcal{L}_{\mathrm{RLTC}} = \|a_r - \hat a_r\|_2^2$, with $a_r$ the human annotation of recovery actions under abnormal contact conditions and $\hat a_r$ the controller's predicted recovery action.
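The reconstruction metrics and the recovery-controller loss are simple enough to state in code. The sketch below is an illustration with hypothetical function names; it assumes the $L_2$ metric is a mean squared error over all elements and that cosine similarity is computed on flattened fields, neither of which the source confirms.

```python
import numpy as np

def l2_error(pred, target):
    """Mean squared error between predicted and reference displacement fields."""
    return float(np.mean((pred - target) ** 2))

def cosine_similarity(pred, target, eps=1e-8):
    """Cosine similarity between the flattened fields."""
    p, t = pred.ravel(), target.ravel()
    return float(p @ t / (np.linalg.norm(p) * np.linalg.norm(t) + eps))

def rltc_loss(a_r, a_r_hat):
    """Squared L2 distance between the two recovery actions, as in L_RLTC."""
    return float(np.sum((a_r - a_r_hat) ** 2))

a_ref = np.array([1.0, 2.0, 3.0])
a_other = np.array([1.0, 2.0, 5.0])
loss = rltc_loss(a_ref, a_other)
```

Because the squared distance is symmetric, the loss value does not depend on which argument is the annotation and which is the prediction.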

OmniVTA, a learned world-model framework built atop OmniViTac, demonstrates 80–90% real-robot task success rates on seen objects and up to 83% on novel/generalized settings, outperforming eight baselines under task diversity, generalization, and perturbation. Inference timings distinguish between slow (230–480 ms) and fast (3.5 ms/60 Hz) closed-loop policies (Zheng et al., 19 Mar 2026).
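The reported timings imply a simple rate budget, worked through below. The sketch assumes the fast policy runs once per 60 Hz tick while the slow policy updates asynchronously in the background; that scheduling arrangement and the function name are assumptions, and only the latency figures come from the text.

```python
import math

FAST_HZ = 60.0
fast_period_ms = 1000.0 / FAST_HZ                      # about 16.7 ms per control tick

def fast_ticks_per_slow_update(slow_latency_ms: float) -> int:
    """Number of fast control ticks that elapse during one slow-policy inference."""
    return math.ceil(slow_latency_ms / fast_period_ms)

fits_in_tick = 3.5 < fast_period_ms                    # fast inference fits inside one tick
ticks_low = fast_ticks_per_slow_update(230.0)
ticks_high = fast_ticks_per_slow_update(480.0)
print(fits_in_tick, ticks_low, ticks_high)
```

Under these assumptions the fast controller issues roughly 14 to 29 commands between consecutive slow-policy outputs.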

7. Implications and Extensions

The OmniViTac paradigm, by extending ViTacTip's see-through-skin, biomimetic-tip, and GAN-based modality-switching principles to omnidirectional configurations, enables full 360° visuo-tactile sensing. Multi-view conditional GANs are proposed for panoramic, multi-facet data fusion, with 3D-aware adversarial losses for seamless output. Future designs may employ softer skin materials (e.g., Ecoflex), micro-fisheye cameras, and multimodal transformer models for robust, contact-shape-aware perception across complex object geometries (Fan et al., 2024).

A plausible implication is that the OmniViTac blueprint supports scalable, generalizable learning for contact-rich manipulation, addressing limitations in prior datasets and methods by coupling high-fidelity physical interaction logging with advanced multimodal representation learning.


References

Zheng et al. (2026). OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation.

Fan et al. (2024). ViTacTip: Design and Verification of a Novel Biomimetic Physical Vision-Tactile Fusion Sensor.
