OmniViTac: 360° Visuo-Tactile Sensing
- OmniViTac is a framework and dataset that fuses 360° visual and tactile perception for contact-rich robotic manipulation.
- It integrates biomimetic sensor design with synchronized multimodal data streams to enable precise closed-loop control across diverse tasks.
- It sets new benchmarks by combining advanced deep learning methods like TactileVAE and GANs with large-scale, well-annotated manipulation trajectories.
OmniViTac refers to a framework and dataset dedicated to the fusion of omnidirectional visual and tactile sensing for contact-rich robotic manipulation. It encompasses both the design of sensor hardware—drawing on principles established by ViTacTip and related tactile vision modalities—and the curation of a large-scale benchmark dataset for learning and evaluating closed-loop, visuo-tactile control policies. OmniViTac establishes new standards for scale, task diversity, and multimodal perception in contact-centric robotics (Zheng et al., 19 Mar 2026, Fan et al., 2024).
1. Mechanical and Sensing Architecture
The OmniViTac sensor design combines physical and algorithmic advances aimed at seamless 360° visuo-tactile perception. The proposed architecture is extrapolated from ViTacTip principles: a geodesic dome of transparent elastomer (Agilus30 Clear, Shore A ≈ 30, optical transmission >90%) encloses a polyhedral core. Distributed across the dome's facets are marker-bearing biomimetic tips arranged in a hexagonal lattice, each carrying a black disc fiducial for deformation tracking. Behind each facet sits a miniature camera module (e.g., ELP USBFHD06H-L180, 1920×1080 px, ~120° FoV) with dedicated LED ring illumination (Fan et al., 2024).
During operation, external lighting and internal LED arrays are alternated to allow for both "see-through-skin" visual imaging and high-contrast tactile imaging. The system supports direct object texture imaging through the skin as well as fine spatial registration of local contact via marker displacement.
Key mechanical parameters:
- Skin thickness: 1.0–1.5 mm, balancing robustness against transparency.
- Dome curvature: radius ≈ 15 mm.
- Biomimetic tip: 0.5 mm shaft, 3 mm height, 0.8 mm marker, ~2 mm pitch.
- Camera arrangement: ≥6 per dome for omnidirectional coverage.
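To make the marker layout concrete, the following is a minimal sketch of how marker centres on a hexagonal lattice with the ~2 mm pitch listed above could be generated for one flat facet. The function name `hex_lattice` and the 6 mm facet radius are illustrative assumptions, not values from the source, and the sketch ignores dome curvature.

```python
import math

def hex_lattice(pitch_mm: float, radius_mm: float) -> list[tuple[float, float]]:
    """Return (x, y) marker centres of a hexagonal lattice inside a disc.

    Hypothetical helper: the facet is approximated as a flat disc.
    """
    points = []
    row_step = pitch_mm * math.sqrt(3) / 2  # spacing between adjacent rows
    n_rows = int(radius_mm // row_step)
    n_cols = int(radius_mm // pitch_mm) + 1
    for r in range(-n_rows, n_rows + 1):
        y = r * row_step
        # Odd rows are shifted by half a pitch to form the hexagonal pattern.
        x_offset = pitch_mm / 2 if r % 2 else 0.0
        for c in range(-n_cols, n_cols + 1):
            x = c * pitch_mm + x_offset
            if math.hypot(x, y) <= radius_mm:
                points.append((x, y))
    return points

# Assumed facet radius of 6 mm; pitch of 2 mm follows the parameter list.
markers = hex_lattice(pitch_mm=2.0, radius_mm=6.0)
```

In this layout every marker's nearest neighbours lie one pitch away, which keeps the displacement field uniformly sampled across the facet.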
2. Dataset Composition and Structure
The OmniViTac dataset comprises 21,879 robot- and human-demonstrated manipulation trajectories, spanning 86 distinct tasks and 126 unique objects. The protocol systematically covers six physics-grounded interaction patterns:
- Assembly
- Cutting
- Adjustment (in-hand reorientation)
- Peeling
- Wiping
- Grasping
Each task is instantiated across five semantic scenarios (Kitchen, Fruit Shop, Industrial, Chemistry Lab, Office). Per-pattern trajectory counts range from 1,000 (Assembly, Cutting) to 9,200 (Grasping). Data streams include:
- Vision: synchronized RGB-D at 30 Hz
- Tactile: 2D marker displacement fields at up to 60 Hz; per-frame H×W×3 arrays
- Proprioception/actions: joint states (up to 60 Hz), gripper aperture, 3D end-effector deltas
Timestamp alignment ensures ≤10 ms synchronization error. Data is validated by periodic human review and automatic checks throughout collection (Zheng et al., 19 Mar 2026).
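One common way to pair streams recorded at different rates is nearest-timestamp matching with a tolerance. The sketch below illustrates that idea for a 30 Hz vision stream and a 60 Hz tactile stream using the 10 ms bound quoted above. It is not the dataset's actual alignment procedure, and `align_nearest` is a hypothetical helper.

```python
from bisect import bisect_left

def align_nearest(ref_ts, other_ts, tol_s=0.010):
    """For each reference timestamp, return the index of the nearest timestamp
    in other_ts (sorted, seconds), or None if the gap exceeds tol_s."""
    matches = []
    for t in ref_ts:
        i = bisect_left(other_ts, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_ts)]
        best = min(candidates, key=lambda j: abs(other_ts[j] - t), default=None)
        if best is not None and abs(other_ts[best] - t) <= tol_s:
            matches.append(best)
        else:
            matches.append(None)
    return matches

vision_ts = [k / 30 for k in range(4)]            # 30 Hz RGB-D frames
tactile_ts = [k / 60 + 0.002 for k in range(8)]   # 60 Hz tactile, assumed 2 ms offset
pairs = align_nearest(vision_ts, tactile_ts)      # every second tactile frame is matched
```

Reference frames with no partner inside the tolerance are returned as `None`, so they can be dropped rather than silently paired with stale data.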
3. Sensing Modalities and Self-Supervised Encoding
OmniViTac leverages tightly integrated vision and tactile modalities:
- Visual input: wide-field, see-through-skin imaging for distal texture and geometry
- Tactile input: marker displacement tracking, which yields 3D surface and shear/normal force cues
Self-supervised tactile encoding employs a causal 3D-convolutional variational autoencoder (TactileVAE) that compresses spatio-temporal marker displacements into latent tensors. Reconstruction uses an implicit neural representation in which an MLP decodes the latent together with a positional encoding. Training minimizes a VAE loss that combines a reconstruction term with KL-divergence regularization (Zheng et al., 19 Mar 2026).
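As a rough illustration of a loss with this structure, the sketch below computes a reconstruction term plus the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The mean-squared-error reconstruction term, the `beta` weight, and the function name `vae_loss` are assumptions; the source states only that reconstruction and KL terms are combined.

```python
import math

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Return (total, reconstruction, kl) for flat lists of floats.

    reconstruction: mean squared error between x and x_hat (assumed choice).
    kl: KL(N(mu, diag(exp(log_var))) || N(0, I)) in closed form.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + beta * kl, recon, kl

# A perfect reconstruction with a posterior equal to the prior gives zero loss.
total, recon, kl = vae_loss([1.0, 2.0], [1.0, 2.0], mu=[0.0, 0.0], log_var=[0.0, 0.0])
```

The KL term grows as the posterior mean moves away from zero or its variance departs from one, which is what regularizes the latent space.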
A GAN-based approach (conditional Pix2Pix) is used for disentangling the vision and tactile streams and switching between them. The generators (U-Net) and discriminators (PatchGAN) are trained with adversarial and L1 losses, producing marker-free or marker-enhanced output as required (Fan et al., 2024).
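For reference, the standard Pix2Pix objective pairs a conditional adversarial term with an L1 reconstruction term. The form below is the usual textbook formulation with the noise input omitted; the symbols $x$ (input image), $y$ (target image), and the weight $\lambda$ are generic notation, and the weighting actually used for the sensor is not stated in the source.

```latex
\mathcal{L}_{\mathrm{cGAN}}(G, D) =
  \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big],
\qquad
\mathcal{L}_{L_1}(G) = \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_1\big],
\qquad
G^{*} = \arg\min_{G}\max_{D}\;
  \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda\,\mathcal{L}_{L_1}(G).
```

The L1 term keeps the generated image close to the target at the pixel level, while the PatchGAN discriminator penalizes unrealistic local texture.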
4. Interaction Patterns, Task Definitions, and Annotation
Six core manipulation patterns are defined:
- Assembly: Multi-directional tight-tolerance insertion, demanding precise force/position coupling.
- Cutting: Progressive normal force with detection of cut completion by force drop.
- Adjustment: Force-torque control for in-hand object reorientation and slip correction.
- Peeling: Coupled normal/shear to strip surface layers.
- Wiping: Planar shear and pressure modulation to counter frictional drag.
- Grasping: Control of normal/shear for securing and manipulating articulated or deformable objects.
Contact force is not measured directly but is proxied by the magnitude of marker displacement, while the evolution of shear encodes the frictional state. Effective Contact Ratio and pattern-level contact-area metrics quantify the extent and quality of contact across trajectories. During modeling, tasks are additionally annotated with temporal-difference and amplitude-based dynamic weights (Zheng et al., 19 Mar 2026).
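The displacement-based proxy can be sketched as follows. The source does not give the exact definition of Effective Contact Ratio, so the thresholded fraction of frames used here is only an assumed stand-in. The helper names and the threshold value are likewise hypothetical.

```python
import math

def displacement_magnitude(frame):
    """Mean Euclidean norm of per-marker (dx, dy) displacements in one tactile frame."""
    return sum(math.hypot(dx, dy) for dx, dy in frame) / len(frame)

def contact_ratio(frames, threshold):
    """Fraction of frames whose mean displacement exceeds a threshold.

    Assumed stand-in for a contact-extent metric, not the dataset's definition.
    """
    in_contact = [displacement_magnitude(f) > threshold for f in frames]
    return sum(in_contact) / len(frames)

# Three toy frames with two markers each: no contact, firm contact, light contact.
frames = [
    [(0.0, 0.0), (0.0, 0.0)],
    [(3.0, 4.0), (3.0, 4.0)],
    [(0.6, 0.8), (0.6, 0.8)],
]
ratio = contact_ratio(frames, threshold=0.5)
```

Because the proxy is unitless pixel displacement, any threshold would need to be calibrated per sensor before it could be read as a force level.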
5. Data Collection, Preprocessing, and Quality Assurance
Acquisition employs both a 7-DoF UFACTORY xArm-7 (parallel-jaw gripper with interchangeable modules) and a handheld TacUMI device. Demonstrations are cross-embodiment, using isomorphic end-effectors. Recording is managed via foot pedal control, regular interface resets, and human-in-the-loop curation (random review of trajectories and full offline validation). Preprocessing includes trimming static frames, timestamp alignment, and clip segmentation.
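Trimming static frames could be done in several ways. The sketch below shows one simple option: it finds the first and last frames where the end-effector position changes by more than a small tolerance and keeps the span between them. The criterion, the `eps` value, and the name `trim_static` are assumptions rather than the pipeline described in the source.

```python
def trim_static(positions, eps=1e-3):
    """Return (start, end) slice bounds that drop leading and trailing static frames.

    positions: list of (x, y, z) end-effector positions, one per frame.
    A frame counts as moving if any coordinate changes by more than eps
    relative to the previous frame.
    """
    def moved(i):
        prev, cur = positions[i - 1], positions[i]
        return max(abs(p - q) for p, q in zip(prev, cur)) > eps

    moving = [i for i in range(1, len(positions)) if moved(i)]
    if not moving:
        return 0, 0  # nothing moved: the whole clip is static
    # Keep one frame before the first motion so the clip starts from rest.
    return moving[0] - 1, moving[-1] + 1

positions = [(0.0, 0.0, 0.0)] * 3 + [(0.01, 0.0, 0.0), (0.02, 0.0, 0.0)] + [(0.02, 0.0, 0.0)] * 2
start, end = trim_static(positions)
clip = positions[start:end]
```

Returning slice bounds rather than a trimmed list lets the same indices be applied to the vision and tactile streams after timestamp alignment.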
An envisioned OmniViTac sensor would leverage multi-sided transparent skin and a distributed camera array, with inter-facet calibration via shared markers to maintain unified spatial registration (Fan et al., 2024).
6. Benchmarks, Evaluation Protocols, and Performance
OmniViTac supports benchmarking across tactile reconstruction (e.g., cosine similarity), multi-horizon tactile prediction, latent diffusion loss, and dynamic-aware loss. Policy learning is evaluated via an action diffusion loss and a loss for the 60 Hz reflexive recovery controller, whose targets are human annotations of recovery actions under abnormal contact conditions.
OmniVTA, a learned world-model framework built atop OmniViTac, demonstrates 80–90% real-robot task success rates on seen objects and up to 83% on novel/generalized settings, outperforming eight baselines under task diversity, generalization, and perturbation. Inference timings distinguish between slow (230–480 ms) and fast (3.5 ms/60 Hz) closed-loop policies (Zheng et al., 19 Mar 2026).
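A quick back-of-the-envelope check puts the two timing regimes side by side. The sketch below assumes the fast policy runs once per 60 Hz tick while the slow policy refreshes in the background; that interleaving is an illustrative assumption, since the source reports only the latencies.

```python
import math

FAST_HZ = 60
tick_ms = 1000 / FAST_HZ  # one 60 Hz control period, about 16.7 ms

def ticks_per_slow_update(slow_latency_ms):
    """Number of fast control ticks that elapse while one slow inference completes."""
    return math.ceil(slow_latency_ms / tick_ms)

fast_fits = 3.5 < tick_ms                  # fast inference fits within a single tick
best_case = ticks_per_slow_update(230)     # slow policy at its quickest
worst_case = ticks_per_slow_update(480)    # slow policy at its slowest
```

Under these assumptions the fast controller executes roughly 14 to 29 control ticks between consecutive slow-policy outputs, which is why a reflexive loop is needed to react to contact events in the meantime.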
7. Implications and Extensions
The OmniViTac paradigm, by extending ViTacTip's see-through-skin, biomimetic-tip, and GAN-based modality-switching principles to omnidirectional configurations, enables full 360° visuo-tactile sensing. Multi-view conditional GANs are proposed for panoramic, multi-facet data fusion, with 3D-aware adversarial losses for seamless output. Future designs may employ softer skin materials (e.g., Ecoflex), micro-fisheye cameras, and multimodal transformer models for robust, contact-shape-aware perception across complex object geometries (Fan et al., 2024).
A plausible implication is that the OmniViTac blueprint supports scalable, generalizable learning for contact-rich manipulation, addressing limitations in prior datasets and methods by coupling high-fidelity physical interaction logging with advanced multimodal representation learning.