TacThru-UMI: Multimodal Tactile-Visual Robotic Manipulation
- The paper introduces TacThru-UMI, a multimodal imitation learning framework that integrates tactile and visual feedback for robust robotic manipulation.
- It employs a novel See-Through-Skin sensor with keyline markers and Kalman filtering to achieve precise, low-latency tracking of tactile events.
- The approach leverages a Transformer-based diffusion policy for action generation, resulting in state-of-the-art success rates across diverse manipulation tasks.
TacThru-UMI is a multimodal imitation learning framework for robotic manipulation that fuses simultaneous tactile and visual information from a novel See-Through-Skin (STS) sensor, TacThru, and uses a Transformer-based diffusion policy for action generation. It was developed to address the limitations of conventional tactile-visual approaches, which often alternate between sensing modes or suffer from unreliable contact perception. TacThru-UMI achieves state-of-the-art success rates across a diverse suite of contact-rich manipulation benchmarks. Its central innovation is pairing synchronized, high-fidelity tactile and visual signals with a learning architecture that can use both at every timestep, which enables robust policy learning and execution in challenging manipulation scenarios (Li et al., 10 Dec 2025).
1. Sensor Design and Signal Acquisition
The TacThru sensor is constructed as a fully transparent elastomer-based STS module incorporating persistent white LED illumination and 64 "keyline" concentric ring surface markers in an 8×8 grid (3.5 mm spacing, 40×40 mm area). The transparent silicone medium (Shore 00–10) enables the embedded miniature camera to capture both visual information from the workspace and direct evidence of contact events as changes in surface reflectance and marker displacement. The keyline design—featuring black inner and white outer rings with radii 0.6 mm and 1.0 mm, respectively—preserves marker visibility under arbitrary backgrounds, overcoming limitations of previous solid-paint marker techniques.
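As a small illustration of the marker layout described above, the nominal marker centres can be generated from the grid size and spacing given in the text. This is only a sketch: the function name, the choice to centre the grid on the origin, and the row-major ordering are assumptions made for illustration, not details from the source.

```python
import numpy as np

GRID_N = 8          # markers per side (8x8 grid, from the text)
SPACING_MM = 3.5    # centre-to-centre spacing in mm (from the text)


def nominal_marker_grid(n=GRID_N, spacing=SPACING_MM):
    """Return an (n*n, 2) array of marker centres in mm.

    Assumption: the grid is centred on the origin and listed row by row.
    """
    coords = (np.arange(n) - (n - 1) / 2.0) * spacing
    xx, yy = np.meshgrid(coords, coords, indexing="xy")
    return np.stack([xx.ravel(), yy.ravel()], axis=1)


grid = nominal_marker_grid()
# Distance between the outermost marker centres along each axis.
extent_mm = grid.max(axis=0) - grid.min(axis=0)
```

With these values the outermost marker centres are 7 × 3.5 mm = 24.5 mm apart along each axis, which fits inside the stated 40×40 mm area.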
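Marker locations are tracked at up to 120 Hz via Kalman filtering, using a linear Gaussian random-walk model:

$$x_k = x_{k-1} + w_k, \quad w_k \sim \mathcal{N}(0, Q), \qquad z_k = x_k + v_k, \quad v_k \sim \mathcal{N}(0, R),$$

where $x_k$ is a marker's image position at frame $k$, $z_k$ is its measured centroid, and $Q$ and $R$ are the process and measurement noise covariances. Marker centroids from blob detection are associated with tracks via nearest-neighbor matching to the prior predictions. The displacement of each tracked marker serves as the primary tactile cue. This approach yields robust, low-latency tracking even under challenging visual backgrounds, with a median frame processing time of 6.08 ms and negligible marker dropout.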
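The random-walk filtering and nearest-neighbor association described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the isotropic scalar variances `q`, `r`, and `p0`, and the distance gate are all assumed values chosen for the example.

```python
import numpy as np


class MarkerTracker:
    """One random-walk Kalman filter per marker with scalar (isotropic) variances.

    All numeric defaults are illustrative assumptions, not values from the source.
    """

    def __init__(self, init_positions, q=0.5, r=1.0, p0=1.0, gate=5.0):
        self.x = np.asarray(init_positions, dtype=float).copy()  # (N, 2) estimates
        self.ref = self.x.copy()                                  # rest positions
        self.p = np.full(len(self.x), p0)                         # per-marker variance
        self.q, self.r, self.gate = q, r, gate

    def step(self, detections):
        """Fuse one frame of blob centroids; return (N, 2) marker displacements."""
        detections = np.asarray(detections, dtype=float).reshape(-1, 2)
        # Predict: a random walk keeps the mean and inflates the variance.
        p_pred = self.p + self.q
        if len(detections) == 0:
            self.p = p_pred
            return self.x - self.ref
        # Associate: nearest detection to each predicted marker position.
        dist = np.linalg.norm(self.x[:, None, :] - detections[None, :, :], axis=2)
        nearest = dist.argmin(axis=1)
        matched = dist[np.arange(len(self.x)), nearest] < self.gate
        # Update: standard Kalman gain for an identity measurement model.
        gain = p_pred / (p_pred + self.r)
        innovation = detections[nearest] - self.x
        self.x = np.where(matched[:, None], self.x + gain[:, None] * innovation, self.x)
        self.p = np.where(matched, (1.0 - gain) * p_pred, p_pred)
        return self.x - self.ref


# Usage: two markers, detections arrive in a different order than the tracks.
tracker = MarkerTracker([[0.0, 0.0], [10.0, 0.0]])
displacement = tracker.step([[10.5, 0.0], [0.4, 0.2]])
```

Markers without a detection inside the gate keep their predicted state, so a brief dropout only increases their uncertainty. The greedy matching here can assign one detection to two markers; a one-to-one assignment step would be needed if markers move by more than about half the grid spacing between frames.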
2. Multimodal Representation and Policy Architecture
TacThru-UMI collects synchronized modalities at each decision step $t$:
- Wrist-camera frames
- Sensor (close-up) frames
- Keyline marker deviations
- Robot proprioception (end-effector pose, gripper width)
Visual frames are embedded using DINOv2 Vision Transformer (ViT) encoders (ViT-Base for the wrist camera, ViT-Small for the sensor camera). Tactile and proprioceptive streams are processed via small MLPs. Each token receives a modality-specific learnable embedding and positional encoding. The concatenated token sequence from all four modalities conditions a Transformer-based diffusion policy.
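The token assembly described above can be illustrated with a small NumPy sketch. It is hypothetical throughout: random arrays stand in for encoder outputs, a single linear projection stands in for each MLP or encoder head, one token per frame is assumed, and `D_MODEL`, the window lengths, and the proprioception width are illustrative choices. The 768 and 384 widths are the standard DINOv2 ViT-B and ViT-S feature sizes, and 128 corresponds to 64 markers with 2-D displacements.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 64  # hypothetical shared token width


def make_tokens(features, proj, modality_emb, pos_emb):
    """Project per-step features to D_MODEL, then add modality and positional embeddings."""
    tokens = features @ proj                       # (steps, D_MODEL)
    return tokens + modality_emb + pos_emb[: len(tokens)]


# (feature width, observation steps) per modality; step counts are assumptions.
specs = {
    "wrist": (768, 2),    # DINOv2 ViT-B feature width
    "sensor": (384, 2),   # DINOv2 ViT-S feature width
    "markers": (128, 2),  # 64 markers x 2-D displacement
    "proprio": (7, 2),    # hypothetical pose + gripper width vector
}

pos_emb = rng.normal(size=(8, D_MODEL)) * 0.02     # positional table (assumption)
sequence = []
for name, (width, steps) in specs.items():
    feats = rng.normal(size=(steps, width))        # placeholder encoder outputs
    proj = rng.normal(size=(width, D_MODEL)) / np.sqrt(width)
    modality_emb = rng.normal(size=(D_MODEL,)) * 0.02  # one learnable vector per modality
    sequence.append(make_tokens(feats, proj, modality_emb, pos_emb))

# Concatenate along the token axis to form the conditioning sequence.
cond = np.concatenate(sequence, axis=0)
```

Adding a per-modality vector lets a single Transformer tell apart tokens that share the same time index but come from different sensors.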
Actions are parameterized as end-effector pose increments and gripper width targets. The policy is learned via conditional denoising diffusion: the forward process adds noise to action sequences, and the reverse process is modeled by a network conditioned on the observation tokens and trained with the standard loss for denoising diffusion models.
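The conditional denoising diffusion training described above can be sketched as below. This assumes the common DDPM noise-prediction formulation with a mean-squared-error objective and a linear beta schedule. The source does not confirm these specifics, and the schedule values, action shape, and function names are illustrative.

```python
import numpy as np


def make_alpha_bar(num_steps=100, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(a0, k, alpha_bar, eps):
    """Forward process: noise a clean action sequence a0 to diffusion step k."""
    return np.sqrt(alpha_bar[k]) * a0 + np.sqrt(1.0 - alpha_bar[k]) * eps


def denoising_loss(eps_model, a0, cond, alpha_bar, rng):
    """Noise-prediction MSE for one randomly drawn diffusion step."""
    k = rng.integers(0, len(alpha_bar))
    eps = rng.standard_normal(a0.shape)
    a_k = q_sample(a0, k, alpha_bar, eps)
    pred = eps_model(a_k, k, cond)      # the network would be a Transformer in practice
    return float(np.mean((pred - eps) ** 2))


# Usage with a placeholder model that always predicts zero noise.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
actions = rng.standard_normal((16, 7))  # hypothetical (horizon, action_dim)
loss = denoising_loss(lambda a_k, k, c: np.zeros_like(a_k), actions, None, alpha_bar, rng)
```

At inference time the learned noise predictor is applied iteratively, starting from Gaussian noise, to recover an action sequence conditioned on the current observation tokens.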
3. Training Methodology and System Integration
TacThru-UMI demonstrations are collected using a UMI-compatible data collection rig, with time synchronization across all sensor streams and robust pose tracking via HTC Vive. Each manipulation task has 62–147 human demonstrations, and all data are stored in the Zarr format. Training employs the AdamW optimizer with one-cycle learning-rate scheduling and default weight decay. Separate observation windows are used for wrist frames, sensor frames, and proprioception, together with a fixed prediction horizon and a shorter execution chunk. No auxiliary regularizers are used aside from weight decay, and off-the-shelf ViT encoders are used without fine-tuning for practical deployment.
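The relationship between the observation window, prediction horizon, and execution chunk can be illustrated with a receding-horizon control loop. This is a simplified sketch, not the authors' code: it uses a single observation window for all modalities, and the function names and the numeric values in the usage example are hypothetical.

```python
from collections import deque

import numpy as np


def run_receding_horizon(policy, get_obs, apply_action, obs_window, exec_chunk, num_steps):
    """Predict a full action horizon, execute only the first chunk, then re-plan."""
    history = deque(maxlen=obs_window)   # keeps only the most recent observations
    history.append(get_obs())
    executed = 0
    calls = 0
    while executed < num_steps:
        plan = policy(list(history))     # (horizon, action_dim) array
        calls += 1
        for action in plan[:exec_chunk]:
            if executed >= num_steps:
                break
            apply_action(action)
            history.append(get_obs())
            executed += 1
    return executed, calls


# Usage: a toy 1-D system whose policy always proposes unit increments.
state = {"x": 0.0}


def toy_policy(history):
    return np.ones((8, 1))               # hypothetical horizon of 8 steps


def toy_apply(action):
    state["x"] += float(action[0])       # increments accumulate, like pose deltas


executed, calls = run_receding_horizon(
    toy_policy, lambda: state["x"], toy_apply,
    obs_window=2, exec_chunk=4, num_steps=10,
)
```

Executing only part of each predicted sequence trades some temporal consistency for responsiveness, since the policy re-plans from fresh tactile and visual observations after every chunk.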
4. Experimental Evaluation and Quantitative Results
TacThru-UMI was evaluated on five real-world tasks with randomized test seeds (20–24 seeds per task). Success criteria were task-specific and included basic pick-and-place (PickBottle), thin/soft object manipulation (PullTissue), visual discrimination (SortBolt), tactile discrimination (HangScissors), and multimodal fusion (InsertCap).
The table below summarizes success rates:
| Task | TT-M | TT | GS-M | Wrist |
|---|---|---|---|---|
| PickBottle | 97.5 ±2.1 | 96.3 ±3.0 | 95.8 ±3.5 | 95.0 ±3.7 |
| PullTissue | 88.0 ±4.0 | 60.5 ±5.2 | 10.0 ±4.5 | 12.5 ±5.0 |
| SortBolt | 90.0 ±3.3 | 85.0 ±4.1 | 45.0 ±6.1 | 38.0 ±5.5 |
| HangScissors | 82.5 ±4.7 | 80.0 ±5.0 | 83.3 ±4.3 | 35.0 ±6.2 |
| InsertCap | 90.0 ±3.2 | 75.0 ±5.8 | 40.0 ±6.5 | 30.0 ±7.0 |
| Avg. | 85.5 ±2.9 | 79.4 ±4.0 | 52.8 ±6.1 | 42.1 ±6.1 |
TT-M: TacThru-UMI with marker deviations; TT: ablation without marker deviations; GS-M: alternating tactile–visual with GelSight Mini; Wrist: vision-only. Paired t-tests confirm the superiority of simultaneous TT-M over GS-M and Wrist (p < 0.01). Ablation studies show a drop of about 6 percentage points in average success when marker deviations are removed (85.5 to 79.4), and a ∼12% drop if keyline markers are replaced with solid markers.
Notably, TacThru-UMI enables robust fallback in scenarios where conventional tactile-only information is insufficient (e.g., PullTissue tasks), with the visual stream supporting success when contact forces are below sensor thresholds.
5. Baseline Analysis and Comparative Performance
TacThru-UMI outperforms baseline architectures:
- Alternating tactile-visual (GS-M) struggles on tasks requiring fluid transition between modalities and fails on thin/soft object extraction (PullTissue, InsertCap fallback).
- Vision-only (Wrist) exhibits poor performance on contact-driven tasks (HangScissors, InsertCap), lacking reliable indicators of forceful or occluded contacts.
- Removing marker deviations (TT) reduces average task success by about 6 percentage points, indicating that they contribute to precise policy learning and execution.
This architecture leverages adaptive multimodal strategies: the Transformer-based diffusion policy can switch between vision and tactile guidance as context demands, e.g., using vision for alignment and tactile for post-occlusion adjustment in InsertCap.
6. Technical Insights and Impact
TacThru-UMI validates that true simultaneous tactile–visual perception (enabled by transparent elastomer fingertips, persistent illumination, keyline markers, and efficient tracking) with a modern diffusion-based Transformer policy yields significant advances in both basic and fine-grained robotic manipulation. The system demonstrates direct applicability of off-the-shelf, pre-trained visual encoders (DINOv2 ViT), lowering the integration barrier into vision-based pipelines.
TacThru-UMI excels in scenarios classically challenging for robotic systems: contact detection with thin/soft objects, precision insertions, and manipulation under visual occlusion. The findings suggest that the explicit fusion of tactile and visual feedback at every timestep allows robust, responsive policy execution that adapts dynamically to emergent task conditions (Li et al., 10 Dec 2025).
7. Limitations and Prospective Extensions
TacThru-UMI’s principal limitation lies in its reliance on highly controlled tactile marker tracking and synchronization, which may be sensitive to marker occlusion or degradation over prolonged use. Some tasks demonstrate reduced performance when marker features are unavailable or ambiguous. A plausible implication is that future systems might require self-calibrating or self-healing marker arrays or fusion with other contact modalities to ensure robustness.
Extending the system to more varied manipulation challenges (e.g., deformable or fragile objects and long-horizon, contact-intensive sequences) is a proposed future direction. The architecture could also be applied to on-policy reinforcement learning and to learning from demonstration at scale, broadening its use beyond imitation learning frameworks.