TacThru-UMI: Multimodal Tactile-Visual Robotic Manipulation

Updated 2 July 2026

The paper introduces TacThru-UMI, a multimodal imitation learning framework that integrates tactile and visual feedback for robust robotic manipulation.
It employs a novel See-Through-Skin sensor with keyline markers and Kalman filtering to achieve precise, low-latency tracking of tactile events.
The approach leverages a Transformer-based diffusion policy for action generation, resulting in state-of-the-art success rates across diverse manipulation tasks.

TacThru-UMI is a multimodal imitation learning framework for robotic manipulation that fuses simultaneous tactile and visual information using a novel See-Through-Skin (STS) sensor (TacThru) and leverages a Transformer-based diffusion policy for action generation. Developed to address the limitations of conventional tactile-visual approaches—which often rely on alternating modes or suffer from unreliable contact perception—TacThru-UMI achieves state-of-the-art success rates across a diverse suite of contact-rich manipulation benchmarks. Its central innovation lies in the seamless integration of synchronized, high-fidelity tactile and visual signals with a suitable learning architecture, enabling robust policy learning and execution in challenging manipulation scenarios (Li et al., 10 Dec 2025).

1. Sensor Design and Signal Acquisition

The TacThru sensor is constructed as a fully transparent elastomer-based STS module incorporating persistent white LED illumination and 64 "keyline" concentric ring surface markers in an 8×8 grid (3.5 mm spacing, 40×40 mm area). The transparent silicone medium (Shore 00–10) enables the embedded miniature camera to capture both visual information from the workspace and direct evidence of contact events as changes in surface reflectance and marker displacement. The keyline design—featuring black inner and white outer rings with radii 0.6 mm and 1.0 mm, respectively—preserves marker visibility under arbitrary backgrounds, overcoming limitations of previous solid-paint marker techniques.
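The marker layout described above can be sketched in a few lines. This is only an illustration: the function name `keyline_grid`, the choice to center the grid on the pad, and the reading of "40×40 mm" as the pad size are assumptions, not details confirmed by the source.

```python
def keyline_grid(rows=8, cols=8, spacing_mm=3.5, pad_mm=40.0):
    """Return (x, y) marker centers in mm for a rows x cols grid.

    Assumption: the grid is centered on a square pad of side pad_mm.
    """
    span_x = (cols - 1) * spacing_mm   # distance between first and last column
    span_y = (rows - 1) * spacing_mm   # distance between first and last row
    x0 = (pad_mm - span_x) / 2.0       # left margin so the grid is centered
    y0 = (pad_mm - span_y) / 2.0       # top margin so the grid is centered
    return [(x0 + c * spacing_mm, y0 + r * spacing_mm)
            for r in range(rows) for c in range(cols)]


markers = keyline_grid()
print(len(markers), markers[0], markers[-1])
```

With these numbers the 64 markers span 24.5 mm in each direction, leaving a margin around the grid inside the 40 mm pad.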

Marker locations $x_t \in \mathbb{R}^2$ are tracked at up to 120 Hz via Kalman filtering, using a linear Gaussian random-walk model:

$x_t = A\,x_{t-1} + w_t,\quad z_t = H\,x_t + v_t,$

with $A = H = I_2$, $w_t \sim \mathcal{N}(0, \sigma_w^2 I_2)$, and $v_t \sim \mathcal{N}(0, \sigma_v^2 I_2)$. In each frame, centroids found by blob detection are associated with markers by nearest-neighbor matching against the filter's predicted positions. The displacement $\Delta x_t = \hat x_t - \hat x_0$ of each filtered estimate from its initial position serves as the primary tactile cue. This approach yields robust, low-latency tracking even under challenging visual backgrounds, with a median frame processing time of 6.08 ms and negligible marker dropout.
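A minimal sketch of this tracking scheme follows, assuming isotropic covariances so that each marker's filter reduces to scalar arithmetic. The class and function names, the noise values, the greedy matching order, and the gating distance `gate` are all hypothetical choices for illustration and are not taken from the source.

```python
import math


class MarkerKalman:
    """Random-walk Kalman filter for one 2-D marker (A = H = I_2)."""

    def __init__(self, x0, p0=1.0, sigma_w=0.5, sigma_v=1.0):
        self.x = list(x0)        # current estimate \hat{x}_t
        self.x_ref = tuple(x0)   # reference estimate \hat{x}_0
        self.p = p0              # covariance is p * I_2 (isotropic assumption)
        self.q = sigma_w ** 2    # process noise variance
        self.r = sigma_v ** 2    # measurement noise variance

    def predict(self):
        # With A = I the mean is unchanged; only the uncertainty grows.
        self.p += self.q
        return tuple(self.x)

    def update(self, z):
        k = self.p / (self.p + self.r)  # scalar Kalman gain
        self.x = [xi + k * (zi - xi) for xi, zi in zip(self.x, z)]
        self.p *= (1.0 - k)

    def displacement(self):
        return (self.x[0] - self.x_ref[0], self.x[1] - self.x_ref[1])


def track_frame(filters, blobs, gate=5.0):
    """Greedy nearest-neighbor association of blob centroids to predictions."""
    unused = list(blobs)
    for f in filters:
        pred = f.predict()
        if not unused:
            continue  # no measurement left: keep the prediction
        nearest = min(unused, key=lambda b: math.dist(b, pred))
        if math.dist(nearest, pred) <= gate:  # hypothetical gating threshold
            f.update(nearest)
            unused.remove(nearest)
    return [f.displacement() for f in filters]


filters = [MarkerKalman((10.0, 10.0)), MarkerKalman((20.0, 10.0))]
deltas = track_frame(filters, [(20.5, 10.0), (10.0, 11.0)])
print(deltas)
```

A marker with no blob inside the gate simply keeps its predicted position for that frame, which is one simple way to ride out a brief dropout.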

2. Multimodal Representation and Policy Architecture

TacThru-UMI collects synchronized modalities at each decision step $t$:

Wrist-camera frames $\mathbf{I}_w^t$
Sensor (close-up) frames $\mathbf{I}_s^t$
Keyline marker deviations $\Delta\mathbf{x}^t$
Robot proprioception (end-effector pose, gripper width)
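The per-step observation listed above can be pictured as a simple record. The sketch below is illustrative only: the class name, field names, the 7-element pose layout, and the flattening of 64 marker deviations into a 128-dimensional vector are assumptions rather than details from the source.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class StepObservation:
    wrist_frame: object                        # wrist-camera image I_w^t
    sensor_frame: object                       # TacThru close-up image I_s^t
    marker_dev: Sequence[Tuple[float, float]]  # 64 keyline deviations (dx, dy)
    ee_pose: Sequence[float]                   # end-effector pose (layout assumed)
    gripper_width: float                       # gripper opening

    def tactile_vector(self):
        """Flatten the (dx, dy) pairs into one vector, e.g. as input to a small MLP."""
        return [v for d in self.marker_dev for v in d]


obs = StepObservation(
    wrist_frame=None,               # placeholder; a real frame would be an image array
    sensor_frame=None,
    marker_dev=[(0.0, 0.0)] * 64,   # no contact: all deviations are zero
    ee_pose=[0.0] * 7,              # hypothetical position + quaternion layout
    gripper_width=0.08,
)
print(len(obs.tactile_vector()))
```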

Visual frames are embedded using DINOv2 Vision Transformer (ViT) encoders: a ViT-Base model for the wrist camera and a ViT-Small model for the sensor camera. Tactile and proprioceptive streams are processed by small MLPs. Each token receives a modality-specific learnable embedding and a positional encoding. The concatenated sequence of wrist-image, sensor-image, marker-deviation, and proprioception tokens conditions a Transformer-based diffusion policy.

Actions consist of end-effector increments and gripper width targets. The policy is learned via conditional denoising diffusion: the forward process adds noise to action sequences, and the reverse process is modeled by a learned denoiser conditioned on the multimodal token sequence, trained with the standard loss for denoising diffusion models.
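For orientation, the generic conditional denoising-diffusion (DDPM) formulation that policies of this kind typically follow is sketched below. The notation is assumed here and not taken from the paper: $\mathbf{a}^k$ is the action sequence at diffusion step $k$, $\mathbf{c}$ is the multimodal conditioning sequence, $\beta_k$ is the noise schedule, and $\epsilon_\theta$ is the learned noise predictor.

```latex
\begin{aligned}
q(\mathbf{a}^k \mid \mathbf{a}^{k-1})
  &= \mathcal{N}\!\left(\sqrt{1-\beta_k}\,\mathbf{a}^{k-1},\ \beta_k I\right), \\
p_\theta(\mathbf{a}^{k-1} \mid \mathbf{a}^k, \mathbf{c})
  &= \mathcal{N}\!\left(\mu_\theta(\mathbf{a}^k, k, \mathbf{c}),\ \sigma_k^2 I\right), \\
\mathcal{L}(\theta)
  &= \mathbb{E}_{\mathbf{a}^0,\,k,\,\epsilon}
     \left[\left\lVert \epsilon - \epsilon_\theta\!\left(
       \sqrt{\bar\alpha_k}\,\mathbf{a}^0 + \sqrt{1-\bar\alpha_k}\,\epsilon,\ k,\ \mathbf{c}
     \right)\right\rVert_2^2\right],
  \qquad \bar\alpha_k = \prod_{s=1}^{k} (1-\beta_s).
\end{aligned}
```

The first line is the forward noising step, the second the learned reverse step, and the third the usual noise-prediction objective with $\epsilon \sim \mathcal{N}(0, I)$.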

3. Training Methodology and System Integration

TacThru-UMI demonstrations are collected with a UMI-compatible data collection rig, with time synchronization across all sensor streams and pose tracking via HTC Vive. Each manipulation task comprises 62–147 human demonstrations, and all data are stored in the Zarr format. Training uses the AdamW optimizer with one-cycle learning-rate scheduling and default weight decay. The policy is configured with separate observation windows for the wrist camera, the sensor camera, and proprioception, together with a prediction horizon and an execution chunk length. No auxiliary regularizers are used aside from weight decay, and off-the-shelf ViT encoders are used without fine-tuning for practical deployment.
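The interplay between prediction horizon and execution chunk can be illustrated with a receding-horizon loop: the policy predicts a sequence of actions, only the first few are executed, and the policy is then queried again with fresh observations. The sketch below is a generic illustration; the function name, the callback interface, and the values (`obs_window=2`, `exec_chunk=8`, a 16-step plan) are hypothetical and not the paper's settings.

```python
from collections import deque


def run_receding_horizon(policy, get_obs, send_action, n_steps,
                         obs_window=2, exec_chunk=8):
    """Execute n_steps actions, replanning after every exec_chunk actions."""
    history = deque([get_obs()], maxlen=obs_window)  # most recent observations
    executed = 0
    replans = 0
    while executed < n_steps:
        plan = policy(list(history))   # predicted action sequence (the horizon)
        if not plan:
            raise ValueError("policy returned an empty plan")
        replans += 1
        for action in plan[:exec_chunk]:  # run only the first chunk of the plan
            send_action(action)
            history.append(get_obs())
            executed += 1
            if executed == n_steps:
                break
    return replans


# Dummy stand-ins: the "policy" emits 16 consecutive integers as actions.
sent = []
dummy_policy = lambda hist: [len(sent) + i for i in range(16)]
replans = run_receding_horizon(dummy_policy, get_obs=lambda: len(sent),
                               send_action=sent.append, n_steps=20)
print(replans, sent)
```

Executing a shorter chunk than the predicted horizon trades extra inference calls for faster reaction to new tactile and visual evidence.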

4. Experimental Evaluation and Quantitative Results

TacThru-UMI was evaluated on five real-world tasks with randomized test seeds (20–24 seeds per task). Success criteria were task-specific and included basic pick-and-place (PickBottle), thin/soft object manipulation (PullTissue), visual discrimination (SortBolt), tactile discrimination (HangScissors), and multimodal fusion (InsertCap).

The table below summarizes success rates (%):

Task	TT-M	TT	GS-M	Wrist
PickBottle	97.5 ±2.1	96.3 ±3.0	95.8 ±3.5	95.0 ±3.7
PullTissue	88.0 ±4.0	60.5 ±5.2	10.0 ±4.5	12.5 ±5.0
SortBolt	90.0 ±3.3	85.0 ±4.1	45.0 ±6.1	38.0 ±5.5
HangScissors	82.5 ±4.7	80.0 ±5.0	83.3 ±4.3	35.0 ±6.2
InsertCap	90.0 ±3.2	75.0 ±5.8	40.0 ±6.5	30.0 ±7.0
Avg.	85.5 ±2.9	79.4 ±4.0	52.8 ±6.1	42.1 ±6.1

TT-M: TacThru-UMI with marker deviations; TT: ablation without marker deviations; GS-M: alternating tactile–visual with GelSight Mini; Wrist: vision-only. Paired $t$-tests confirm the superiority of simultaneous TT-M over GS-M and Wrist (p < 0.01). Ablation studies show a ∼6% drop in average success when marker deviations are removed (TT versus TT-M), and a ∼12% drop when keyline markers are replaced with solid markers.
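The ∼6% figure can be read directly off the Avg. row as a difference in percentage points. The short calculation below uses only the averages reported in the table; the variable names are illustrative.

```python
# Average success rates (%) copied from the Avg. row of the table.
avg = {"TT-M": 85.5, "TT": 79.4, "GS-M": 52.8, "Wrist": 42.1}

# Gap between the full system (TT-M) and each alternative, in percentage points.
gaps = {name: round(avg["TT-M"] - rate, 1)
        for name, rate in avg.items() if name != "TT-M"}
print(gaps)
```

On these averages, removing marker deviations costs about 6 points, while the gaps to the alternating GelSight baseline and the vision-only baseline are several times larger.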

Notably, TacThru-UMI enables robust fallback in scenarios where conventional tactile-only information is insufficient (e.g., PullTissue tasks), with the visual stream supporting success when contact forces are below sensor thresholds.

5. Baseline Analysis and Comparative Performance

TacThru-UMI outperforms baseline architectures:

Alternating tactile-visual (GS-M) struggles on tasks requiring fluid transition between modalities and fails on thin/soft object extraction (PullTissue, InsertCap fallback).
Vision-only (Wrist) exhibits poor performance on contact-driven tasks (HangScissors, InsertCap), lacking reliable indicators of forceful or occluded contacts.
Removal of marker deviations (TT) reduces task success by ∼6%, establishing their necessity for precise policy learning and execution.

This architecture leverages adaptive multimodal strategies: the Transformer-based diffusion policy can switch between vision and tactile guidance as context demands, e.g., using vision for alignment and tactile for post-occlusion adjustment in InsertCap.

6. Technical Insights and Impact

TacThru-UMI validates that true simultaneous tactile–visual perception (enabled by transparent elastomer fingertips, persistent illumination, keyline markers, and efficient tracking) with a modern diffusion-based Transformer policy yields significant advances in both basic and fine-grained robotic manipulation. The system demonstrates direct applicability of off-the-shelf, pre-trained visual encoders (DINOv2 ViT), lowering the integration barrier into vision-based pipelines.

TacThru-UMI excels in scenarios classically challenging for robotic systems: contact detection with thin/soft objects, precision insertions, and manipulation under visual occlusion. The findings suggest that the explicit fusion of tactile and visual feedback at every timestep allows robust, responsive policy execution that adapts dynamically to emergent task conditions (Li et al., 10 Dec 2025).

7. Limitations and Prospective Extensions

TacThru-UMI’s principal limitation lies in its reliance on highly controlled tactile marker tracking and synchronization, which may be sensitive to marker occlusion or degradation over prolonged use. Some tasks demonstrate reduced performance when marker features are unavailable or ambiguous. A plausible implication is that future systems might require self-calibrating or self-healing marker arrays or fusion with other contact modalities to ensure robustness.

Extending the approach to more varied manipulation challenges (e.g., deformable or fragile objects, and long-horizon contact-intensive sequences) is a proposed future direction. Further possibilities include using TacThru-UMI's architecture for on-policy reinforcement learning and for learning from demonstration at scale, broadening its application beyond the current imitation learning setting.


References

Li et al. (2025). Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation.
