
TaMeSo-bot: Tactile Memory in Soft Robotics

Updated 3 February 2026
  • The paper introduces TaMeSo-bot, which integrates a soft compliant wrist with a masked tactile trajectory transformer to enable robust contact-rich manipulation.
  • The methodology employs a retrieval-based control policy that leverages a high-frequency tactile memory database to execute safe object insertion under diverse conditions.
  • Experimental results demonstrate improved generalization with up to 85% success on unseen peg geometries and significant gains in handling physical disturbances.

Tactile Memory with Soft Robot (TaMeSo-bot) is a robotic manipulation system designed to enable robust, safe, and adaptive contact-rich manipulation in uncertain environments. The system combines a physically compliant soft wrist with a masked-encoding Transformer architecture—Masked Tactile Trajectory Transformer (MAT³)—and a retrieval-based control policy leveraging a non-parametric tactile memory database. Through this integration, TaMeSo-bot achieves high performance in object insertion tasks, particularly in scenarios with novel shapes and disturbances, by effectively storing, retrieving, and generalizing from multimodal touch-based experience (Kamijo et al., 27 Jan 2026).

1. Mechanical and Sensor Architecture

TaMeSo-bot is built on a UR5e manipulator equipped with a modular spring-based compliant wrist, a distributed tactile sensor array, and a comprehensive suite of proprioceptive and exteroceptive sensors. The soft wrist comprises three parallel coil springs (spring constant 1.448 N/mm, equilibrium length 25 mm) providing passive compliance in approximately six degrees of freedom (translational and rotational deflection in response to small forces). This compliance enables the robot to absorb misalignment and facilitates gentle contact exploration, protecting both the end-effector and the tactile sensor array during uncertain contact events.
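To make the compliance figures concrete, a back-of-envelope sketch of the wrist's axial stiffness: assuming the three coil springs share a purely axial load in parallel (a simplification; the real wrist also deflects rotationally), the effective stiffness is the sum of the per-spring constants reported above.

```python
# Back-of-envelope axial compliance of the soft wrist: three parallel
# coil springs (k = 1.448 N/mm each) sharing an axial load, so the
# effective stiffness is the sum of the individual spring constants.
K_SPRING_N_PER_MM = 1.448   # per-spring constant from the paper
N_SPRINGS = 3

def axial_deflection_mm(force_n: float) -> float:
    """Axial wrist deflection (mm) under a purely axial contact force (N)."""
    k_eff = N_SPRINGS * K_SPRING_N_PER_MM   # parallel springs add: ~4.34 N/mm
    return force_n / k_eff

# e.g. a 5 N insertion force compresses the wrist by roughly 1.15 mm
deflection = axial_deflection_mm(5.0)
print(f"{deflection:.2f} mm")
```

This illustrates why modest contact forces produce millimetre-scale deflections, enough to absorb the 2 mm hole tolerance discussed later.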

The tactile sensor, PapillArray ("Contactile"), provides a 3×3 grid of taxels at the fingertip, each outputting a 3D contact force vector (27 scalars per time step). Further sensor modalities include a 6-axis force/torque sensor at the wrist, joint encoders from the UR5e (6 joint angles), and a 6D pose estimate from an HTC VIVE Tracker mounted on the gripper. All sensor streams are sampled and synchronized at 50 Hz.

2. Data Collection and Tactile Memory Database

Data is collected via teleoperation using an HTC VIVE controller, where a human demonstrator executes key insertion tasks from four evenly spaced initial positions around a fixed hole. Each demonstration spans 150–300 steps (3–6 seconds) and is safely executed thanks to the compliance of the soft wrist, even under initial misalignment.

Trajectories $\tau = \{(s_t, a_t)\}_{t=1}^T$ are segmented into overlapping sub-trajectories of fixed length $H$ (with $H = 10$–$20$), and each window is encoded to a $d$-dimensional vector $z_t = E(s_{t-H+1:t}, a_{t-H+1:t})$ via the MAT³ encoder. The tuple $(z_t, a_t)$ is stored in a non-parametric database $D$, forming a tactile memory that links local spatiotemporal tactile context with suitable action increments.
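The windowing-and-storage step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `encode` is a placeholder for the MAT³ encoder $E$ (here just a fixed random projection), and the dimensions are toy values ($d = 32$ rather than the paper's 256).

```python
import numpy as np

H = 10  # window length (paper uses H = 10-20)

def encode(window_s: np.ndarray, window_a: np.ndarray) -> np.ndarray:
    """Stand-in for the MAT3 encoder E: flatten-and-project placeholder."""
    feat = np.concatenate([window_s.ravel(), window_a.ravel()])
    rng = np.random.default_rng(0)            # fixed projection for the sketch
    proj = rng.standard_normal((feat.size, 32))
    return feat @ proj                        # d = 32 here; paper uses d = 256

def build_tactile_memory(states: np.ndarray, actions: np.ndarray):
    """Segment one demo into overlapping length-H windows; store (z_t, a_t)."""
    keys, values = [], []
    for t in range(H - 1, len(states)):       # windows s_{t-H+1:t}, a_{t-H+1:t}
        z_t = encode(states[t - H + 1:t + 1], actions[t - H + 1:t + 1])
        keys.append(z_t)
        values.append(actions[t])             # action paired with this context
    return np.stack(keys), np.stack(values)

# Toy demo: 50 steps of 27-D tactile states (3x3 taxels x 3D force) and 6-D actions
states = np.zeros((50, 27))
actions = np.zeros((50, 6))
keys, values = build_tactile_memory(states, actions)
print(keys.shape, values.shape)
```

A 50-step demonstration yields 41 overlapping windows, each contributing one key–action pair to the memory.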

At test time, the system retrieves actions by encoding the most recent time window with the current action token masked, generating a query vector $z_q$. $z_q$ is matched to the $k$-nearest keys in $D$ (L₂ distance, HNSW graph indexing), and a corresponding stored action $a_i$ is randomly chosen and executed. This retrieval-based policy ensures the robot only executes actions similar to those proven safe and effective in prior demonstrations.
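A minimal sketch of the retrieval step, under two simplifying assumptions: brute-force L₂ search stands in for the paper's HNSW index, and the memory is a toy random array rather than encoded demonstrations.

```python
import numpy as np

def retrieve_action(z_q, keys, values, k=5, rng=None):
    """k-NN lookup in the tactile memory: find the k nearest keys to z_q
    (L2 distance; the paper uses an HNSW index, brute force shown here)
    and randomly sample one of the associated stored actions."""
    rng = rng or np.random.default_rng()
    d2 = np.sum((keys - z_q) ** 2, axis=1)     # squared L2 distance to every key
    nearest = np.argsort(d2)[:k]               # indices of the k nearest keys
    choice = rng.choice(nearest)               # random pick among the neighbours
    return values[choice]

# Toy memory: 100 keys in R^8, each paired with a 6-D action increment
rng = np.random.default_rng(1)
keys = rng.standard_normal((100, 8))
values = rng.standard_normal((100, 6))
a = retrieve_action(keys[7], keys, values, k=1)   # query equal to a stored key
print(np.allclose(a, values[7]))                  # exact match returns its action
```

The random choice among the $k$ neighbours injects mild stochasticity while keeping every candidate action one that was demonstrated.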

3. Masked Tactile Trajectory Transformer (MAT³)

MAT³ is a Transformer-based encoder that processes the multimodal, spatiotemporal sequence of tactile, proprioceptive, and force/torque signals, alongside robot action sequences. At each timestep, the input includes 9 tactile taxel readings ($S_t^{tac} \in \mathbb{R}^{3 \times 9}$), the robot action ($a_t \in \mathbb{R}^6$), the 6-axis force/torque vector, arm joint angles, and gripper pose.

Input tokens per timestep consist of 10 "base" tokens (9 taxel tokens plus the action token). Each base token is enriched by a learned, weighted sum of auxiliary modalities and a fixed spatial embedding $y^i \in \mathbb{R}^{d_{\mathrm{pos}}}$ for the taxel grid location, with temporal position encoded via sinusoidal embeddings. The overall input embedding per token $x_t^i$ (Eq. 1) is:

$$x_t^i = \left[\, b_t^i + w^{ft} e_t^{ft} + w^{arm} e_t^{arm} + w^{grip} e_t^{grip} + w^{time} e_t^{time} \,\right] \oplus y^i,$$

where $b_t^i$ is a learnable projection of the sensory or action input, the $w^{\cdot}$ are learned modality-fusion weights, and $\oplus$ denotes concatenation.
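Eq. (1) can be sketched numerically as below. This is an illustrative shape check, not the model code: the fusion weights are fixed constants here (they are learned in the paper), and the 248 + 8 split follows the dimensions reported in the experimental setup.

```python
import numpy as np

D, D_POS = 248, 8   # content and spatial-embedding dims (248 + 8 = 256 total)

def embed_token(b, e_ft, e_arm, e_grip, e_time, w, y):
    """Eq. (1): weighted fusion of auxiliary modalities into the base token,
    then concatenation with the fixed spatial embedding y^i for the taxel."""
    fused = (b + w["ft"] * e_ft + w["arm"] * e_arm
               + w["grip"] * e_grip + w["time"] * e_time)
    return np.concatenate([fused, y])

rng = np.random.default_rng(0)
b = rng.standard_normal(D)                               # base token b_t^i
e_ft, e_arm, e_grip, e_time = (rng.standard_normal(D) for _ in range(4))
w = {"ft": 0.1, "arm": 0.2, "grip": 0.3, "time": 0.4}    # learned in practice
y = rng.standard_normal(D_POS)                           # fixed spatial embedding
x = embed_token(b, e_ft, e_arm, e_grip, e_time, w, y)
print(x.shape)
```

Concatenating rather than summing the spatial embedding keeps the taxel-location signal linearly separable from the fused content.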

The Transformer encoder includes four layers, each with eight attention heads and a hidden dimension of 512. The full input spans $10 \times H$ tokens with a total embedding dimension of 256. During training, a random subset of the tokens (ratio $r \sim \mathrm{Uniform}[0, 0.6]$) is masked, and MAT³ is optimized (using mean-squared error) to jointly reconstruct masked sensory and action tokens:

$$\begin{align*} L &= L_{tactile} + L_{action}\\ L_{tactile} &= \frac{1}{|\mathcal{B}|} \sum_{(t,i) \in \mathcal{M} \cap \text{tac}} \left\|\hat{s}_t^i - s_t^i\right\|^2\\ L_{action} &= \frac{1}{|\mathcal{B}|} \sum_{t \in \mathcal{M} \cap \text{act}} \left\|\hat{a}_t - a_t\right\|^2 \end{align*}$$

where $\mathcal{M}$ is the set of masked tokens.
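The masking-and-reconstruction objective can be sketched as follows. This is a minimal stand-alone illustration: the "predictions" are a dummy array rather than Transformer outputs, and the loss averages over masked tokens only, as in the objective above.

```python
import numpy as np

def masked_reconstruction_loss(tokens, predictions, mask):
    """MSE restricted to masked tokens, mirroring MAT3's training objective.
    tokens / predictions: (N, d) arrays; mask: boolean (N,) array."""
    if not mask.any():
        return 0.0                                   # nothing masked this batch
    err = predictions[mask] - tokens[mask]
    return float(np.mean(np.sum(err ** 2, axis=1)))  # mean squared L2 error

rng = np.random.default_rng(0)
tokens = rng.standard_normal((100, 16))   # toy stand-in for 10 x H token grid
r = rng.uniform(0.0, 0.6)                 # mask ratio r ~ Uniform[0, 0.6]
mask = rng.random(100) < r                # Bernoulli(r) mask over tokens
loss = masked_reconstruction_loss(tokens, np.zeros_like(tokens), mask)
print(f"masked MSE: {loss:.3f}")
```

Because only masked positions contribute, the encoder must infer them from the surrounding spatiotemporal context, which is what forces it to learn cross-taxel and cross-time structure.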

The final local context vector $z_t$ used for retrieval is obtained by average-pooling the Transformer outputs:

$$z_t = \frac{1}{10H}\sum_{i=1}^{10}\sum_{\tau=t-H+1}^{t} \mathrm{TransformerOutput}_\tau^i$$

This approach enables automatic extraction of spatially and temporally structured features relevant to contact-rich tasks, in contrast to single-step attention or context-agnostic architectures.
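The pooling step above reduces to a single mean over the token grid; a one-line sketch, assuming outputs are arranged as an `(H, 10, d)` array:

```python
import numpy as np

def pool_context(outputs: np.ndarray) -> np.ndarray:
    """Average-pool Transformer outputs over all 10*H tokens to get z_t.
    outputs: (H, 10, d) array of per-token output embeddings."""
    return outputs.mean(axis=(0, 1))

H, d = 10, 256
z = pool_context(np.ones((H, 10, d)))   # constant inputs pool to constants
print(z.shape)
```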

4. Memory-Based Control Policy and Adaptation

The online control loop operates at 50 Hz, maintaining a moving window of the past $H$ timesteps' tokens with the current action masked. MAT³ encodes this context to produce a query vector $z_q$, which is used to seek similar experiences in the database $D$ via approximate nearest-neighbor search. The retrieved key–action pairs correspond to successful prior demonstrations; the system randomly samples one action and outputs the corresponding Cartesian displacement increment to the UR5e controller.
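The loop structure can be sketched as below. All four callables (`read_sensors`, `encode_masked`, `retrieve`, `send_increment`) are hypothetical placeholders for the robot-specific pieces; the skeleton only shows the windowing, masking-by-query, and rate-limiting logic.

```python
import time
from collections import deque

H = 10            # context window length
PERIOD = 1 / 50   # 50 Hz control loop -> 20 ms per cycle

def control_loop(read_sensors, encode_masked, retrieve, send_increment,
                 steps=300):
    """Online loop sketch: keep the last H timesteps, encode a query with
    the current action masked, retrieve a stored action from memory D, and
    command the arm with the corresponding Cartesian increment."""
    window = deque(maxlen=H)                   # rolling H-step token window
    for _ in range(steps):
        t0 = time.monotonic()
        window.append(read_sensors())          # newest sensor/action tokens
        if len(window) == H:                   # retrieval needs a full window
            z_q = encode_masked(list(window))  # current action token masked
            a = retrieve(z_q)                  # k-NN lookup in memory D
            send_increment(a)                  # Cartesian displacement command
        # sleep off the remainder of the 20 ms control period
        time.sleep(max(0.0, PERIOD - (time.monotonic() - t0)))

# Stub run: counts how often retrieval fires over 2*H cycles
commands = []
control_loop(lambda: 0, lambda w: None, lambda z: [0.0] * 6,
             commands.append, steps=2 * H)
print(len(commands))   # fires on every cycle once the window is full
```

Note the loop issues no command until the window holds $H$ entries, matching the requirement that retrieval conditions on a full local history.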

This policy enables adaptive, history-conditioned trajectory following without explicit subtask segmentation or online reinforcement learning. Restricting candidate actions to those stored from demonstrations enforces safety, while the soft wrist mechanically absorbs small contact errors arising from context mismatches. The combination of memory-based recall and mechanical compliance yields robust performance when faced with novel geometries or environmental uncertainty.

5. Experimental Validation and Benchmarking

Real-robot evaluations were conducted on peg-in-hole tasks involving seven peg geometries (two for training: square and 40 mm cylinder; five for testing: 30 mm cylinder, rectangle, oval, hexagon, pentagon), with a 2 mm hole tolerance. Additional test cases included increased friction (via rubber tape) and a 5° tilt in grasp orientation.

Comparative baselines were: a Tactile Transformer model (no spatial embedding/masking) adapted from prior work, and an ablation of MAT³ without input masking. Parameters aligned with MAT³: four layers, eight heads, hidden dimension 512, token embedding 248 + 8 positional = 256, mask ratio 0–0.6, batch size 128, 30 epochs.

Performance metrics demonstrate that MAT³ substantially surpasses the baselines, particularly in generalization to unseen conditions. The quantitative results, as reported in the paper, are as follows:

Success Rates by Peg Type

| Shape | Tactile Transformer | MAT³ w/o Mask | MAT³ |
|---|---|---|---|
| Seen (totals) | 22.5% (18/80) | 90% (72/80) | 88.8% (71/80) |
| Unseen (totals) | 17.5% (35/200) | 71% (142/200) | 85% (170/200) |

Success Rates under Unseen Conditions

| Condition | Tactile Transformer | MAT³ w/o Mask | MAT³ |
|---|---|---|---|
| Unseen Starting Poses | 5% (2/40) | 22.5% (9/40) | 50% (20/40) |
| Increased Friction | 5% (2/40) | 17.5% (7/40) | 50% (20/40) |
| 5° Tilted Grasp | 12.5% (5/40) | 47.5% (19/40) | 72.5% (29/40) |
| Overall | 7.5% (9/120) | 29.2% (35/120) | 57.5% (69/120) |

Key observations include the following: the same spatiotemporal Transformer, even without masking, greatly outperforms the single-step, cross-modal attention baseline. Masked modeling further improves generalization, yielding +14 percentage points on unseen pegs and +28 points in perturbed conditions over MAT³ w/o Mask. Even in familiar conditions, MAT³ maintains a success rate close to 90% without subtask segmentation or online adaptation.

6. Analysis of Results and Case Studies

Anecdotal analysis underscores MAT³’s adaptation capability. On the 30 mm cylinder—a test case with novel dimensions—the system rapidly retrieves locally relevant align–insert demonstrations, effecting successful insertion despite the unseen geometry. Under tilted grasps (5° offset), the compliance of the soft wrist combined with tactile memory retrieval often produces corrective roll movements, enabling a 72.5% completion rate, in contrast to jamming or slipping in baseline methods. These outcomes highlight the synergy of learned masked representations with mechanical robustness in handling variance and failure modes not seen during training.

7. Core Contributions and Significance

TaMeSo-bot demonstrates a practical method for embedding and retrieving tactile experience in contact-rich manipulation, with three principal innovations:

  1. Integration of a 6-DOF soft wrist with distributed tactile sensing enables safe, compliant exploration and manipulation under pose and contact uncertainty.
  2. The Masked Tactile Trajectory Transformer (MAT³) architecture jointly models spatial, temporal, and cross-modal information, autonomously extracting task-relevant features via masked prediction—without the need for explicit stage labels or hand-crafted subtask segmentation.
  3. A lightweight memory-retrieval policy leverages the MAT³ representations to generalize monolithic demonstrations across previously unseen shapes and environmental perturbations.

Empirically, TaMeSo-bot achieves approximately 85% success on out-of-distribution peg geometries and 58% success under combined physical disturbances, greatly exceeding prior transformer-based and ablated approaches (Kamijo et al., 27 Jan 2026). This suggests a promising pathway for robust, sample-efficient, and general manipulation policy design through the fusion of physical compliance and masked memory-augmented representation learning.
