Tactile Token Representations

Updated 19 March 2026

Tactile token representations are discrete or continuous encodings derived from high-dimensional sensor data that facilitate tactile perception, memory, and control in robotic systems.
They employ varied tokenization strategies—such as spatial grids, temporal vectors, graphs, and learned embeddings—to capture contact-rich features and support safety-critical reasoning.
Integrating tactile tokens into multimodal architectures like transformers and neuromorphic models enhances cross-modal retrieval, robust manipulation, and invariant texture recognition.

Tactile token representations are discrete or continuous vector encodings derived from high-dimensional tactile sensor data—such as spatially distributed taxel readings, force-torque arrays, or contact images—designed for compatibility with contemporary machine learning architectures. They serve as the substrate for perception, memory, control, and cross-modal integration in robotic manipulation, tactile understanding, and multimodal learning systems. Tactile tokens can be structured (as grids, vectors, graphs, or spike trains) or unstructured (as learned embeddings); they may be task-agnostic or distilled to capture contact-rich, safety-critical tactile reasoning. Recent advances have established tactile tokens as a central abstraction enabling both explicit hardware coupling (direct touch) and implicit tactile-aware reasoning in architectures operating without live haptic feedback.

1. Tokenization Strategies: From Raw Sensors to Structured Representations

Tokenization transforms high-dimensional tactile data into discrete or continuous representations suitable for downstream computation. Albini et al. (12 Oct 2025) identify the following canonical token structures:

Temporal Vectors: Each taxel signal $\mathbf{s}_k(t)$ over a window forms a 1D token; feature-based methods apply statistics, derivatives, or frequency transforms for slip or contact event detection.
Spatial Grids (Images): Taxel values are mapped to $H\times W$ image grids, optionally projected or interpolated if the sensor layout is non-planar; used for convolutional neural networks as in T-Dex (Guzey et al., 2023).
Point Clouds: When taxel locations are calibrated, tokens may be (3D position, value) pairs, allowing geometric feature extraction and spatial reasoning.
Graphs/Meshes: Taxels are nodes, with edges encoding spatial adjacency. Graph neural network methods require explicit mesh-structured tokens for relational reasoning.
Voxels/Volumes: 3D embedding of local contact data, binning taxel readings into a volumetric grid—suitable for complex shape or compliance estimation.
Learned Embeddings: Autoencoders or self-organizing maps compress tactile input into lower-dimensional codes for tasks such as unsupervised clustering or cross-robot transfer.
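
As a minimal sketch of the feature-based temporal tokens in the list above, the following Python summarises one taxel's signal window as a small feature token. The function name, the choice of features (mean, standard deviation, peak absolute first difference), and the sample traces are illustrative assumptions, not taken from any of the cited works.

```python
import statistics

def temporal_token(window):
    """Summarise one taxel's signal window as [mean, std, peak |first difference|]."""
    # First differences approximate the time derivative of the signal.
    diffs = [b - a for a, b in zip(window, window[1:])]
    return [
        statistics.fmean(window),
        statistics.pstdev(window),
        max(abs(d) for d in diffs),
    ]

# Hypothetical pressure traces: one steady, one with an abrupt drop.
steady = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
slipping = [1.0, 1.0, 1.0, 0.2, 0.9, 0.3]
token_steady = temporal_token(steady)
token_slipping = temporal_token(slipping)
```

The window must contain at least two samples. A large third component is the kind of cue a slip or contact-event detector might threshold, although the right features and thresholds depend on the sensor.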

The transformation $T = W\mathbf{s}$, with $W \in \mathbb{R}^{m \times n}$, is a general schema for aggregating $n$ taxel readings into $m$ tokens per time step. The structure of $W$ (sparse, block, learned) and post-processing (pooling, normalization, masking) are adapted to both sensor and task.
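
One way to make the schema $T = W\mathbf{s}$ concrete is a block-structured $W$ that averages groups of neighbouring taxels. The sketch below assumes a hypothetical sensor with $n = 8$ taxels pooled into $m = 2$ tokens; the helper names and the readings are invented for illustration.

```python
def block_pooling_matrix(n_taxels, block):
    """Build W (m x n) whose rows average consecutive blocks of `block` taxels."""
    if n_taxels % block != 0:
        raise ValueError("n_taxels must be divisible by block")
    m = n_taxels // block
    return [
        [1.0 / block if row * block <= col < (row + 1) * block else 0.0
         for col in range(n_taxels)]
        for row in range(m)
    ]

def tokenize(W, s):
    """Compute T = W s for one time step."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]

s = [0.0, 2.0, 4.0, 6.0, 1.0, 1.0, 3.0, 5.0]   # n = 8 hypothetical taxel readings
W = block_pooling_matrix(n_taxels=8, block=4)  # m = 2 tokens
T = tokenize(W, s)                             # one averaged token per block
```

A learned $W$ would replace the fixed averaging weights with trainable parameters. The sparse block pattern shown here corresponds to simple spatial pooling.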

2. Learned Token Embeddings: Supervised and Self-Supervised Paradigms

Recent models focus on learning tactile tokens as compact, information-rich embeddings. The T-Dex pipeline (Guzey et al., 2023) employs a convolutional architecture (AlexNet through FC7, yielding 4096-dimensional tokens) trained via self-supervised Bootstrap Your Own Latent (BYOL) losses. Tactile readings from the uSkin pads (15 × 4 × 4 × 3 = 720 channels) are reshaped into images, up-sampled, augmented, and encoded; the resulting tokens capture spatial contact patterns, slip, and micro-force cues.
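
For readers unfamiliar with BYOL, its regression objective can be written as $2 - 2\cos(q, z')$, where $q$ is the online network's prediction for one augmented view and $z'$ is the target network's projection of another; the target parameters follow the online ones by exponential moving average. The sketch below shows only these two ingredients on toy vectors. It is not the T-Dex training code, and the vectors and the value of `tau` are placeholders.

```python
import math

def byol_loss(online_prediction, target_projection):
    """2 - 2 * cosine similarity, i.e. squared distance between unit vectors."""
    dot = sum(p * z for p, z in zip(online_prediction, target_projection))
    norm_p = math.sqrt(sum(p * p for p in online_prediction))
    norm_z = math.sqrt(sum(z * z for z in target_projection))
    return 2.0 - 2.0 * dot / (norm_p * norm_z)

def ema_update(target_params, online_params, tau=0.99):
    """Move target-network parameters slowly towards the online network."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

aligned = byol_loss([1.0, 0.0], [2.0, 0.0])      # same direction: minimal loss
orthogonal = byol_loss([1.0, 0.0], [0.0, 3.0])   # unrelated directions: larger loss
new_target = ema_update([0.0, 1.0], [1.0, 1.0], tau=0.9)
```

Because the loss depends only on direction, two augmented views of the same tactile image are pulled towards the same embedding regardless of scale.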

In UniTouch (Yang et al., 2024), patch tokens from a 24-layer ViT ($C=1024$) are prepended with learnable, sensor-specific prefix tokens ($L=5$ per sensor), forming input sequences $[T_s; P(x_t)]$. These are processed into global tactile embeddings $E_t\in\mathbb{R}^{1024}$, which are aligned to pretrained vision-language embedding spaces (CLIP/ImageBind) via symmetric InfoNCE contrastive loss. This approach enables zero-shot cross-modal retrieval, question answering, and tactile-to-image or tactile-to-language generation, and supports simultaneous multi-sensor alignment.
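
A symmetric InfoNCE loss of the kind named above can be sketched as follows: embeddings are L2-normalised, pairwise similarities are divided by a temperature, and cross-entropy is averaged over both retrieval directions (touch to other modality, and back). The temperature of 0.07 and the two-dimensional toy embeddings are assumptions for illustration, not values reported for UniTouch.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_info_nce(touch, other, temperature=0.07):
    """Average of touch-to-other and other-to-touch contrastive losses."""
    touch = [_normalize(v) for v in touch]
    other = [_normalize(v) for v in other]
    logits = [[sum(a * b for a, b in zip(t, o)) / temperature for o in other]
              for t in touch]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))

# Row i of each batch is assumed to be a matching (touch, image) pair.
matched = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each tactile embedding is closest to its own partner and large when the pairing is swapped, which is what drives the alignment to a pretrained embedding space.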

HapticVLA (Gubernatorov et al., 16 Mar 2026) introduces a 128-dimensional tactile token $f\in\mathbb{R}^{128}$, encapsulating gripper contact features such as force, pressure peak, slip, and center-of-pressure shifts via a dual-tactile encoder. This token is used in a teacher-student distillation regime: the teacher’s tactile states (proprioception + $f$) inform a vision-language-action transformer, after which a student without access to tactile signals predicts action flows implicitly incorporating tactile reasoning.
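
The teacher-student idea can be illustrated with a deliberately simplified toy: a teacher that uses a tactile slip reading to correct its action, and a student that sees only proprioception and a visual feature and is trained to match the teacher's actions. Everything here is hypothetical (linear policies, a scalar visual feature that happens to correlate with slip, plain mean-squared-error matching); HapticVLA itself uses a VLA transformer with flow-matching objectives, which this sketch does not reproduce.

```python
def teacher_action(proprio, tactile_token, gain=0.5):
    """Hypothetical teacher: reduce the command in proportion to sensed slip."""
    slip = tactile_token[0]
    return [p - gain * slip for p in proprio]

def student_action(proprio, vision_feature, weight):
    """Hypothetical student: no tactile input, only vision and proprioception."""
    return [p + weight * vision_feature for p in proprio]

def train_student(dataset, steps=50, lr=0.1):
    """Gradient descent on the MSE between student and teacher actions."""
    weight = 0.0
    for _ in range(steps):
        grad = 0.0
        for proprio, vision, tactile in dataset:
            target = teacher_action(proprio, tactile)
            pred = student_action(proprio, vision, weight)
            # Derivative of this sample's mean squared error with respect to weight.
            grad += sum(2.0 * (p - t) * vision for p, t in zip(pred, target)) / len(target)
        weight -= lr * grad / len(dataset)
    return weight

# Toy demonstrations: (proprioception, visual feature, tactile token).
dataset = [
    ([0.2, 0.4], 1.0, [1.0]),
    ([0.1, 0.3], 2.0, [2.0]),
]
learned_weight = train_student(dataset)
```

In this toy the student recovers the teacher's tactile correction purely from the visual feature, which is the sense in which the distilled policy reasons about contact implicitly.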

3. Architecture and Fusion: Transformers, Masked Modeling, and Multimodal Integration

Tactile tokens are integrated into a variety of backbone architectures:

Transformers with Multi-Modal Fusion: In TacVLA (Zhang et al., 13 Mar 2026), 15×8 taxel pressure maps are flattened and embedded through a two-layer MLP into 36 tactile tokens that match the backbone dimensionality (e.g., $d=768$). The tokens receive positional encodings and are concatenated with vision and language tokens for cross-modal attention in a frozen Vision-LLM (e.g., SigLIP-based, 24 layers). Contact-aware gating ensures that tactile tokens are present only during meaningful contact phases, which empirically avoids confusion in the non-contact regime.
Masked Modeling for Spatio-Temporal Understanding: The MAT$^3$ architecture (Kamijo et al., 27 Jan 2026) encodes distributed 3D tactile arrays, actions, and auxiliary state streams as separate per-timestep tokens, each with spatial and temporal positional codes. Masked predictions reconstruct missing taxel readings, actions, or auxiliary states from context, encouraging robust, distributed feature learning. The final tactile representation is pooled over tokens and used for retrieval-based control.
Tactile Distillation and Action Matching: HapticVLA (Gubernatorov et al., 16 Mar 2026) distills the action behavior of a tactile-aware teacher policy into a student policy that operates solely from vision and proprioception. This approach does not explicitly regress the internal tactile token but ensures the downstream action space captures contact-based corrections and safety constraints.
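
Contact-aware gating, as described for TacVLA in the list above, can be sketched as a conditional concatenation: tactile tokens join the vision and language tokens only while the sensor reports contact. The gating criterion used here (peak pressure above a fixed threshold), the threshold value, and the placeholder token contents are assumptions, since the exact rule is not given in this article.

```python
def gate_tactile_tokens(vision_tokens, language_tokens, tactile_tokens,
                        pressure_map, contact_threshold=0.1):
    """Append tactile tokens only when the pressure map indicates contact."""
    in_contact = max(max(row) for row in pressure_map) > contact_threshold
    sequence = list(vision_tokens) + list(language_tokens)
    if in_contact:
        sequence += list(tactile_tokens)
    return sequence, in_contact

# Hypothetical 15 x 8 pressure maps: one in free space, one with a contact patch.
free_space = [[0.0] * 8 for _ in range(15)]
touching = [row[:] for row in free_space]
touching[7][3] = 0.6

vision = [[0.1] * 4, [0.2] * 4]            # placeholder token vectors
language = [[0.3] * 4]
tactile = [[0.0] * 4 for _ in range(36)]   # 36 tactile tokens, as in the text

seq_free, contact_free = gate_tactile_tokens(vision, language, tactile, free_space)
seq_touch, contact_touch = gate_tactile_tokens(vision, language, tactile, touching)
```

Dropping the tokens entirely, rather than feeding zeros, means the backbone never attends to uninformative tactile input during free-space motion.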

Model	Token Dim. / Type	Integration/Fusion	Losses/Objectives
T-Dex	$4096$ (AlexNet FC7)	Nearest-neighbor retrieval w/ visual tokens	BYOL self-supervised
UniTouch	$1024$ (ViT global)	Prepend sensor-prefix; ViT encoder; cross-modal alignment	Symmetric InfoNCE contrastive
TacVLA	36 × $d$ per timestep	Contact-aware gated concat; Transformer fusion	Flow-matching (MSE)
MAT$^3$	10 × 256 per step	Hard/soft fusion; Masked modeling; spatiotemporal context	Masked prediction (MSE)
HapticVLA	$128$ (dual-encoder)	State projection; action-flow distillation	Reward-weighted flow match, distillation
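
The masked-prediction (MSE) objective listed for MAT$^3$ in the table above can be illustrated generically: a random subset of tokens is hidden, a model reconstructs them from context, and the error is measured on the hidden positions. The mask ratio, the zero mask value, the restriction of the loss to masked positions, and the toy tokens are all assumptions made for this sketch, which contains no actual model.

```python
import random

def mask_tokens(tokens, mask_ratio, rng, mask_value=0.0):
    """Replace a random subset of tokens with a mask value; return copy and indices."""
    n_masked = max(1, round(mask_ratio * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = [list(tok) for tok in tokens]
    for i in masked_idx:
        masked[i] = [mask_value] * len(tokens[i])
    return masked, masked_idx

def masked_mse(prediction, target, masked_idx):
    """Mean squared error computed only over the masked positions."""
    errors = [(p - t) ** 2
              for i in masked_idx
              for p, t in zip(prediction[i], target[i])]
    return sum(errors) / len(errors)

rng = random.Random(0)                               # seeded for repeatability
tokens = [[float(i + 1), 0.5] for i in range(10)]    # 10 toy tokens per timestep
masked, idx = mask_tokens(tokens, mask_ratio=0.3, rng=rng)

perfect = masked_mse(tokens, tokens, idx)   # an ideal reconstruction
naive = masked_mse(masked, tokens, idx)     # echoing the masked input back
```

A trained model would sit between `masked` and the loss; the two reference values bracket what it could achieve, from perfect reconstruction down to simply returning the masked input.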

4. Neuromorphic and Invariant Representations

Biologically inspired schemes, such as those introduced by Iskarous et al. (2024), encode tactile data in spike-time representations to improve invariance and robustness. The processing pipeline includes:

Force-Invariance Module: Per-taxel scaling normalizes analog sensor data so that slowly-adapting (SA) neuron firing rates are invariant to contact force. Empirical coefficients $C_{t,i,j,k}$ are determined to align spike rates across force conditions.
Spiking Activity Encoding: Each taxel drives Izhikevich model neurons (SA and RA types), mapping force signals to spike trains through membrane potential dynamics.
Speed-Invariance Module: Post-encoding, spike times are rescaled according to the observed scan velocity, yielding feature vectors invariant to both speed and force conditions.
Feature Extraction and Classification: Spike-train-based features are concatenated, projected via PCA, and classified with linear discriminants.

This methodology yields texture representations with high performance in both offline and real-time robotic systems.
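
Two of the stages above lend themselves to a short sketch: spike encoding with an Izhikevich neuron and time rescaling for speed invariance. The code uses the standard Izhikevich equations with the commonly quoted regular-spiking parameters ($a=0.02$, $b=0.2$, $c=-65$, $d=8$) and simple Euler integration. The parameters used for the SA and RA neurons in the cited work, the input scaling, and the exact form of the speed normalisation are not given in this article, so the choices below are assumptions.

```python
def izhikevich_spike_times(current, dt=0.5, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Euler-integrate an Izhikevich neuron; return spike times in ms."""
    v, u = c, b * c                      # membrane potential and recovery variable
    spikes = []
    for step, i_in in enumerate(current):
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + i_in)
        u += dt * a * (b * v - u)
        if v >= 30.0:                    # spike: record the time, then reset
            spikes.append(step * dt)
            v, u = c, u + d
    return spikes

def rescale_spike_times(spike_times, scan_speed, reference_speed):
    """Map spike times to what they would be at the reference scan speed."""
    return [t * scan_speed / reference_speed for t in spike_times]

# 1000 ms of constant input, sampled every 0.5 ms (hypothetical taxel drive).
strong = izhikevich_spike_times([10.0] * 2000)
silent = izhikevich_spike_times([0.0] * 2000)

# A texture scanned at twice the reference speed is stretched back in time.
normalised = rescale_spike_times([10.0, 20.0], scan_speed=2.0, reference_speed=1.0)
```

After rescaling, spike timing reflects position along the surface rather than elapsed time, which is why features computed from it can be compared across scan speeds.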

5. Empirical Findings and Design Considerations

Ablation and benchmarking studies converge on several principles:

Token Structure and Dimensionality: Rich spatial structure (proper gridding, positional embedding) and moderate dimensionality (36–128 tokens/timestep) improve dexterous manipulation compared to naïve flattening or overly large token sets (Zhang et al., 13 Mar 2026, Kamijo et al., 27 Jan 2026, Guzey et al., 2023).
Contact-Aware Gating: Active suppression of tactile tokens outside periods of contact sharply improves task success rates and prevents detrimental cross-modal interference (Zhang et al., 13 Mar 2026).
Distillation of Tactile Reasoning: Implicitly learning tactile-aware action corrections via distillation, without explicit tactile inputs at inference, provides performance and safety boosts even over baselines with direct tactile feedback (Gubernatorov et al., 16 Mar 2026).
Regularization and Robustness: Masked modeling, self-supervised pretraining, and contrastive objectives encourage tokens to encode robust, transferable touch representations, supporting zero-shot and cross-modal generalization (Yang et al., 2024, Guzey et al., 2023).
Hardware–Representation Coupling: The optimal tokenization strategy and data structure (vector, grid, graph, embedding) are dictated by underlying sensor layout, compliance properties, and the needs of control or perceptual subsystems (Albini et al., 12 Oct 2025).

6. Applications and Future Directions

Tactile tokens underpin diverse applications including:

Robust Manipulation under Occlusion: Integration into VLA frameworks facilitates manipulation in visually ambiguous or contact-rich scenarios (Zhang et al., 13 Mar 2026, Gubernatorov et al., 16 Mar 2026).
Multimodal Retrieval and Generation: Learned embeddings permit cross-modal matching (touch-to-image, touch-to-language, touch-to-sound) and conditional generation tasks (Yang et al., 2024).
Tactile Memory and Imitation: Spatiotemporal tokenization supports retrieval-based object insertion, key turning, and robust insertion under uncertainty (Kamijo et al., 27 Jan 2026, Guzey et al., 2023).
Invariant Texture Recognition: Neuromorphic pipelines deliver robustness to variable scanning conditions, supporting prosthesis feedback and robust surface identification (Iskarous et al., 2024).
Design Guidelines: Representation flexibility is essential; multiple token structures may be maintained in parallel for different stages of a task pipeline (Albini et al., 12 Oct 2025).

A plausible implication is the further expansion of tactile tokenization methods into areas such as whole-body robotic tactile skins, semantic touch-language associations, and closed-loop tactile feedback for prosthetics and human-robot interaction.

7. Summary Table: Tactile Tokenization Schemes

Approach	Tokenization Pipeline	Core Modality/Architecture
T-Dex	Tactile image → AlexNet → 4096D/256D token	Self-supervised, Nearest-neighbor
HapticVLA	Dual-Pad maps → Encoder → 128D token, state-projected	VLA transformer action distillation
UniTouch	Patch + sensor tokens → ViT → 1024D global embedding	Multi-modal, Contrastive, ViT
TacVLA	Grid flatten → MLP → 36 tokens (+pos.) → Contact gate	Cross-modal transformer
MAT$^3$	Taxel + state → linear → token (+pos.) → Masked Transform	Spatiotemporal transformer
Neuromorphic	Analog→Spike (Izhikevich)→Feature (PCA)	Spiking features, LDA
Classical (review)	Vectors, grids, point clouds, meshes, maps, embeddings	Feature/classification/control

This survey demonstrates that tactile token representations are foundational abstractions facilitating the fusion, transfer, and deployment of tactile information across learning, reasoning, and control scenarios. Their careful design—matched to task, architecture, and hardware—enables robust, generalizable manipulation and perceptual intelligence in complex physical environments (Gubernatorov et al., 16 Mar 2026, Zhang et al., 13 Mar 2026, Kamijo et al., 27 Jan 2026, Yang et al., 2024, Albini et al., 12 Oct 2025, Guzey et al., 2023, Iskarous et al., 2024).