Heterogeneous Tactile Transformer (HTT)

Updated 4 July 2026

Heterogeneous Tactile Transformer (HTT) is a framework that maps diverse tactile sensor outputs into a shared latent space via modality-specific encoders and a common Transformer trunk.
HTT employs paired cross-sensor pretraining and masked autoencoding to align tactile representations, improving performance on tasks like object classification and slip detection.
The approach enhances tactile perception and manipulation by learning sensor-agnostic features, enabling robust transfer across heterogeneous sensors.

Heterogeneous Tactile Transformer (HTT) denotes a Transformer-centered approach to tactile representation learning in which heterogeneous tactile observations are mapped by sensor-appropriate front ends into a shared latent space and then processed by a common Transformer backbone. In the explicit sense, HTT is the framework introduced in “Heterogeneous Tactile Transformer,” which combines sensor-specific encoders, a shared transformer trunk, and paired cross-sensor pretraining over vision- and array-based tactile data (Bi et al., 29 Jun 2026). In a broader technical sense, the term also describes a family of architectures that address tactile heterogeneity by learning invariances in latent space rather than by forcing raw-signal alignment, as seen in sensor-invariant optical tactile learning and multi-sensor multi-task tactile pretraining (Gupta et al., 27 Feb 2025, Zhao et al., 2024).

1. Heterogeneity as the central problem

HTT is motivated by the fact that tactile sensors are intrinsically heterogeneous. Optical or vision-based tactile sensors produce image-like observations of elastomer deformation and marker motion, with rich spatial geometry but sensor-dependent optics, illumination, elastomer properties, and packaging. Array-based tactile sensors instead produce taxel or force/pressure time series with different spatial resolution, temporal bandwidth, and signal statistics. Even within a nominal sensor family, manufacturing variation and calibration differences can induce substantial domain shift, so that the same physical contact may yield markedly different raw observations across devices (Bi et al., 29 Jun 2026, Gupta et al., 27 Feb 2025).

This heterogeneity is not limited to hardware alone. The broader tactile literature also emphasizes heterogeneity across form factors, marker versus markerless gels, camera placement and number, RGB versus monochrome illumination, and task distributions ranging from single-frame classification to temporal slip detection and long-horizon policy conditioning. These factors make raw-image alignment ill-defined and undermine the reuse of per-sensor encoders or task heads across datasets and platforms (Zhao et al., 2024).

Within this setting, HTT formalizes a sensor-agnostic learning problem: retain contact-relevant structure while suppressing modality- or sensor-specific nuisance factors. In the optical setting, Sensor-Invariant Tactile Representation (SITR) explicitly frames the goal as learning a latent $z$ that preserves information needed for downstream tasks while being invariant to the sensor domain $d$; conceptually, an HTT seeks the same outcome across broader tactile families (Gupta et al., 27 Feb 2025).

2. Core architectural pattern

The canonical HTT architecture is organized around modality-specific encoders, a shared Transformer trunk, and task- or pretraining-specific output modules. In the 2026 HTT formulation, each tactile sensor $i \in \mathcal{I}$ is assigned a dedicated encoder $\mathcal{E}_i$ and decoder $\mathcal{D}_i$. Optical sensors use MAE-style ViT encoders, array sensors use self-attention transformer encoders, and all encoder outputs are projected into a shared latent dimension $D = 192$ with 3 attention heads before entering a shared trunk $\mathcal{T}$ of depth 9. HTT does not use a $[\mathrm{CLS}]$ token; the trunk preserves the full token sequence, which downstream heads or policies consume directly (Bi et al., 29 Jun 2026).
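The hyperparameters above can be collected into a small configuration sketch. The class, field, and registry names are hypothetical, and the per-head width is derived under the usual assumption that attention heads split the latent dimension evenly; the source states only $D = 192$, 3 heads, and depth 9.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrunkConfig:
    """Shared-trunk hyperparameters quoted in the text (names are hypothetical)."""
    latent_dim: int = 192        # shared latent dimension D
    num_heads: int = 3           # attention heads
    depth: int = 9               # number of trunk blocks
    use_cls_token: bool = False  # the trunk keeps the full token sequence

    @property
    def head_dim(self) -> int:
        # Assumption: heads split the latent dimension evenly.
        if self.latent_dim % self.num_heads != 0:
            raise ValueError("latent_dim must be divisible by num_heads")
        return self.latent_dim // self.num_heads


# Hypothetical registry: sensor name -> encoder family named in the text.
ENCODER_FAMILY = {
    "gelsight_mini": "mae_vit",      # optical sensor
    "9dtact": "mae_vit",             # optical sensor
    "xela_uskin": "self_attention",  # array sensor
    "tac02": "self_attention",       # array sensor
}


def encoder_family(sensor: str) -> str:
    """Look up which encoder family a sensor would be routed to."""
    if sensor not in ENCODER_FAMILY:
        raise KeyError(f"no encoder registered for sensor {sensor!r}")
    return ENCODER_FAMILY[sensor]


config = TrunkConfig()
print(config.head_dim)
```

Under these assumptions each head operates on a 64-dimensional slice of the shared latent, and adding a sensor amounts to registering one more encoder while the trunk configuration stays fixed.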

For paired sensors $(i,j)$, HTT introduces a cross-sensor predictor $\mathcal{P}_{ij}$ implemented as a stack of cross-attention transformer blocks with learnable mask tokens. During pretraining, $\mathcal{P}_{ij}$ receives the full source embedding from sensor $i$ together with the visible subset of target-sensor tokens from $j$, and predicts the masked target embedding. These predictors are discarded after pretraining; only the encoders $\mathcal{E}_i$ and the shared trunk $\mathcal{T}$ remain as the reusable tactile backbone (Bi et al., 29 Jun 2026).

This organization matches the broader HTT design pattern described elsewhere. T3, for example, uses sensor-specific encoders to absorb low-level sensor shifts, a shared transformer trunk to learn sensor- and task-agnostic structure, and task-specific decoders for classification, regression, reconstruction, or policy conditioning. In that formulation, training batches are homogeneous in (sensor, task) pairs, but the shared trunk is reused across all pairs, enabling latent-space sharing without requiring cross-sensor alignment of raw data (Zhao et al., 2024).
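A minimal sketch of homogeneous batching follows, assuming that each batch is drawn from a single (sensor, task) combination and that samples are plain dictionaries. The function name and the `"sensor"` and `"task"` keys are hypothetical and are not taken from the T3 code base.

```python
import random
from collections import defaultdict


def homogeneous_batches(samples, batch_size, seed=0):
    """Group samples so every batch shares one (sensor, task) key.

    `samples` is an iterable of dicts with hypothetical keys "sensor" and "task".
    Returns a shuffled list of (key, batch) tuples.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    rng = random.Random(seed)

    # Bucket samples by their (sensor, task) pair.
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["sensor"], sample["task"])].append(sample)

    # Cut each bucket into batches; no batch ever mixes two pairs.
    batches = []
    for key, items in groups.items():
        rng.shuffle(items)
        for start in range(0, len(items), batch_size):
            batches.append((key, items[start:start + batch_size]))

    # Interleave pairs so a shared trunk would see all of them during training.
    rng.shuffle(batches)
    return batches


data = (
    [{"sensor": "gelsight", "task": "classify", "id": n} for n in range(5)]
    + [{"sensor": "digit", "task": "pose", "id": n} for n in range(3)]
)
for key, batch in homogeneous_batches(data, batch_size=2):
    print(key, [s["id"] for s in batch])
```

Because every batch maps to exactly one encoder and one decoder, only the shared trunk receives gradients from all pairs, which is the sharing mechanism the paragraph describes.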

The same pattern appears in policy-oriented settings. MiTaS uses modality-specific convolutional stems rather than per-sensor ViTs, but it preserves the HTT principle: heterogeneous visual and tactile streams are converted into a unified token sequence, augmented with learned positional and modality embeddings, and fused by full self-attention without a $[\mathrm{CLS}]$ bottleneck (Krohn et al., 4 Jun 2026). FTP-1 applies the idea at larger scale through heterogeneous encoders that produce morphology-aware tactile tokens for up to 24 functional areas, which are then modeled by a shared tactile Transformer expert of approximately 300M parameters (Yuan et al., 11 Jun 2026).

3. Data model, tokenization, and pretraining objectives

The named HTT framework is pretrained on the Heterogeneous Paired Tactile (HPT) dataset, which contains 1.6M synchronized paired tactile frames across four sensors spanning two tactile families: GelSight Mini and 9DTact as optical sensors, and Xela uSkin and TAC-02 as array sensors (Bi et al., 29 Jun 2026). Two hardware pairings are used: Pair A couples Xela with 9DTact, and Pair B couples TAC-02 with GelSight Mini. Data are collected through unscripted press, twist, and slide interactions with diverse household objects, and the paired streams are synchronized by a Universal Manipulation Interface setup (Bi et al., 29 Jun 2026).

HTT operates on short, fixed-length temporal windows. Optical inputs are resized to a fixed resolution, background-subtracted using a non-contact reference image, and tokenized with spatial ViT patching plus temporal “tubelets”: each window is divided into 2 tubelets of temporal size 2, and each tubelet is split into non-overlapping spatial patches that yield 196 tokens, hence 392 tokens per sample. Array signals are referenced by subtracting a non-contact baseline and are split into temporal patches of length 4, yielding 5 tokens for Xela or 10 tokens for TAC-02 within the same window (Bi et al., 29 Jun 2026).
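The token counts can be checked with a short arithmetic sketch. The 224-pixel input size and 16-pixel patch size are assumptions chosen because they are a common ViT configuration that gives 196 patches per tubelet; the text fixes only the token counts. The array window lengths of 20 and 40 samples are likewise back-computed from the reported token counts, assuming non-overlapping patches.

```python
def optical_token_count(frames, tubelet_size, image_size, patch_size):
    """Return (tubelets, tokens_per_tubelet, total_tokens) for square inputs."""
    if frames % tubelet_size != 0:
        raise ValueError("frames must be a multiple of tubelet_size")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    tubelets = frames // tubelet_size
    tokens_per_tubelet = (image_size // patch_size) ** 2
    return tubelets, tokens_per_tubelet, tubelets * tokens_per_tubelet


def array_token_count(window_samples, patch_length):
    """Return the number of non-overlapping temporal patches in a window."""
    if window_samples % patch_length != 0:
        raise ValueError("window_samples must be a multiple of patch_length")
    return window_samples // patch_length


# Assumed optical setup: 4 frames, tubelets of 2 frames, 224 px images, 16 px patches.
print(optical_token_count(frames=4, tubelet_size=2, image_size=224, patch_size=16))

# Assumed array windows: 20 samples (Xela) and 40 samples (TAC-02), patch length 4.
print(array_token_count(20, 4), array_token_count(40, 4))
```

The large gap between 392 optical tokens and 5 or 10 array tokens per sample is one reason the two families use different encoder types before projection into the shared latent space.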

Pretraining combines per-modality masked reconstruction with paired cross-modal alignment. The masked-autoencoding term reconstructs normalized masked tokens:

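$$\mathcal{L}_{\mathrm{MAE}} = \sum_{i \in \mathcal{I}} \frac{1}{|\mathcal{M}_i|} \sum_{k \in \mathcal{M}_i} \left\lVert \hat{x}_{i,k} - \bar{x}_{i,k} \right\rVert_2^2$$

Here $\mathcal{M}_i$ is the set of masked token positions for sensor $i$, $\bar{x}_{i,k}$ is the normalized input token at position $k$, and $\hat{x}_{i,k}$ is its reconstruction by the decoder $\mathcal{D}_i$; the expression is written schematically, with the squared-error penalty customary for masked autoencoders assumed. Masking ratios are 0.75 for optical modalities and 0.60 for taxel modalities. The alignment term performs regression on masked target embeddings with stop-gradient:

$$\mathcal{L}_{\mathrm{align}} = \sum_{(i,j)} \frac{1}{|\mathcal{M}^{\mathrm{tgt}}_j|} \sum_{k \in \mathcal{M}^{\mathrm{tgt}}_j} \left\lVert \mathcal{P}_{ij}\!\left(z_i,\, z_j^{\mathrm{vis}}\right)_k - \mathrm{sg}\!\left(z_{j,k}\right) \right\rVert_2^2$$

where the sum runs over paired sensors, $z_i$ is the full source embedding, $z_j^{\mathrm{vis}}$ is the visible subset of target tokens, $\mathcal{M}^{\mathrm{tgt}}_j$ is the set of masked target positions, and $\mathrm{sg}(\cdot)$ denotes stop-gradient; the same schematic squared-error form is assumed. Target-mask ratios for cross-modal prediction are 0.90 for optical targets and 0.80 for taxel targets. The total loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{MAE}} + \lambda\, \mathcal{L}_{\mathrm{align}}$$

where $\lambda$ is 0 during warmup, then linearly increases to its final value of 0.1 and remains fixed. Alignment gradients are blocked at the encoder outputs, so the encoders are trained solely by reconstruction while the shared trunk and predictors absorb the cross-sensor alignment signal (Bi et al., 29 Jun 2026).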
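The masked-reconstruction step can be illustrated with a pure-Python sketch. It assumes MAE-style per-token normalization and a mean-squared-error penalty, and it rounds the number of masked tokens to the nearest integer; all three choices, along with the function names, are assumptions made for illustration rather than details confirmed by the source.

```python
import random

# Mask ratios quoted in the text.
MAE_MASK_RATIO = {"optical": 0.75, "taxel": 0.60}
ALIGN_TARGET_MASK_RATIO = {"optical": 0.90, "taxel": 0.80}


def split_tokens(num_tokens, mask_ratio, rng):
    """Randomly partition token positions into (visible, masked) index lists."""
    num_masked = round(num_tokens * mask_ratio)  # rounding rule is an assumption
    order = list(range(num_tokens))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def normalize(token, eps=1e-6):
    """Per-token mean/variance normalization, as in MAE-style targets."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    std = (var + eps) ** 0.5
    return [(v - mean) / std for v in token]


def masked_mse(pred_tokens, raw_tokens, masked):
    """Mean squared error against normalized targets, over masked positions only."""
    if not masked:
        raise ValueError("at least one masked position is required")
    total = 0.0
    for k in masked:
        target = normalize(raw_tokens[k])
        total += sum((p - t) ** 2 for p, t in zip(pred_tokens[k], target)) / len(target)
    return total / len(masked)


rng = random.Random(0)
visible, masked = split_tokens(392, MAE_MASK_RATIO["optical"], rng)
print(len(visible), len(masked))
```

With 392 optical tokens and a ratio of 0.75, the encoder sees 98 tokens and the decoder is scored on the remaining 294; visible positions contribute nothing to the loss.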

The reported optimization regime uses AdamW with linear learning-rate warmup over 2,000 steps followed by cosine decay, a batch size of 256 paired samples, gradient clipping at 1.0, and 50,000 pretraining steps. The alignment weight is ramped to 0.1 during the first 20,000 steps (Bi et al., 29 Jun 2026).
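The two schedules can be sketched as plain functions of the training step. The step counts and the final alignment weight of 0.1 come from the text; the assumption that the alignment weight stays at zero for the same 2,000 steps as the learning-rate warmup, and all learning-rate values passed in the example, are hypothetical.

```python
import math


def alignment_weight(step, warmup_steps=2_000, ramp_end=20_000, final_weight=0.1):
    """Zero during warmup, linear ramp until ramp_end, constant afterwards.

    Assumption: the zero-weight phase coincides with the 2,000-step LR warmup.
    """
    if step < warmup_steps:
        return 0.0
    if step >= ramp_end:
        return final_weight
    return final_weight * (step - warmup_steps) / (ramp_end - warmup_steps)


def learning_rate(step, peak_lr, start_lr, end_lr,
                  warmup_steps=2_000, total_steps=50_000):
    """Linear warmup from start_lr to peak_lr, then cosine decay to end_lr."""
    if step < warmup_steps:
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical learning-rate values, used only to exercise the schedule.
for step in (0, 2_000, 11_000, 20_000, 50_000):
    lr = learning_rate(step, peak_lr=1e-3, start_lr=1e-5, end_lr=1e-5)
    print(step, alignment_weight(step), lr)
```

Delaying the alignment term lets the encoders and trunk first settle on a reconstruction-driven representation before cross-sensor regression targets begin to shape the trunk.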

This paired-data strategy differs from adjacent formulations. T3 explicitly avoids contrastive cross-sensor alignment because its multi-sensor corpus is unaligned, whereas TactX uses synchronized paired contacts across resistive, magnetic, and vision-based sensors and aligns them with NT-Xent on posterior means plus self- and cross-reconstruction in a shared 16-dimensional latent space (Zhao et al., 2024, Park et al., 30 Jun 2026). SITR, by contrast, addresses transfer across optical sensors through domain-randomized simulation of sensor designs, seeking invariance primarily through synthetic diversity rather than synchronized cross-sensor pairing (Gupta et al., 27 Feb 2025).

4. Empirical behavior in perception and manipulation

HTT is evaluated on object classification, force estimation, slip detection, real-world manipulation with unseen sensors, and simulated manipulation. The perception results show that masked pretraining is already strong, but paired cross-sensor alignment improves or matches it on most settings, especially for optical sensing and slip-related tasks (Bi et al., 29 Jun 2026).

For 20-way object classification, HTT attains 94.84 on 9DTact and 91.35 on GelSight Mini, compared with 90.08 and 88.59 for MAE-only pretraining. On TAC-02, HTT and MAE are nearly identical at 26.20 and 26.16, whereas on Xela HTT drops to 52.41 from 56.68. The reported overall average across sensors is 66.20 for HTT, 65.38 for MAE, 47.54 for training from scratch, and 77.83 for SITR reported as “optical only” (Bi et al., 29 Jun 2026).

For 3D force estimation, where lower mean absolute error is better, HTT yields 0.695 on Xela, 0.508 on TAC-02, 0.606 on 9DTact, and 0.736 on GelSight Mini, with an overall average of 0.636. The optical-only pretraining baselines are weaker on this task: T3 reaches 2.678 on 9DTact and 1.197 on GelSight Mini, while SITR reaches 1.085 and 1.373, respectively (Bi et al., 29 Jun 2026).

For slip detection, evaluated by macro-F1 because the automatically generated labels are highly imbalanced, HTT achieves 54.21 on Xela, 45.45 on TAC-02, 53.09 on 9DTact, and 72.65 on GelSight Mini, for an overall average of 56.35. This exceeds the corresponding MAE average of 51.62 and is markedly above T3 at 34.76 and SITR at 39.07 (Bi et al., 29 Jun 2026).
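The overall averages quoted in these three paragraphs are consistent with unweighted means over the four sensors, which the following sketch checks. It uses `decimal` arithmetic to avoid floating-point rounding artifacts; the half-up rounding rule and the dictionary keys are assumptions for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def mean_rounded(values, places):
    """Unweighted mean of decimal strings, rounded half-up to `places`."""
    total = sum(Decimal(v) for v in values)
    mean = total / Decimal(len(values))
    return mean.quantize(Decimal(places), rounding=ROUND_HALF_UP)


# Per-sensor scores quoted in the text, ordered Xela, TAC-02, 9DTact, GelSight Mini.
PER_SENSOR = {
    "classification_htt": ["52.41", "26.20", "94.84", "91.35"],
    "classification_mae": ["56.68", "26.16", "90.08", "88.59"],
    "force_error_htt": ["0.695", "0.508", "0.606", "0.736"],
    "slip_f1_htt": ["54.21", "45.45", "53.09", "72.65"],
}

print(mean_rounded(PER_SENSOR["classification_htt"], "0.01"))
print(mean_rounded(PER_SENSOR["classification_mae"], "0.01"))
print(mean_rounded(PER_SENSOR["force_error_htt"], "0.001"))
print(mean_rounded(PER_SENSOR["slip_f1_htt"], "0.01"))
```

Because the means are unweighted, the low TAC-02 classification score pulls the HTT average well below its optical-sensor results, which is worth keeping in mind when comparing against the optical-only SITR figure.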

In real-world manipulation, HTT is used as the tactile observation encoder in ACT policies on a Franka arm with a Sharpa hand, under a deliberately stringent transfer condition: the fingertip tactile hardware is unseen during HTT pretraining, and zero-shot adaptation is performed by directly applying the 9DTact encoder to the new fingertip tactile images. On the toy-screw task, success rises from 5% with qpos-only observations and 50% with fingertip wrench inputs to 95% with HTT embeddings. On grasp-tofu, the corresponding numbers are 5%, 35%, and 55%; slip remains the main failure mode, but HTT reduces slip relative to force-only inputs and avoids crushing failures observed with wrench-based policies (Bi et al., 29 Jun 2026).

In simulated ManiFeel tasks, HTT also transfers across sensing modalities. Success rates on peg insertion and bulb installation are reported for five tactile encoders: tacRGB, T3, SITR, HTT(RGB), and HTT(FF) (Bi et al., 29 Jun 2026).

A concise comparison of representative HTT-like systems is useful because the literature now contains several closely related but non-identical formulations.

System	Core mechanism	Reported scope
SITR (Gupta et al., 27 Feb 2025)	Transformer-based sensor-invariant latent from diverse simulated optical sensors	Zero-shot or minimal-calibration transfer across optical tactile sensors
T3 (Zhao et al., 2024)	Sensor-specific encoders, shared trunk, task-specific decoders; FoTa with 3,083,452 datapoints from 13 sensors and 11 tasks	Zero-shot in certain sensor-task pairings; 25% higher insertion success than scratch tactile encoders and 53% higher than without tactile sensing
MiTaS (Krohn et al., 4 Jun 2026)	Modality-specific CNN stems and transformer fusion over RGB, GelSight Mini, and Evetac	80% average success across five contact-rich manipulation tasks
TactX (Park et al., 30 Jun 2026)	Modality-specific encoders aligned by paired-contact NT-Xent, cross-reconstruction, and KL in a 16-D latent	Average zero-shot manipulation success improves from 27.5% to 45.9%
FTP-1 (Yuan et al., 11 Jun 2026)	Heterogeneous encoders into morphology-aware tactile tokens modeled by a shared tactile Transformer expert	+17.2% on seen sensor setups and +31.6 percentage points on unseen tactile sensors

5. HTT in the wider research lineage

The term HTT is explicit in the 2026 paper, but the architectural idea is older and broader. The survey “Transformer in Touch: A Survey” defines the relevant design space as Transformer-based architectures built to handle heterogeneous tactile inputs across sensor types, spatiotemporal structures, and cross-modal signals, and identifies the requisite components: sensor-aware tokenization, modality/type embeddings, within-modality encoders, cross-modality self- and cross-attention, and self-supervised pretraining such as masked modeling or contrastive alignment (Gao et al., 2024).

Within optical tactile sensing, SITR can be read as an HTT-style canonicalization strategy. It uses a transformer-based architecture trained on a diverse dataset of simulated sensor designs, with domain randomization over optics and mechanics, so that latent features emphasize contact semantics rather than sensor idiosyncrasies. Although the paper does not use the name HTT, its stated goal is exactly to learn a canonical, sensor-invariant tactile representation supporting zero-shot or minimal-calibration transfer to unseen optical sensors (Gupta et al., 27 Feb 2025).

T3 generalizes the same idea to multi-sensor, multi-task learning. Its FoTa dataset contains 3,083,452 datapoints gathered from 13 sensors and 11 tasks, and T3 operationalizes the HTT pattern through sensor-specific encoders, a shared trunk transformer, and task-specific decoders. The model scales in performance with network size, supports zero-shot transfer in certain sensor-task pairings, can be fine-tuned with about 2,000 samples, and functions as a tactile encoder for long-horizon electronics insertion, where the pretrained tactile encoder yields a success rate 25% higher than scratch-trained tactile encoders and 53% higher than policies without tactile sensing (Zhao et al., 2024).

MiTaS pushes HTT toward heterogeneous fusion at control time rather than only representation transfer. It combines two-frame RGB wrist observations at 25 Hz, two-frame GelSight Mini observations at 25 Hz, and 16-frame Evetac temporal volumes drawn from a 200 Hz event-based tactile stream. Modality-specific 2D and 3D CNN stems produce aligned token grids, learned positional and modality embeddings preserve spatial layout and sensor identity, and a transformer encoder performs full self-attention across the concatenated sequence. The fused tokens condition a flow-matching policy and yield an average success rate of 80%, compared with 31% for vision-only and 54% for a visual-tactile baseline (Krohn et al., 4 Jun 2026).

TactX extends the heterogeneous-tactile problem across transduction modalities that are more fundamentally distinct than optical versus array sensing. Its encoders map Daimon DM-Tac W, eFlesh, and FlexiTac observations into a shared latent $z \in \mathbb{R}^{16}$ using paired contact data as the alignment signal. The model uses NT-Xent with temperature $\tau$, self- and cross-reconstruction, and KL regularization to a shared Gaussian prior, yielding transitive cross-modal consistency and enabling zero-shot policy transfer across resistive, magnetic, and vision-based sensors (Park et al., 30 Jun 2026).
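For reference, the standard NT-Xent objective can be sketched in pure Python for two sets of paired embeddings. This is the generic formulation with cosine similarity, not TactX's implementation; the temperature value in the example is hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature):
    """Generic NT-Xent loss; view_a[i] and view_b[i] embed the same contact."""
    n = len(view_a)
    if n == 0 or n != len(view_b):
        raise ValueError("views must be non-empty and of equal length")
    z = list(view_a) + list(view_b)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the paired embedding
        # Similarities to every other sample; the anchor itself is excluded.
        logits = [_cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = _cosine(z[i], z[positive]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


sensor_a = [[1.0, 0.0], [0.0, 1.0]]
sensor_b = [[1.0, 0.1], [0.1, 1.0]]
aligned = nt_xent(sensor_a, sensor_b, temperature=0.5)
swapped = nt_xent(sensor_a, sensor_b[::-1], temperature=0.5)
print(aligned < swapped)
```

The loss is lower when embeddings of the same contact are close across sensors and far from embeddings of other contacts, which is the property that paired-contact alignment relies on.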

FTP-1 extends the HTT principle to a generalist manipulation policy. It aggregates around 3,000 hours of tactile manipulation data from 26 sources and 21 sensors, spanning image-, array-, and state-based tactile signals. Each modality is tokenized into a Morphology-Aware Tactile Token Space with up to 24 functional-area slots, and a shared tactile Transformer expert jointly models these tokens across sensors and embodiments. The resulting policy improves success on seen sensor setups by +17.2% and on unseen tactile-sensor setups by +31.6 percentage points, establishing an HTT-like formulation at the scale of foundation tactile policy learning (Yuan et al., 11 Jun 2026).

6. Limitations, unresolved issues, and likely directions

Despite the progress, current HTT formulations retain several structural limitations. The named HTT framework is pretrained only on optical and array sensors; magnetic-, fluid-based, and other tactile families are absent. Its HPT dataset uses only cross-family pairings, so the effect of optical–optical or array–array pairing is not explored. The method also relies on synchronized paired recordings, and the current alignment objective does not explicitly model geometric correspondences between, for example, a GelSight patch and a taxel location (Bi et al., 29 Jun 2026).

Paired-data dependence is a broader issue. TactX likewise requires synchronized contacts and careful alignment of pose, rotation, timestamp, and sensing area, and notes that residual misalignment can hurt both cross-reconstruction and latent alignment. Its representation learning is quasi-static and delegates temporal reasoning to the downstream ACT policy, so temporal phenomena such as incipient slip and vibration remain outside the shared latent itself (Park et al., 30 Jun 2026).

Temporal modeling remains uneven across the field. T3 is strong on multi-sensor and multi-task transfer, but its current formulation centers on per-image encoding and two-frame relative predictions rather than full sequence modeling; richer temporal transformers are identified there as a natural extension for slip detection and continuous contact state estimation. MiTaS, conversely, does model multi-rate temporal structure through 3D event stems and fixed sensor windows, but its deployment depends on specific tactile hardware and 4-DoF control at 15 Hz, with only 30 teleoperated demonstrations per task (Zhao et al., 2024, Krohn et al., 4 Jun 2026).

Scaling and data balance are also unresolved. T3 reports that FoTa is skewed toward two popular sensors, which may bias the shared trunk, while FTP-1 notes that even a ~3,000-hour corpus remains small relative to vision-language pretraining and that performance saturates after roughly 50k steps. The tactile survey frames this more generally as a data-hunger and compute problem for Transformer-based tactile systems, especially when long spatiotemporal streams or many sensor modalities are fused without efficient attention mechanisms (Yuan et al., 11 Jun 2026, Gao et al., 2024).

A plausible implication is that future HTT research will separate into two complementary directions. One direction emphasizes tighter sensor-agnostic representation learning through better alignment objectives, larger heterogeneous corpora, and encoder addition for new modalities. The other emphasizes policy integration, where HTT backbones serve as reusable tactile experts conditioned by morphology, time, and control context. The existing literature already outlines both paths: paired alignment and shared low-dimensional latents in TactX, shared multi-task trunks in T3, full self-attention fusion in MiTaS, morphology-aware tokenization in FTP-1, and explicit sensor-agnostic pretraining over optical–array pairs in HTT proper (Park et al., 30 Jun 2026, Zhao et al., 2024, Krohn et al., 4 Jun 2026, Yuan et al., 11 Jun 2026, Bi et al., 29 Jun 2026).