Humanoid Transformer with Touch Dreaming (HTD)

Updated 16 April 2026

The paper introduces a unified framework for whole-body humanoid manipulation that combines RL-based control, VR teleoperation, and a multimodal Transformer with latent-space touch dreaming.
HTD employs dedicated modality tokenizers and an encoder–decoder Transformer to fuse vision, proprioception, force, and tactile data into a coherent latent representation.
The approach achieves a 90.9% relative improvement in success rate over the strongest ACT baseline across five contact-intensive real-world tasks.

Humanoid Transformer with Touch Dreaming (HTD) is an integrated framework for dexterous, contact-rich whole-body humanoid manipulation, unifying RL-based whole-body control, scalable VR teleoperation data collection, and a multimodal Transformer policy that predicts both actions and imagined future touch feedback. HTD models vision, proprioception, force, and high-dimensional tactile sensing as core modalities, and introduces latent-space touch dreaming to facilitate contact-aware policy learning. The approach achieves substantial improvements in real-world humanoid loco-manipulation across diverse, physically challenging tasks (Niu et al., 14 Apr 2026).

1. System Architecture

HTD is a single-stage imitation policy built around an encoder–decoder Transformer that takes multimodal robot observations and outputs both future action sequences and predictions of near-future tactile events. The design is modular, enabling direct integration of diverse sensing streams.

1.1 Modality Tokenizers

Each input stream $m \in \{\text{head‐cam}, \text{wrist‐cam}_1, \text{wrist‐cam}_2, \text{proprio}, \text{force}, \text{tactile}\}$ is processed by a dedicated featurization network: ResNet for images, MLPs for proprioception and force, and per-region CNNs for segmented tactile sensor arrays. Resulting features are then compressed via cross-attention aggregation into a fixed number of modality tokens: $T_m: x_m \mapsto \{t_{m,1}, \dots, t_{m,N_m}\}$
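The aggregation step above can be illustrated with a minimal, dependency-free sketch of cross-attention pooling. It assumes a single attention head and omits learned key/value projections; the function name `cross_attention_pool` and all dimensions are hypothetical and are not taken from the paper. The point is only that a fixed set of query vectors yields a fixed number of tokens regardless of how many backbone features a modality produces.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(queries, features):
    """Compress a variable-length feature list into len(queries) tokens.

    queries:  N_m query vectors of dimension d (learnable in a real model).
    features: L feature vectors of dimension d from the modality backbone.
    Assumption: single head, no key/value projections, to keep this short.
    """
    d = len(queries[0])
    tokens = []
    for q in queries:
        # Scaled dot-product score between this query and every feature.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d)
                  for f in features]
        weights = softmax(scores)
        # Each token is an attention-weighted average of the features.
        token = [sum(w * f[j] for w, f in zip(weights, features))
                 for j in range(d)]
        tokens.append(token)
    return tokens


random.seed(0)
d, n_tokens = 8, 4  # hypothetical sizes
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]
short_stream = [[random.gauss(0, 1) for _ in range(d)] for _ in range(10)]
long_stream = [[random.gauss(0, 1) for _ in range(d)] for _ in range(49)]

# Both inputs map to the same fixed token count N_m = 4.
print(len(cross_attention_pool(queries, short_stream)))  # 4
print(len(cross_attention_pool(queries, long_stream)))   # 4
```

Because the token count depends only on the number of queries, streams with very different raw sizes (image feature maps, tactile regions, a single proprioceptive vector) can be concatenated into one sequence of predictable length.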

1.2 Transformer Trunk and Decoding

The concatenated tokens from all modalities ($\sum_m N_m$ in total) are passed through a standard Transformer encoder with multi-head self-attention and feedforward layers, generating context-rich latent state embeddings. In the decoder, a set of learnable query embeddings $\{q_k\}$ produces fixed-size output tokens, which are consumed by downstream prediction heads.

1.3 Action Experts and Chunked Output

Post-decoder, each control modality (base-velocity, torso-pose, end-effector, hand joints) is assigned an expert head: a cross-attention block drawing from the decoded tokens to predict an action chunk $\{\mathbf{a}_{t+1}, \dots, \mathbf{a}_{t+h}\}$ for chunk length $h$ . This chunked output scheme allows lower streaming policy rates (e.g., 30 Hz) while furnishing high-frequency references to the whole-body controller (50 Hz).
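One way a 30 Hz policy could feed a 50 Hz controller is sketched below. This is an assumption about how chunks might be consumed, not the paper's documented scheduling: the policy emits a fresh chunk every 1/30 s, each chunk entry is assumed to cover one 1/50 s controller slot, and the controller indexes into the most recent chunk. The names `fake_policy` and `simulate` are hypothetical. A shared 150 Hz integer clock (the least common multiple of the two rates) avoids floating-point timing.

```python
BASE_HZ = 150                    # common clock: lcm(30, 50)
POLICY_PERIOD = BASE_HZ // 30    # policy emits a chunk every 5 base ticks
CONTROL_PERIOD = BASE_HZ // 50   # controller reads every 3 base ticks


def fake_policy(start_tick, h):
    """Stand-in for the policy: returns h labelled placeholder actions."""
    return [f"chunk@{start_tick}/a{k + 1}" for k in range(h)]


def simulate(duration_ticks, h=4):
    """Run both loops on the shared clock and log what the controller reads."""
    chunk, chunk_start = None, None
    policy_calls, reads = 0, []
    for now in range(duration_ticks):
        if now % POLICY_PERIOD == 0:
            # New chunk replaces the old one (no temporal blending here).
            chunk, chunk_start = fake_policy(now, h), now
            policy_calls += 1
        if now % CONTROL_PERIOD == 0:
            # Chunk entry whose 50 Hz slot contains `now`; clamp so that a
            # late policy update would reuse the last available action.
            k = min((now - chunk_start) // CONTROL_PERIOD, h - 1)
            reads.append((now, k, chunk[k]))
    return policy_calls, reads


policy_calls, reads = simulate(BASE_HZ)  # one simulated second
print(policy_calls, len(reads))          # 30 policy calls, 50 controller reads
print(max(k for _, k, _ in reads))       # deepest chunk index used: 1
```

With on-time policy updates only the first two entries of each chunk are ever consumed; the remaining entries act as a buffer if inference is delayed.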

1.4 Multimodal Fusion

Global (head) and local (wrist) vision, proprioceptive state, per-joint hand forces, and anatomically segmented tactile arrays (1062 sensors per hand, 6 regions per hand) are independently tokenized and then fused within the shared encoder, ensuring every expert head and the touch-dreaming mechanism operate over a joint latent representation.

2. Touch Dreaming: Predictive Tactile Modeling

HTD introduces touch dreaming: explicit prediction of short-horizon future hand-joint forces and tactile latents. This acts as an additional regularizer, enriching latent representations with contact awareness beyond what can be derived from raw action imitation objectives alone.

2.1 Dreaming Targets

Force Heads output the next $\tau$ timesteps of hand joint force or torque vectors: $\hat{\mathbf{f}}_{t+1}, \dots, \hat{\mathbf{f}}_{t+\tau}$ .
Tactile-Latent Heads output predicted tactile latent vectors for the same horizon: $\hat{\mathbf{z}}_{t+1}, \dots, \hat{\mathbf{z}}_{t+\tau}$ .
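The dreaming targets for a timestep $t$ are simply the next $\tau$ entries of the recorded signal. A minimal sketch of building such targets from a logged sequence follows; the helper name `future_windows` is hypothetical, and the scalar sequence stands in for force vectors or tactile latents.

```python
def future_windows(seq, tau):
    """For each t with a full horizon, return [seq[t+1], ..., seq[t+tau]]."""
    return [seq[t + 1 : t + 1 + tau] for t in range(len(seq) - tau)]


forces = [0.0, 0.1, 0.4, 0.9, 0.3]  # toy per-timestep readings
targets = future_windows(forces, tau=2)
print(targets)  # [[0.1, 0.4], [0.4, 0.9], [0.9, 0.3]]
```

Timesteps too close to the end of an episode have no complete horizon and are dropped in this sketch; padding or masking would be an alternative.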

2.2 Latent-Space Tactile Modeling

Rather than operate directly on the sparse, high-dimensional raw tactile arrays, HTD predicts compact tactile representations in a learned latent space. A teacher tactile tokenizer $T_{\rm tact}^T$ , realized as an exponential moving average (EMA) of the student tactile encoder, provides target latents for future ground-truth tactile states: $\mathbf{z}_{t+k} = T_{\rm tact}^T(x_{\text{tactile},\,t+k})$ for $k = 1, \dots, \tau$ . This avoids representation collapse and provides stable, semantically meaningful supervision without separate pretraining.
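The EMA relationship between teacher and student can be sketched in a few lines. The decay value and the flat list of weights are illustrative assumptions; the source does not state the decay used.

```python
def ema_update(teacher, student, decay=0.99):
    """Move teacher weights toward the student: t <- decay*t + (1-decay)*s."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


teacher = [0.0, 0.0]
student = [1.0, -2.0]      # held fixed here; in training it changes each step
for _ in range(100):
    teacher = ema_update(teacher, student)

# With a fixed student, the teacher closes the gap geometrically:
# teacher = student * (1 - decay**n) after n updates.
print(teacher)
```

The teacher receives no gradients; it changes only through this averaging, which is what makes its latents a slowly moving, stable target.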

3. Objectives, Loss Functions, and Optimization

HTD training combines behavioral cloning (BC) of action chunks, force dreaming loss, and latent tactile dreaming loss, yielding joint optimization of actuation and predictive somatosensory modeling.

3.1 Behavioral Cloning on Action Chunks

For demonstration data $\mathcal{D}$ , the BC objective per action modality $c$ is: $\mathcal{L}_{\rm BC}^{(c)} = \mathbb{E}_{\mathcal{D}}\left[\frac{1}{h}\sum_{k=1}^{h} \ell\big(\hat{\mathbf{a}}^{(c)}_{t+k}, \mathbf{a}^{(c)}_{t+k}\big)\right]$ where $\ell$ is a per-step regression penalty and $h$ is the chunk length.

3.2 Force Dreaming Loss

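Predicted force vectors are compared to future measured forces: $\mathcal{L}_{\rm force} = \mathbb{E}_{\mathcal{D}}\left[\frac{1}{\tau}\sum_{k=1}^{\tau} \ell\big(\hat{\mathbf{f}}_{t+k}, \mathbf{f}_{t+k}\big)\right]$ where $\ell$ is a per-step regression penalty and $\tau$ is the dreaming horizon.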

3.3 Tactile Latent Dreaming Loss

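For tactile latents, a composite loss combines cosine similarity and magnitude alignment: $\mathcal{L}_{\rm tact} = \mathbb{E}_{\mathcal{D}}\left[\frac{1}{\tau}\sum_{k=1}^{\tau} \Big( 1 - \cos\big(\hat{\mathbf{z}}_{t+k}, \mathbf{z}_{t+k}\big) + \lambda_{\rm mag}\, d_{\rm mag}\big(\hat{\mathbf{z}}_{t+k}, \mathbf{z}_{t+k}\big) \Big)\right]$ where $\mathbf{z}_{t+k}$ is the teacher-provided target latent, $d_{\rm mag}$ penalizes the mismatch between the norms of predicted and target latents, and $\lambda_{\rm mag}$ weights the magnitude term.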
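A minimal sketch of a cosine-plus-magnitude loss for one predicted latent is given below. The squared difference of Euclidean norms as the magnitude term, the weight `mag_weight`, and the function name are assumptions made for illustration; the source states only that cosine similarity and magnitude alignment are combined.

```python
import math


def tactile_latent_loss(pred, target, mag_weight=1.0, eps=1e-8):
    """(1 - cosine similarity) + mag_weight * (||pred|| - ||target||)^2."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cos_term = 1.0 - dot / (norm_p * norm_t + eps)  # direction mismatch
    mag_term = (norm_p - norm_t) ** 2               # scale mismatch (assumed form)
    return cos_term + mag_weight * mag_term


print(tactile_latent_loss([1.0, 0.0], [1.0, 0.0]))  # ~0: identical latents
print(tactile_latent_loss([2.0, 0.0], [1.0, 0.0]))  # ~1: right direction, wrong scale
print(tactile_latent_loss([0.0, 1.0], [1.0, 0.0]))  # ~1: right scale, orthogonal
```

Cosine similarity alone is blind to scale, so the second case would score as perfect without the magnitude term; that is the motivation for combining the two.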

3.4 Total Objective

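The aggregated loss for parameter set $\theta$ is: $\mathcal{L}(\theta) = \sum_{c} \mathcal{L}_{\rm BC}^{(c)} + \lambda_{\rm force}\,\mathcal{L}_{\rm force} + \lambda_{\rm tact}\,\mathcal{L}_{\rm tact}$ where the sum runs over action modalities, $\mathcal{L}_{\rm force}$ and $\mathcal{L}_{\rm tact}$ are the force and tactile-latent dreaming losses, and $\lambda_{\rm force}$ , $\lambda_{\rm tact}$ are weighting coefficients.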
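As a small worked example, the combination of the three loss families can be written as a weighted sum. The weights and the per-head loss values below are made up for illustration; the source does not report them.

```python
def total_loss(bc_losses, force_loss, tactile_loss, w_force=1.0, w_tactile=1.0):
    """Sum BC losses over action heads, then add weighted dreaming losses."""
    return sum(bc_losses.values()) + w_force * force_loss + w_tactile * tactile_loss


# Hypothetical per-head BC losses for the four control modalities.
bc = {"base_velocity": 0.2, "torso_pose": 0.1, "end_effector": 0.3, "hand_joints": 0.4}
loss = total_loss(bc, force_loss=0.5, tactile_loss=0.25, w_force=0.5, w_tactile=2.0)
print(round(loss, 6))  # 1.0 + 0.5*0.5 + 2.0*0.25 = 1.75
```

Setting both dreaming weights to zero recovers plain behavioral cloning, which corresponds to the no-dreaming ablation conditions.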

3.5 Training Loop

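The learning process is succinctly expressed as a gradient-based update of the form $\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta)$ , followed by the EMA update of the teacher tactile tokenizer, $\theta_{\rm tact}^{T} \leftarrow \mu\,\theta_{\rm tact}^{T} + (1-\mu)\,\theta_{\rm tact}$ , with learning rate $\eta$ and EMA decay $\mu$ .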
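The ordering of the two updates, a gradient step on the student followed by an EMA step on the teacher, can be shown with a deliberately tiny toy problem. Everything here is an illustrative assumption: a single scalar weight, a squared-error loss with a hand-written gradient, plain gradient descent, and arbitrary values for the learning rate and decay. It is not the paper's training code.

```python
def train(steps=200, lr=0.1, decay=0.9):
    """Toy loop: fit w so that w * x == y, tracking an EMA copy of w."""
    student, teacher = 0.0, 0.0   # one scalar "weight" each
    x, y = 1.0, 2.0               # single toy training pair, optimum at w = 2
    for _ in range(steps):
        grad = 2.0 * x * (student * x - y)   # d/dw of (w*x - y)^2
        student -= lr * grad                 # 1) gradient step on the student
        # 2) EMA step on the teacher; no gradient flows into it.
        teacher = decay * teacher + (1.0 - decay) * student
    return student, teacher


student, teacher = train()
print(student, teacher)  # both approach 2.0, the teacher lagging slightly
```

In the full method the teacher would additionally supply the tactile target latents inside the loss before each gradient step; the toy omits that coupling to keep the update order visible.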

4. Data Collection and Policy Supervision

HTD’s policy training is facilitated by a comprehensive, real-world demonstration pipeline, sourcing multimodal data at scale.

4.1 RL-Based Whole-Body Controller

A teacher–student RL pipeline (PPO→DAgger) trains a whole-body controller (WBC) in simulation to track base velocity and torso orientation/height. The teacher leverages privileged contact information; the student uses only proprioception plus short history, outputting 15 lower-body joint targets (12 leg DoF, 3 waist DoF).

4.2 VR Teleoperation and Human-to-Humanoid Mapping

Demonstrators use VR head and hand trackers plus a joystick for real-time base control, with human motions mapped to robot torso commands (roll/pitch/yaw, height), 6D end-effector poses, and high-DoF hand commands. The control stack runs at 50 Hz:

WBC: lower-body joints
IK solver: upper-body
DexPilot: hand actuation

4.3 Multimodal Demonstration Dataset

Synchronized data streams at 50 Hz per timestep include dual-lens head RGB, two wrist RGB, proprioception, per-joint hand force, high-resolution tactile (17 regions × ~60 sensors ≈ 1062 channels/hand), and target action commands. This provides extensive samples for robust behavior learning under diverse contact scenarios.

5. Experimental Evaluation

Performance is measured across five contact-rich real-world manipulation tasks using 20 trials per variant.

5.1 Task List

Task	Short Description
Insert-T	3.5 mm-tight peg insertion using whole-body coordination
Book Organization	Overhanging push, grasp, and shelving of hardcover books
Towel Folding	Multi-stage manipulation of deformable fabric
Cat Litter Scooping	Squatting, scoop grasp/manipulation, granular transfer
Tea Serving	Walking transfer of two cups and precise placement

5.2 Baselines and Ablations

ACT (Visual + Proprio): decoder-only Transformer, vision and proprio as input
ACT (Visual + Proprio + Touch): above plus force and tactile
Touch ablation variants: no touch/dream, touch input only, raw tactile dreaming, latent-space tactile dreaming (full HTD)

5.3 Quantitative Results

HTD achieves a 90.9% relative improvement in success rate over the strongest ACT baseline and a 31.1% relative gain in average task score. Simply providing touch input, without dreaming, yielded marginal or no benefit; only explicit dream-based supervision led to robust generalization.
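Because the reported gains are relative rather than absolute, a short arithmetic sketch may help in reading them. The success rates below are invented purely to show the calculation and are not the paper's numbers.

```python
def relative_improvement(baseline, new):
    """Fractional gain of `new` over `baseline`: (new - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical rates: 40% -> 60% is +20 percentage points absolute,
# but a 50% relative improvement.
print(f"{relative_improvement(0.40, 0.60):.1%}")  # 50.0%
```

A relative improvement of 90.9% therefore means the success rate is roughly 1.9 times that of the baseline, not that it rose by 90.9 percentage points.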

5.4 Touch Ablation Outcomes

Latent-space tactile dreaming yields approximately 30% greater relative success than direct raw tactile array prediction, and consistently outperforms all non-dreaming and raw-touch-dreaming conditions. This suggests that compact, semantically meaningful tactile latents provide more effective supervision for manipulation requiring nuanced contact awareness.

5.5 Touch Prediction Visualization

HTD’s predicted force and tactile-latent trajectories closely track measured ground truth across sustained contact intervals, capturing the timing and strength of interactions. Divergence is most pronounced at abrupt contact transitions, a likely consequence of open-loop inference. The latent representations learned by touch dreaming display reduced sensitivity to sensor noise and encapsulate structured features associated with high-contact-force events.

6. Context and Implications

HTD demonstrates that combining a robust, RL-trained whole-body controller with scalable VR teleoperation data and a single-stage multimodal Transformer policy, augmented by predictive somatosensory modeling via touch dreaming, results in significant gains for generalizable real-world humanoid manipulation. Latent-space tactile dreaming, in particular, provides semantically stable and efficient contact-relevant cues for policy optimization, as evidenced by real-world ablations and task performance. These results underscore the value of predictive tactile modeling in enabling robust loco-manipulation in environments characterized by frequent and varied contact dynamics (Niu et al., 14 Apr 2026).

References

Niu et al. (2026). Learning Versatile Humanoid Manipulation with Touch Dreaming.
