TacGen: Touch Is a Necessary Dimension of Physical-World Representation -- Addressing Tactile Data Scarcity with Scalable Vision-to-Touch Alignment and Generation

Published 28 Jun 2026 in cs.RO | (2606.29173v2)

Abstract: Touch resolves the physical-property ambiguity left by vision: exploratory contact recovers shape, texture, compliance, and material, and visuo-haptic object representations converge in ventral visual cortex. We ask whether representation learning can reproduce this grounding. TacGen mitigates the tactile-data scarcity bottleneck by combining pre-specified V+T contrastive alignment with a latent-space residual-MLP V->T generator that synthesizes tactile latents from RGB for tactile-data scaling. With matched DINOv2 backbones, splits, and probes, V+T improves matched V-only on mass (Delta R^2=+0.570), density (Delta acc=+0.067), hardness (+0.117), and uncertainty-banded force labels (Delta R^2=+0.281); all CIs exclude zero. The same representation lifts matched-capacity TACTO manipulation 0.246->0.979 while V-only capacity scaling accounts for only 4.5% of the gap, preserving 95.5%. The generator reaches cross-seed +0.589, with real tactile +0.585 inside the seed interval; the architecture comparison shows a 13pp downstream gap between reconstruction quality and representation utility. Across five-seed SSVTP/TVL reproductions, YCB-Sight transfer, three-backbone checks, permutation/random-feature controls, hash-verified manifests, and measured-force validation checks, the evidence supports the claim that touch supplies a necessary physical evidence channel for representations of contact-dependent properties.

Authors (22)

Summary

The paper establishes that tactile sensing is crucial for inferring contact-dependent object properties that vision-only models miss.
The methodology uses visuo-tactile contrastive alignment and a latent diffusion generator to synthesize tactile features from RGB data.
Empirical results reveal significant gains in physical-property probes and robotic manipulation when integrating real and synthetic tactile data.

Touch as a Necessary Modality: A Technical Summary of "TacGen: Touch Is a Necessary Dimension of Physical-World Representation" (2606.29173)

Motivation and Problem Statement

The work addresses a crucial limitation in embodied AI: the inability of vision-only models to infer material properties such as mass, density, hardness, friction, and compliance when those properties are underdetermined by appearance. The authors posit that, analogous to primate perception, tactile input is not merely useful but fundamentally necessary for constructing physically grounded representations of contact-dependent object properties. The central challenge is the scarcity of large-scale, high-quality paired visuo-tactile datasets, which impedes the scaling and transfer of tactile-informed representations.

Methodology

Visuo-Tactile Contrastive Alignment

TacGen deploys a conservative evaluation protocol: both vision (RGB) and tactile (DIGIT sensor) modalities are processed using frozen DINOv2 backbones, with canonical SHA-256 verified feature extraction. Vision and tactile tokens are projected via MLP heads and aligned in a shared latent space using the symmetric InfoNCE contrastive loss. The methodology rigorously controls for confounds: fixed dataset splits, reproducible preprocessing (canonical background subtraction for tactile frames), strict separation of alignment and probe data, and systematic ablation studies.
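
The symmetric InfoNCE objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the temperature value of 0.07, and the assumption that row `i` of each batch forms a matched vision/tactile pair are all illustrative choices that the summary does not specify.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(z_v, z_t, temperature=0.07):
    """Symmetric InfoNCE over paired vision/tactile embeddings.

    z_v, z_t: arrays of shape (batch, dim); row i of each is a matched pair.
    temperature: assumed value, not taken from the paper.
    """
    # L2-normalise so that dot products are cosine similarities.
    z_v = z_v / np.linalg.norm(z_v, axis=1, keepdims=True)
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)

    logits = z_v @ z_t.T / temperature  # shape (batch, batch)
    idx = np.arange(len(z_v))

    # Vision-to-touch: each row should pick out its own column.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Touch-to-vision: each column should pick out its own row.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Averaging the two directions is what makes the loss symmetric: neither modality is treated as the sole query.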

Latent-Space Vision-to-Touch Generation

To address the tactile data bottleneck, TacGen introduces a residual-MLP diffusion generator that synthesizes tactile latent features directly from RGB tokens. Generation is performed in DINOv2 tactile feature space rather than at the pixel level, so the generator is optimized for downstream physical-property probe utility rather than image realism. A pixel-space U-Net diffusion model is used only as a reconstruction comparator; the 13 percentage point downstream gap between reconstruction quality and representation utility indicates that generator design should be judged by representation utility rather than nominal reconstruction quality.
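
To make the idea of latent-space generation concrete, the sketch below maps an RGB feature vector to a tactile-latent vector of the same width with a stack of residual MLP blocks. It is a deterministic forward pass only and leaves out any noise schedule or training loop, since the summary does not describe them. The layer sizes, the GELU activation, the zero initialisation of the output projection, and all function names are assumptions made for illustration.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def init_residual_mlp(dim, hidden, n_blocks, seed=0):
    """Create parameters for n_blocks residual MLP blocks (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(n_blocks):
        blocks.append({
            "w1": rng.normal(0.0, np.sqrt(2.0 / dim), size=(dim, hidden)),
            "b1": np.zeros(hidden),
            # Zero-initialised output projection: each block starts as identity.
            "w2": np.zeros((hidden, dim)),
            "b2": np.zeros(dim),
        })
    return blocks


def generate_tactile_latent(z_rgb, blocks):
    """Map RGB latents of shape (batch, dim) to tactile latents of the same shape."""
    h = z_rgb
    for p in blocks:
        # Residual update: input plus a two-layer MLP correction.
        h = h + gelu(h @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]
    return h
```

Because the output never leaves feature space, a generator of this kind can be scored directly by the downstream probes instead of by pixel reconstruction error.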

Downstream Evaluation: Physical Property Probes and Manipulation

Evaluation spans four axes: physical-property regression/classification probes (mass, density, hardness, and banded force labels), representation scaling via generated tactile latents, TACTO-based manipulation policy utility, and cross-modal generalization (YCB-Sight transfer, backbone variation, permutation controls).
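
A matched-probe comparison of the kind listed above can be sketched with a closed-form ridge regressor fitted on frozen features. The summary does not state the probe type or how vision and tactile features are combined, so the ridge probe, the L2 strength, the use of feature concatenation for V+T, and every function name here are assumptions.

```python
import numpy as np


def fit_ridge_probe(x_train, y_train, l2=1.0):
    # Append a bias column and solve the regularised normal equations.
    xb = np.hstack([x_train, np.ones((len(x_train), 1))])
    reg = l2 * np.eye(xb.shape[1])
    reg[-1, -1] = 0.0  # do not penalise the bias term
    return np.linalg.solve(xb.T @ xb + reg, xb.T @ y_train)


def predict(weights, x):
    return np.hstack([x, np.ones((len(x), 1))]) @ weights


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def probe_delta_r2(v_tr, t_tr, y_tr, v_te, t_te, y_te, l2=1.0):
    """Held-out R^2 of a V+T probe minus that of a V-only probe."""
    w_v = fit_ridge_probe(v_tr, y_tr, l2)
    w_vt = fit_ridge_probe(np.hstack([v_tr, t_tr]), y_tr, l2)
    r2_v = r2_score(y_te, predict(w_v, v_te))
    r2_vt = r2_score(y_te, predict(w_vt, np.hstack([v_te, t_te])))
    return r2_vt - r2_v
```

Both probes share the same splits, regulariser, and fitting procedure, so any difference in held-out $R^2$ reflects the features rather than the probe.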

Empirical Results

Physical-Property Probe Gains

On fixed (pre-specified, held-out) test splits, vision+tactile alignment yields substantial improvements over vision-only at constant feature and probe budget:

Mass Regression: $\Delta R^2 = +0.570$ (CI $[+0.485, +0.653]$)
Density Classification: $+0.067$ accuracy delta
Hardness Classification: $+0.117$ accuracy delta
Force Regression (uncertainty-banded): $\Delta R^2 = +0.281$

Bootstrap CIs exclude zero throughout, while tactile-permutation and label-permutation controls remain near zero, indicating that the gains are not artifacts of increased dimensionality or label distributions. The benefit persists across all five alignment seeds. Three backbone families (DINOv2, CLIP+MAE, Sparsh) replicate the effect, which argues against architecture-specific overfitting.
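
A paired bootstrap is one common way to obtain a confidence interval on such an $R^2$ difference. The sketch below resamples test examples with replacement and recomputes both scores on the same resample. The number of resamples, the percentile interval, and the function names are assumptions, since the summary does not give the exact bootstrap procedure.

```python
import numpy as np


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def bootstrap_delta_r2(y_true, pred_v, pred_vt, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for R^2(V+T) - R^2(V-only) on a shared test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        # Resample test indices once and reuse them for both models (paired).
        idx = rng.integers(0, n, size=n)
        deltas[b] = (r2_score(y_true[idx], pred_vt[idx])
                     - r2_score(y_true[idx], pred_v[idx]))
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    point = r2_score(y_true, pred_vt) - r2_score(y_true, pred_v)
    return point, lo, hi
```

A gain is reported as excluding zero when the lower bound `lo` is positive.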

Latent Tactile Generation

The TacGen latent diffusion generator achieves a cross-seed $\Delta R^2_{\text{gen}}=+0.589$ (CI $[+0.544, +0.634]$) on physical-property probes, with the protocol-matched real-tactile reference at $+0.585$ falling inside that interval. This supports synthetic tactile latents as viable evidence for contact-dependent property inference. Matched generated/real pairs are significantly more similar than random pairings. Shuffled-pair controls eliminate the gains, indicating that the benefit comes from paired vision-tactile content rather than generic regularization.
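
The matched-versus-random comparison can be illustrated with a simple permutation test on cosine similarity. This is a hedged sketch: the choice of cosine similarity as the metric, the number of permutations, and the function names are assumptions, and the summary does not say which statistic the authors used.

```python
import numpy as np


def cosine_rows(a, b):
    # Row-wise cosine similarity between two (n, dim) arrays.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def matched_vs_shuffled(gen, real, n_perm=1000, seed=0):
    """Compare mean similarity of matched pairs against shuffled pairings."""
    rng = np.random.default_rng(seed)
    matched = cosine_rows(gen, real).mean()
    shuffled = np.array([
        cosine_rows(gen, real[rng.permutation(len(real))]).mean()
        for _ in range(n_perm)
    ])
    # One-sided permutation p-value with the usual +1 correction.
    p_value = (1 + np.sum(shuffled >= matched)) / (1 + n_perm)
    return matched, shuffled.mean(), p_value
```

A small p-value means the generated latents track their own real counterparts more closely than they track arbitrary tactile samples.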

Manipulation Policy Utility

In simulation (TACTO), a behavior-cloning policy conditioned on V+T features attains a mean success rate of $0.979$, compared to $0.246$ for the matched-capacity V-only policy (a gain of $+0.733$). Scaling vision-only policy capacity through greater hidden width and more training epochs explains only $4.5\%$ of the gap, leaving $95.5\%$ unexplained by capacity. This indicates that tactile evidence, rather than model capacity, supplies the information missing for manipulation.
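
The capacity-scaling figure can be unpacked with the numbers given in the abstract ($0.246 \to 0.979$, with $4.5\%$ of the gap attributed to V-only scaling). The worked arithmetic below assumes the fraction is defined as the scaled V-only improvement divided by the total V+T gain. The implied scaled V-only success rate is derived here and is not a value reported in the source.

```latex
\begin{aligned}
\Delta_{\text{total}} &= 0.979 - 0.246 = 0.733, \\
\Delta_{\text{capacity}} &\approx 0.045 \times 0.733 \approx 0.033, \\
s_{\text{V, scaled}} &\approx 0.246 + 0.033 = 0.279 \quad \text{(implied, not reported)}, \\
\Delta_{\text{residual}} &\approx 0.955 \times 0.733 \approx 0.700 .
\end{aligned}
```

Under this reading, even the largest V-only policy would remain roughly $0.70$ below the V+T policy in success rate.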

Generalization and Robustness

YCB-Sight transfer and alternative probe protocols (cross-domain, regression/classification) confirm that the V+T advantage generalizes across corpora, probe design, and hardware. Information-theoretic controls (random feature, label permutation) register negative or near-zero gains, further verifying modality-specificity.

Theoretical and Practical Implications

The evidence demonstrates that touch is not an auxiliary or optional modality for physical-world representations where contact evidence is essential; rather, it is a necessary dimension. Scaling vision-only architectures, even with additional data and model capacity, cannot bridge the gap in contact-dependent property inference. Generated tactile latents, when appropriately aligned, can function as first-class analogs to real tactile data in downstream physical reasoning and manipulation. This finding has implications for datasets, model design, and the evaluation of embodied AI, including embodied LLMs and robotic policy learning.

In practice, the methodologies and artifacts enable:

Scalable generation and augmentation of tactile evidence for downstream learning in environments where measured tactile data is cost-prohibitive.
Deployment of visuo-tactile-aligned representations as reference baselines in benchmarks targeting contact-dependent properties, rather than treating tactile input as optional.
Application of uncertainty-banded physical label frameworks to drive robust evaluation and transfer across public visuo-tactile corpora.

Future Directions

Immediate research opportunities include scaling to additional tactile sensor families, broader real-world manipulation tasks, and more extensive integration within VLM, V-T-L, and embodied LLM frameworks for multi-modal action and perception. Improving absolute calibration of force/torque labels for large-scale paired RGB-tactile corpora will strengthen physical inference. Preliminary Qwen-based V-T-L interfaces indicate utility in explicit tactile-to-language evidence transfer; broader compositional grounding for tactile input in multimodal foundation models remains open.

Conclusion

TacGen establishes, through rigorous empirical and methodological evidence, that tactile sensing is a necessary information channel for the representation and inference of contact-dependent object properties. Synthetic tactile latents generated from vision preserve downstream utility and can be used to augment tactile evidence at scale. The results warrant treating aligned touch as a reference marker for physical-world representation in embodied AI, with direct implications for the design, evaluation, and deployment of multimodal agents and robotic systems (2606.29173).
