OmniVaT: Visual-Tactile Multimodal Framework
- OmniVaT is a multimodal framework that unifies visual and tactile data in a shared embedding–frequency space using fractional transforms.
- It features the Multimodal Fractional Fourier Adapter (MFFA) and Discrete Tree Generation (DTG) modules to achieve robust single-domain generalization.
- Extensive experiments show significant Macro-F1 improvements and enhanced cross-modal alignment, validating its practical advances in VTL.
OmniVaT is a multimodal learning framework targeting single domain generalization for visual-tactile learning (VTL), where the objective is robust perception across sensory domains without requiring data from multiple domains during training. This is achieved by mitigating both modality discrepancies between visual (VIS) and tactile (TAC) images, and domain gaps arising from heterogeneous tactile sensors and collection procedures. OmniVaT introduces two principal components: the Multimodal Fractional Fourier Adapter (MFFA), for aligning VIS and TAC embeddings within a unified embedding–frequency space, and the Discrete Tree Generation (DTG) module, which generates diverse, reliable multimodal fractional representations through a hierarchical structure. Extensive experimental evidence substantiates OmniVaT’s superior cross-domain generalization capabilities within the SDG-VTL paradigm (Qiu et al., 1 Jan 2026).
1. Mathematical Foundation: Fractional Fourier Transform (FrFT)
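The Fractional Fourier Transform generalizes the classical Fourier transform. It is parameterized by an order $p$, which geometrically “rotates” a signal between the time (embedding) domain and the frequency domain. Its continuous form is:

$$\mathcal{F}^{p}[f](u) = \int_{-\infty}^{\infty} K_p(u, t)\, f(t)\, dt$$

Here, the kernel $K_p(u, t)$ is defined as:
- For $\alpha \neq n\pi$: $K_p(u, t) = \sqrt{\dfrac{1 - j\cot\alpha}{2\pi}}\, \exp\!\left( j\,\dfrac{u^2 + t^2}{2}\cot\alpha - j\,u t \csc\alpha \right)$ with $\alpha = p\pi/2$.
- For $\alpha = 2n\pi$: $K_p(u, t) = \delta(t - u)$.
- For $\alpha = (2n+1)\pi$: $K_p(u, t) = \delta(t + u)$.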
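Discrete FrFT (DFrFT) [Candan et al. 2000] uses Hermite–Gaussian eigenvectors:

$$\mathbf{F}^{p} = \mathbf{V}\, \boldsymbol{\Lambda}^{p}\, \mathbf{V}^{\top}$$

where $\mathbf{F} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top}$ is the DFT matrix, with $\mathbf{V}$ containing its Hermite–Gaussian eigenvectors and $\boldsymbol{\Lambda}$ the diagonal matrix of eigenvalues. Varying the order $p$ from 0 to 1 interpolates smoothly between the input embedding and its Fourier spectrum.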
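The eigendecomposition above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it assumes the commonly used construction in which the eigenvectors come from a nearly tridiagonal matrix that commutes with the DFT, split into even and odd subspaces. The function name `dfrft_matrix` and the restriction to sizes of at least 3 are choices made for this example.

```python
import numpy as np

def dfrft_matrix(n: int, p: float) -> np.ndarray:
    """Sketch of an n x n discrete fractional Fourier matrix of order p (n >= 3)."""
    idx = np.arange(n)
    # Real symmetric matrix S that commutes with the unitary DFT matrix.
    s = np.diag(2.0 * np.cos(2.0 * np.pi * idx / n))
    s[idx, (idx + 1) % n] += 1.0
    s[idx, (idx - 1) % n] += 1.0

    # Orthonormal bases of the even and odd subspaces (x[k] = +/- x[n - k]).
    e = np.eye(n)
    even_cols, odd_cols = [e[:, 0]], []
    for k in range(1, (n - 1) // 2 + 1):
        even_cols.append((e[:, k] + e[:, n - k]) / np.sqrt(2.0))
        odd_cols.append((e[:, k] - e[:, n - k]) / np.sqrt(2.0))
    if n % 2 == 0:
        even_cols.append(e[:, n // 2])

    def sorted_eigvecs(cols):
        # Eigenvectors of S restricted to one subspace, largest eigenvalue first.
        basis = np.stack(cols, axis=1)
        _, vecs = np.linalg.eigh(basis.T @ s @ basis)
        return basis @ vecs[:, ::-1]

    v_even, v_odd = sorted_eigvecs(even_cols), sorted_eigvecs(odd_cols)
    vectors = np.concatenate([v_even, v_odd], axis=1)
    # Even eigenvectors take Hermite orders 0, 2, 4, ...; odd ones take 1, 3, 5, ...
    orders = np.concatenate([2 * np.arange(v_even.shape[1]),
                             2 * np.arange(v_odd.shape[1]) + 1])
    eigvals = np.exp(-1j * np.pi / 2.0 * p * orders)
    return (vectors * eigvals) @ vectors.T

# Example: a half-order transform of a random 8-dimensional embedding.
x = np.random.default_rng(0).normal(size=8)
x_half = dfrft_matrix(8, 0.5) @ x
```

With this construction, order 0 gives the identity and order 1 gives the unitary DFT, so intermediate orders sit between the two representations.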
2. Multimodal Fractional Fourier Adapter (MFFA): Architecture
MFFA serves as the bridge across modalities, aligning visual, tactile, and language features in a unified embedding–frequency space through fractional domain projections coupled with cross-modal attention.
Inputs:
- $E^v$ (Visual embedding)
- $E^t$ (Tactile embedding)
- $E^l$ (Language prompt embedding)
2.1 Language-Guided FrFT Processing
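The language embedding is first expanded and projected by the FrFT:

$$F^l = \sigma\big(\Re[\mathcal{F}^{p}(\theta_{e,l} E^l)]\big) + j\,\sigma\big(\Im[\mathcal{F}^{p}(\theta_{e,l} E^l)]\big)$$

where $\theta_{e,l}$ are learned expansion weights and $\sigma$ denotes ReLU activation, applied separately to the real and imaginary parts.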
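The expand, transform, and activate steps can be sketched as follows. This is an illustration under stated assumptions: the names `split_relu`, `language_frft_features`, and `unitary_fft` are hypothetical, the weight shapes are arbitrary, and the unitary FFT stands in for the order-1 special case of the transform so the block stays self-contained.

```python
import numpy as np

def split_relu(z: np.ndarray) -> np.ndarray:
    """Apply ReLU separately to the real and imaginary parts of a complex array."""
    return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

def unitary_fft(x: np.ndarray) -> np.ndarray:
    """Stand-in transform (assumption): the order-1 case, a unitary FFT over features."""
    return np.fft.fft(x, axis=-1, norm="ortho")

def language_frft_features(e_l, theta, transform=unitary_fft):
    """Expand language embeddings, transform them, then apply the split ReLU."""
    expanded = e_l @ theta  # (batch, d) @ (d, d_expanded)
    return split_relu(transform(expanded))

# Example with made-up sizes: 4 prompts, 6-dim embeddings expanded to 12 dims.
rng = np.random.default_rng(0)
features = language_frft_features(rng.normal(size=(4, 6)), rng.normal(size=(6, 12)))
```

Any fractional-order transform with the same input and output shapes could be passed as `transform` in place of the stand-in.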
2.2 Fractional Fourier Attention (FrATT)
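A global class token $F^g_l$ is aggregated over the batch as a class-wise mean of $F^l$. Fractional attention operates as:

$$\bar{F}^l = \mathrm{FrATT}\big(F^l,\; F^g_l \oplus F^l\big)$$

where $\oplus$ denotes concatenation and the projections inside $\mathrm{FrATT}$ are learned.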
2.3 Language-Guided VIS/TAC Fractional Projection
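Refined language features guide the projection of VIS/TAC embeddings:

$$F^v = \sigma\big(\Re[\mathcal{F}^{p}(\theta_{e,v}(\bar{F}^l + E^v))]\big) + j\,\sigma\big(\Im[\mathcal{F}^{p}(\theta_{e,v}(\bar{F}^l + E^v))]\big), \qquad \bar{F}^v = \mathrm{FrATT}\big(\bar{F}^l,\; F^g_v \oplus F^v\big)$$

- $\bar{F}^t$ is constructed identically from $E^t$, via shared weights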
2.4 Modality-Alignment Loss
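To align modalities in the fractional space, OmniVaT minimizes a KL divergence loss:

$$\mathcal{L}_{mma} = \lambda \left[ \mathrm{KL}\big(\bar{F}^l \,\|\, \bar{F}^v\big) + \mathrm{KL}\big(\bar{F}^l \,\|\, \bar{F}^t\big) \right]$$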
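A minimal sketch of such an alignment loss follows. The source does not say how complex-valued features are turned into probability distributions, so this example assumes a softmax over feature magnitudes; that choice, the function names, and the default weight are assumptions made for illustration.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-shift for numerical stability."""
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Batch mean of KL(p || q) for row-wise distributions."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def modality_alignment_loss(f_l, f_v, f_t, lam: float = 1.0) -> float:
    """lam * [KL(lang || vis) + KL(lang || tac)] on magnitude-softmax distributions."""
    # Assumption: complex features are mapped to distributions via softmax(|F|).
    p_l, p_v, p_t = (softmax(np.abs(f)) for f in (f_l, f_v, f_t))
    return lam * (kl_divergence(p_l, p_v) + kl_divergence(p_l, p_t))

# Example with random complex features for 3 samples and 5 dimensions.
rng = np.random.default_rng(1)
f_l = rng.normal(size=(3, 5)) + 1j * rng.normal(size=(3, 5))
f_v = rng.normal(size=(3, 5)) + 1j * rng.normal(size=(3, 5))
loss = modality_alignment_loss(f_l, f_v, f_l)
```

The language features act as the reference distribution in both terms, so the loss is zero only when the visual and tactile distributions both match it.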
2.5 Choice of Fractional Order
In all main experiments, $p = 0.5$ (the midpoint between the original embedding at $p = 0$ and the full spectrum at $p = 1$) gives the best results, as validated by a sweep over candidate orders. The order $p$ is treated as a hyperparameter.
2.6 MFFA Forward Pass Pseudocode
```
# Language branch
F^l_real, F^l_imag = FrFT_p(θ_e,l · E^l)
F^l = ReLU(F^l_real) + j · ReLU(F^l_imag)
F^g_l = class_mean(F^l)
¯F^l = FrATT(F^l, F^g_l ⊕ F^l)

# Visual branch, guided by the refined language features
tmp^v = θ_e,v · (¯F^l + E^v)
F^v_real, F^v_imag = FrFT_p(tmp^v)
F^v = ReLU(F^v_real) + j · ReLU(F^v_imag)
F^g_v = class_mean(F^v)
¯F^v = FrATT(¯F^l, F^g_v ⊕ F^v)

# Tactile branch: ¯F^t is computed like ¯F^v, with E^t in place of E^v (shared weights)

# Modality-alignment loss
L_mma = λ · [ KL(¯F^l‖¯F^v) + KL(¯F^l‖¯F^t) ]
```
3. Framework Integration and Discrete Tree Generation (DTG)
OmniVaT consists of:
- Frozen CLIP-pretrained visual, tactile, and language encoders
- MFFA module with shared weights for VIS/TAC
- DTG module performing domain augmentation by hierarchical tree feature generation
- Linear classifier with cross-entropy supervision
Forward Pass Overview:
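- Extract $E^v$, $E^t$, and $E^l$ via the CLIP backbones.
- Apply MFFA to yield $\bar{F}^v$, $\bar{F}^t$, and $\bar{F}^l$, and compute $\mathcal{L}_{mma}$.
- Concatenate all three features (VIS/TAC/language) into the root tree feature.
- DTG constructs a binary tree of depth 3, generating augmented features as child nodes.
- Compute the node-diversity loss from the cosine similarity among the node features.
- Fuse the features into the final VIS representation, and analogously for TAC.
- Classify with a linear head under the cross-entropy loss $\mathcal{L}_{CE}$.
Total loss: combines $\mathcal{L}_{CE}$, $\mathcal{L}_{mma}$, and the node-diversity loss.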
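A node-diversity term based on cosine similarity could look like the sketch below. The exact form of the loss and its weight in the total objective are not recoverable from the text, so this example assumes the mean pairwise cosine similarity between node features and a plain weighted sum. The names `node_diversity_loss`, `total_loss`, and `w_div` are hypothetical.

```python
import numpy as np

def node_diversity_loss(nodes: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of node features (rows)."""
    unit = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
    sim = unit @ unit.T
    off_diagonal = ~np.eye(nodes.shape[0], dtype=bool)  # drop self-similarities
    return float(sim[off_diagonal].mean())

def total_loss(ce: float, mma: float, div: float, w_div: float = 1.0) -> float:
    """Assumed combination: cross-entropy + alignment + weighted diversity term."""
    return ce + mma + w_div * div

# Example: identical nodes are maximally similar, orthogonal nodes are not.
collapsed = node_diversity_loss(np.ones((4, 3)))  # close to 1.0
diverse = node_diversity_loss(np.eye(3))          # close to 0.0
```

Minimizing a term of this kind pushes the generated nodes apart, which is one way to read the role of diversity in the tree.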
4. Implementation Regimen
- Backbones: Frozen CLIP ResNet-50, ViT-B/16, ViT-L/14
- MFFA hyperparameters: Fractional order $p$, expansion size, alignment weight $\lambda$
- DTG hyperparameters: Tree depth 3
- Optimization: SGD, learning rate 0.05, cosine warm-up, momentum 0.9, 20 epochs
- Batch sizes: 16 paired VIS-TAC samples, 16 prompt texts
- Objective: Combined loss
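The learning-rate settings above can be illustrated with a small schedule function. The text only says "cosine warm-up", so the shape used here (linear warm-up followed by cosine decay) and the two-epoch warm-up length are assumptions; only the base rate of 0.05 and the 20 epochs come from the list.

```python
import math

def lr_at_epoch(epoch: int, base_lr: float = 0.05, total_epochs: int = 20,
                warmup_epochs: int = 2) -> float:
    """Linear warm-up to base_lr, then cosine decay towards zero (assumed shape)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Learning rate for each of the 20 epochs.
schedule = [lr_at_epoch(e) for e in range(20)]
```

The rate peaks at the base value once warm-up ends and then falls smoothly for the remaining epochs.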
5. Experimental Validation
Ablation analysis on the “TAG→X” protocol using ViT-B/16 reveals:
| Model Variant | ACC (%) | Macro-F1 (%) | F1 Improvement (pp) |
|---|---|---|---|
| Baseline (PromptStyler + CE only) | 51.7 | 40.6 | – |
| +MFFA (w/o $\mathcal{L}_{mma}$) | 54.5 | 51.2 | +10.6 |
| +MFFA (with $\mathcal{L}_{mma}$) | 54.8 | 52.5 | +11.9 |
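Macro-F1 in the table averages per-class F1 scores with equal weight, so it can move differently from accuracy when classes are imbalanced. The sketch below shows the standard computation on made-up labels; the function name and the toy data are illustrative and not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

# Toy example: one of the two minority-class samples is misclassified.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 5/6 while Macro-F1 is lower (7/9), because the error lands entirely on the rarer class.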
Cosine-margin analysis on unseen domains reports:
| Method | Margin |
|---|---|
| LDC | ≈0.04 |
| OmniVaT (MFFA+DTG) | ≈0.17 |
These experiments show that the full MFFA variant improves Macro-F1 by 11.9 percentage points over the baseline (40.6% to 52.5%), which is attributed to improved cross-modal alignment in fractional space. On unseen domains, the cosine margin of OmniVaT is roughly four times that of LDC (≈0.17 versus ≈0.04) (Qiu et al., 1 Jan 2026).
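The headline numbers can be re-derived from the two tables with a few lines of arithmetic. The values below are copied from the tables; the variable names are chosen for this example.

```python
# Macro-F1 (%) in table order: baseline, then the two +MFFA rows.
macro_f1_scores = [40.6, 51.2, 52.5]
gains_pp = [round(score - macro_f1_scores[0], 1) for score in macro_f1_scores[1:]]

# Approximate cosine margins on unseen domains: LDC versus OmniVaT.
margin_ldc, margin_omnivat = 0.04, 0.17
margin_ratio = margin_omnivat / margin_ldc
```

The gains match the last column of the ablation table (10.6 and 11.9 percentage points), and the margin ratio is about 4.25.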
6. Limitations, Assumptions, and Prospects
- Training-only Language Prompts: Language information (class labeling) enters exclusively during training; at test time, MFFA cannot exploit textual cues.
- Fractional Order Hyperparameterization: The fractional order is set by cross-validation, not learned end-to-end.
- Modality Restriction: OmniVaT presently ingests image-based tactile data; vibration or force modalities are not included.
- Prospective Extensions: Future work intends to incorporate vibrotactile signals and enable end-to-end learning of the fractional order.
A plausible implication is that the MFFA paradigm could generalize to other cross-modal or sensor fusion settings, provided suitable embedding–frequency mappings. OmniVaT marks an initial step for robust single-domain generalization in visual–tactile multimodal learning, integrating advanced spectral methods and structured augmentation.