OmniVaT: Visual-Tactile Multimodal Framework

Updated 8 January 2026

OmniVaT is a multimodal framework that unifies visual and tactile data in a shared embedding–frequency space using fractional transforms.
It features the Multimodal Fractional Fourier Adapter (MFFA) and Discrete Tree Generation (DTG) modules to achieve robust single-domain generalization.
Ablation experiments report a Macro-F1 gain of about 12 percentage points over the baseline and improved cross-modal alignment in visual-tactile learning (VTL).

OmniVaT is a multimodal learning framework targeting single domain generalization (SDG) for visual-tactile learning (VTL), where the objective is robust perception across sensory domains without requiring data from multiple domains during training. It pursues this by mitigating two gaps: the modality discrepancy between visual (VIS) and tactile (TAC) images, and the domain gap arising from heterogeneous tactile sensors and collection procedures. OmniVaT introduces two principal components: the Multimodal Fractional Fourier Adapter (MFFA), which aligns VIS and TAC embeddings within a unified embedding–frequency space, and the Discrete Tree Generation (DTG) module, which generates diverse, reliable multimodal fractional representations through a hierarchical structure. The reported experiments indicate improved cross-domain generalization within the SDG-VTL setting (Qiu et al., 1 Jan 2026).

1. Mathematical Foundation: Fractional Fourier Transform (FrFT)

The Fractional Fourier Transform (FrFT) generalizes the classical Fourier transform. It is parameterized by an order $p = \alpha/(\pi/2) \in \mathbb{R}$, where $\alpha$ is a rotation angle: the transform geometrically “rotates” a signal between the time (embedding) domain and the frequency domain. Its continuous form is:

$\operatorname{FrFT}_p(E(u_0)) = \int_{-\infty}^{+\infty} K_p(u_0,u_p) E(u_0) du_0$

Here, $K_p(u_0,u_p)$ is defined as:

For $\alpha \neq n\pi$: $K_p(u_0,u_p) = A_\alpha \exp\left[j\left(\tfrac{1}{2} u_0^2 \cot\alpha - u_0 u_p \csc\alpha + \tfrac{1}{2} u_p^2 \cot\alpha\right)\right]$, with $A_\alpha = \sqrt{(1 - j\cot\alpha)/(2\pi)}$.
For $\alpha = 2n\pi$: $K_p = \delta(u_0 - u_p)$.
For $\alpha = (2n+1)\pi$: $K_p = \delta(u_0 + u_p)$.
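
The first case of the kernel can be checked numerically. The following is a minimal sketch, not code from the paper: the function name `frft_kernel` is hypothetical, and the delta cases ($\alpha = n\pi$) are rejected rather than represented. It confirms that at $p = 1$ ($\alpha = \pi/2$) the kernel reduces to the ordinary Fourier kernel $e^{-j u_0 u_p}/\sqrt{2\pi}$.

```python
import cmath
import math


def frft_kernel(p, u0, up):
    """Continuous FrFT kernel K_p(u0, up) for alpha = p*pi/2 not a multiple of pi."""
    alpha = p * math.pi / 2
    s = math.sin(alpha)
    if abs(s) < 1e-12:
        # alpha = n*pi: the kernel is a Dirac delta, which has no pointwise value.
        raise ValueError("alpha is a multiple of pi: the kernel is a Dirac delta")
    cot = math.cos(alpha) / s
    csc = 1.0 / s
    amp = cmath.sqrt((1 - 1j * cot) / (2 * math.pi))
    phase = 0.5 * u0 ** 2 * cot - u0 * up * csc + 0.5 * up ** 2 * cot
    return amp * cmath.exp(1j * phase)


# At p = 1 the kernel should match the classical Fourier kernel.
k = frft_kernel(1.0, 0.3, 0.7)
fourier = cmath.exp(-1j * 0.3 * 0.7) / math.sqrt(2 * math.pi)
print(abs(k - fourier) < 1e-12)  # prints True
```

Because the phase term is symmetric in $u_0$ and $u_p$, the kernel is symmetric as well, so swapping the two arguments leaves the value unchanged up to rounding.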

The discrete FrFT (DFrFT) [Candan et al. 2000] is built from Hermite–Gaussian eigenvectors of the DFT matrix:

$E(u_p)[m] = \sum_n F_p[m,n] E(u_0)[n], \quad m,n = 1 \ldots D$

where $F_p = V \Lambda^p V^T$, with the columns of $V$ holding the eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues. Varying $p$ from 0 to 1 interpolates smoothly between the input embedding ($p = 0$, the identity) and its Fourier spectrum ($p = 1$).
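
The idea of raising the DFT eigenvalues to a fractional power can be illustrated without an eigensolver. The sketch below is an assumption-laden stand-in, not the Hermite–Gaussian construction of Candan et al.: it uses the simpler weighted-type fractional DFT, which exploits the fact that the unitary DFT matrix $F$ has only four distinct eigenvalues ($1, -j, -1, j$) and writes $F^p$ as a weighted sum of $I, F, F^2, F^3$. Its outputs at fractional $p$ generally differ from the Candan DFrFT, but it shares the properties stated above: $p = 0$ is the identity, $p = 1$ is the DFT, and orders add. The function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Unitary DFT of a sequence (O(D^2), adequate for a small demo)."""
    d = len(x)
    scale = 1.0 / math.sqrt(d)
    return [
        scale * sum(x[n] * cmath.exp(-2j * math.pi * m * n / d) for n in range(d))
        for m in range(d)
    ]


def weighted_frft(x, p):
    """Weighted-type fractional DFT: sum over m = 0..3 of a_m(p) * F^m x."""
    # Integer powers F^0 x, F^1 x, F^2 x, F^3 x.
    powers = [list(x)]
    for _ in range(3):
        powers.append(dft(powers[-1]))
    # a_m(p) = (1/4) * sum_k exp(j*pi*k*(m - p)/2), from the four spectral projectors.
    coeffs = [
        sum(cmath.exp(1j * math.pi * k * (m - p) / 2) for k in range(4)) / 4
        for m in range(4)
    ]
    return [sum(coeffs[m] * powers[m][i] for m in range(4)) for i in range(len(x))]


x = [1.0, 2.0, 0.0, -1.0, 0.5]
half_twice = weighted_frft(weighted_frft(x, 0.5), 0.5)
full = dft(x)
# Two half-order transforms compose to one full DFT (order additivity).
print(max(abs(a - b) for a, b in zip(half_twice, full)) < 1e-9)  # prints True
```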

2. Multimodal Fractional Fourier Adapter (MFFA): Architecture

MFFA serves as the bridge across modalities, aligning visual, tactile, and language features in a unified embedding–frequency space through fractional domain projections coupled with cross-modal attention.

Inputs:

$E^v \in \mathbb{R}^{1\times D}$ (Visual embedding)
$E^t \in \mathbb{R}^{1\times D}$ (Tactile embedding)
$E^\ell \in \mathbb{R}^{1\times D}$ (Language prompt embedding)

2.1 Language-Guided FrFT Processing

The language embedding is first expanded and projected by the FrFT:

$F^\ell = U(\operatorname{Re}\{\operatorname{FrFT}_p(\theta_{e,\ell} E^\ell)\}) + j U(\operatorname{Im}\{\operatorname{FrFT}_p(\theta_{e,\ell} E^\ell)\})$

where $\theta_{e,\ell} \in \mathbb{R}^{E \times 1}$ are learned expansion weights and $U(\cdot)$ denotes the ReLU activation, applied separately to the real and imaginary parts.
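
A minimal sketch of this step follows, under stated assumptions: the expansion $\theta_{e,\ell} E^\ell$ is read as an outer product giving an $E \times D$ array, the fractional transform is passed in as a callable (`transform`, hypothetical) applied to each row, and all function names are illustrative rather than taken from the paper.

```python
def expand(theta, e):
    """Outer product of expansion weights (length E) and an embedding (length D)."""
    return [[t * v for v in e] for t in theta]


def split_relu(z):
    """ReLU applied separately to the real and imaginary parts of a complex number."""
    return complex(max(z.real, 0.0), max(z.imag, 0.0))


def frft_processing(theta, e, transform):
    """Expand, apply a (fractional) transform per row, then split-ReLU."""
    rows = expand(theta, e)
    return [[split_relu(z) for z in transform(row)] for row in rows]


# Toy stand-in for FrFT_p that simply produces complex outputs.
def toy_transform(row):
    return [complex(v, -v) for v in row]


out = frft_processing([1.0, -1.0], [2.0, -3.0], toy_transform)
print(out)  # [[(2+0j), 3j], [2j, (3+0j)]]
```

Any row-wise transform with complex output can be substituted for `toy_transform`; negative real or imaginary components are zeroed independently, so the phase of each entry is confined to the first quadrant.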

2.2 Fractional Fourier Attention (FrATT)

A global class token $F^{g\ell} = \operatorname{mean}_\text{class}(F^\ell)$ is aggregated over the batch. Fractional attention operates as:

$\bar{F}^\ell = \text{FrATT}(F^\ell, F^{g\ell} \oplus F^\ell) = \operatorname{FFN}\left(\operatorname{Mean}\left(\operatorname{Softmax}\left(\frac{\theta_Q F^\ell \cdot (\theta_K(F^{g\ell} \oplus F^\ell))^T}{\sqrt{D}}\right) \cdot \theta_V(F^{g\ell} \oplus F^\ell)\right)\right)$

where $\theta_Q, \theta_K, \theta_V \in \mathbb{R}^{D \times D}$ denote learned projections.
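
The attention formula can be sketched in plain Python as follows. Several points here are assumptions, since the text above does not specify them: $\oplus$ is taken to be concatenation along the token axis, the inputs are real-valued (how the softmax treats complex scores is not stated), projections are applied as right-multiplications, and the FFN is omitted. All names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def fr_att(queries, context, w_q, w_k, w_v):
    """Mean over query rows of softmax(Q K^T / sqrt(D)) V (FFN omitted)."""
    d = len(queries[0])
    q = matmul(queries, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)
    return [sum(row[i] for row in attended) / len(attended) for i in range(d)]


eye = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
global_token = [[0.5, 0.5]]            # mean of the feature rows
context = global_token + features      # assumed meaning of the concatenation
out = fr_att(features, context, eye, eye, eye)
print([round(v, 6) for v in out])  # [0.5, 0.5]
```

With identity projections the output is a convex combination of the context rows, which is why the symmetric toy input returns the global token itself.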

2.3 Language-Guided VIS/TAC Fractional Projection

Refined language features $\bar{F}^\ell$ guide the projection of VIS/TAC embeddings:

$F^v = \operatorname{FrFTProcessing}(\theta_{e,v} (\bar{F}^\ell + E^v))$
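$\bar{F}^v = \operatorname{FrATT}(\bar{F}^\ell, F^{gv} \oplus F^v)$
$F^t$ and $\bar{F}^t$ are constructed identically from $E^t$, using the same (shared) weights. Here $\operatorname{FrFTProcessing}$ denotes the expansion, FrFT, and split-ReLU steps of Section 2.1, and $F^{gv}$ is the global class token of $F^v$.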

2.4 Modality-Alignment Loss

To align modalities in the fractional space, OmniVaT minimizes a KL divergence loss:

$L_{mma} = \lambda \left[ \operatorname{KL}(\bar{F}^\ell \,\|\, \bar{F}^v) + \operatorname{KL}(\bar{F}^\ell \,\|\, \bar{F}^t) \right]$
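
As a rough illustration of this loss, the sketch below assumes that each feature vector is real-valued and is turned into a probability distribution with a softmax before the KL divergence is taken. That normalization is an assumption made here for the example; the text above does not specify it. The function names are hypothetical, and the default $\lambda = 10$ is the alignment weight reported in the implementation settings.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def modality_alignment_loss(f_lang, f_vis, f_tac, lam=10.0):
    """lam * [KL(lang || vis) + KL(lang || tac)] on softmax-normalized features."""
    p_l, p_v, p_t = softmax(f_lang), softmax(f_vis), softmax(f_tac)
    return lam * (kl(p_l, p_v) + kl(p_l, p_t))


aligned = modality_alignment_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
misaligned = modality_alignment_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(aligned == 0.0, misaligned > 0.0)  # prints True True
```

The language features appear as the first argument of both KL terms, so they act as the reference distribution that the visual and tactile features are pulled toward.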

2.5 Choice of Fractional Order $p$

In all main experiments, $p = 0.5$ (the midpoint between the original embedding and its full spectrum) gives the best results, as determined by a sweep over $p \in \{0, 0.25, 0.5, 0.75, 1\}$. The order $p$ is treated as a fixed hyperparameter rather than a learned parameter.

2.6 MFFA Forward Pass Pseudocode

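# Language branch: expand, FrFT, split-ReLU, then fractional attention
F^l_real, F^l_imag = FrFT_p(θ_e,l · E^l)
F^l = ReLU(F^l_real) + j · ReLU(F^l_imag)
F^g_l = class_mean(F^l)
¯F^l = FrATT(F^l, F^g_l ⊕ F^l)
# Visual branch, guided by the refined language features ¯F^l
tmp^v = θ_e,v · (¯F^l + E^v)
F^v_real, F^v_imag = FrFT_p(tmp^v)
F^v = ReLU(F^v_real) + j · ReLU(F^v_imag)
F^g_v = class_mean(F^v)
¯F^v = FrATT(¯F^l, F^g_v ⊕ F^v)
# Tactile branch: repeat the visual branch with E^t and the shared weights to obtain ¯F^t
# Modality-alignment loss
L_mma = λ · [ KL(¯F^l‖¯F^v) + KL(¯F^l‖¯F^t) ]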

3. Framework Integration and Discrete Tree Generation (DTG)

OmniVaT consists of:

Frozen CLIP-pretrained visual, tactile, and language encoders
MFFA module with shared weights for VIS/TAC
DTG module performing domain augmentation by hierarchical tree feature generation
Linear classifier with cross-entropy supervision

Forward Pass Overview:

Extract $E^v, E^t, E^\ell$ via CLIP backbones.
Apply MFFA to yield $\bar{F}^v, \bar{F}^t, \bar{F}^\ell$, and compute $L_{mma}$.
Concatenate all features (VIS/TAC/language) into the root tree feature $T^{(1)}$.
DTG constructs a binary tree of depth 3, generating augmented features as child nodes: $T^{(r+1)}_{m,n} = G(T^{(r)}_m, W^{(r)}_{m,n})$.
Compute the node-diversity loss $L_{nod} = \sum_r \|A^{(r)} - I\|_F$, where $A^{(r)}_{i,j}$ is the cosine similarity between $T^{(r)}_i$ and $T^{(r)}_j$.
Final feature fusion: $\hat{F}^v = \operatorname{mean}(T^{leaf}) + \bar{F}^v$, and analogously for TAC.
Classification by linear head; CE loss $L_\mathrm{CE}$ .

Total loss: $L = L_{mma} + L_{nod} + L_\mathrm{CE}$.
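
The node-diversity term in the steps above can be sketched as follows. This is a minimal illustration with hypothetical function names: each tree level is represented as a list of real-valued node feature vectors, and the generator $G$ and its weights $W$ are not modeled. The penalty is zero when the nodes at a level are mutually orthogonal and grows as they become more similar.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def level_penalty(nodes):
    """Frobenius norm of (A - I), where A is the cosine-similarity matrix of the nodes."""
    total = 0.0
    for i, node_i in enumerate(nodes):
        for j, node_j in enumerate(nodes):
            target = 1.0 if i == j else 0.0
            total += (cosine(node_i, node_j) - target) ** 2
    return math.sqrt(total)


def node_diversity_loss(levels):
    """Sum of per-level penalties over all tree levels r."""
    return sum(level_penalty(nodes) for nodes in levels)


diverse = [[1.0, 0.0], [0.0, 1.0]]     # orthogonal siblings
collapsed = [[1.0, 2.0], [1.0, 2.0]]   # identical siblings
print(node_diversity_loss([diverse]))                 # 0.0
print(round(node_diversity_loss([collapsed]), 6))     # 1.414214
```

Two identical siblings give off-diagonal similarities of 1, so the penalty for that level is $\sqrt{2}$; minimizing the term therefore pushes sibling features apart.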

4. Implementation Regimen

Backbones: Frozen CLIP ResNet-50, ViT-B/16, ViT-L/14 ($D = 512, 768, 1024$)
MFFA hyperparameters: Fractional order $p = 0.5$, expansion size $E = 4$, alignment weight $\lambda = 10$
DTG hyperparameters: Tree depth $R = 3$
Optimization: SGD, learning rate 0.05, cosine warm-up, momentum 0.9, 20 epochs
Batch sizes: 16 paired VIS-TAC samples, 16 prompt texts
Objective: Combined loss $L = L_{mma} + L_{nod} + L_\mathrm{CE}$

5. Experimental Validation

Ablation analysis on the “TAG→X” protocol using ViT-B/16 reveals:

Model Variant	ACC (%)	Macro-F1 (%)	F1 Improvement (pp)
Baseline (PromptStyler + CE only)	51.7	40.6	–
+MFFA (w/o $L_{mma}$)	54.5	51.2	+10.6
+MFFA (with $L_{mma}$)	54.8	52.5	+11.9

Cosine-margin analysis on unseen domains reports:

Method	Margin
LDC	≈0.04
OmniVaT (MFFA+DTG)	≈0.17

These experiments indicate that MFFA yields a Macro-F1 improvement of about 12 percentage points (11.9 pp) over the baseline, which is attributed to improved cross-modal alignment in fractional space. On unseen domains, the cosine margin rises from ≈0.04 to ≈0.17, roughly a fourfold increase (Qiu et al., 1 Jan 2026).
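
The figures quoted above follow directly from the two tables. The short check below recomputes them; the margins are reported only approximately (≈0.04 and ≈0.17), so the ratio should be read as indicative rather than exact.

```python
baseline_f1 = 40.6
variants = {"+MFFA (w/o L_mma)": 51.2, "+MFFA (with L_mma)": 52.5}

# Macro-F1 gains over the baseline, in percentage points.
gains = {name: round(f1 - baseline_f1, 1) for name, f1 in variants.items()}
print(gains)  # {'+MFFA (w/o L_mma)': 10.6, '+MFFA (with L_mma)': 11.9}

# Ratio of cosine margins on unseen domains (OmniVaT vs. LDC).
margin_ratio = 0.17 / 0.04
print(round(margin_ratio, 2))  # 4.25
```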

6. Limitations, Assumptions, and Prospects

Training-only Language Prompts: Language information (class labeling) enters exclusively during training; at test time, MFFA cannot exploit textual cues.
Fractional Order Hyperparameterization: The fractional order $p$ is set by cross-validation, not learned end-to-end.
Modality Restriction: OmniVaT presently ingests image-based tactile data; vibration or force modalities are not included.
Prospective Extensions: Future work intends to incorporate vibrotactile signals and enable end-to-end learning of the fractional order.

A plausible implication is that the MFFA paradigm could generalize to other cross-modal or sensor fusion settings, provided suitable embedding–frequency mappings. OmniVaT marks an initial step for robust single-domain generalization in visual–tactile multimodal learning, integrating advanced spectral methods and structured augmentation.

References (1)

OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning (2026)
