
OmniVaT: Visual-Tactile Multimodal Framework

Updated 8 January 2026
  • OmniVaT is a multimodal framework that unifies visual and tactile data in a shared embedding–frequency space using fractional transforms.
  • It features the Multimodal Fractional Fourier Adapter (MFFA) and Discrete Tree Generation (DTG) modules to achieve robust single-domain generalization.
  • Reported experiments show a Macro-F1 gain of roughly 12 percentage points over the baseline and a larger cosine margin on unseen domains, supporting its cross-modal alignment approach for VTL.

OmniVaT is a multimodal learning framework targeting single-domain generalization (SDG) for visual-tactile learning (VTL), where the objective is robust perception across sensory domains without requiring data from multiple domains during training. It does so by mitigating two problems: modality discrepancies between visual (VIS) and tactile (TAC) images, and domain gaps arising from heterogeneous tactile sensors and collection procedures. OmniVaT introduces two principal components: the Multimodal Fractional Fourier Adapter (MFFA), which aligns VIS and TAC embeddings within a unified embedding–frequency space, and the Discrete Tree Generation (DTG) module, which generates diverse, reliable multimodal fractional representations through a hierarchical structure. The reported experiments show improved cross-domain generalization within the SDG-VTL paradigm (Qiu et al., 1 Jan 2026).

1. Mathematical Foundation: Fractional Fourier Transform (FrFT)

The Fractional Fourier Transform generalizes the classical Fourier transform. It is parameterized by an order $p = \alpha/(\pi/2) \in \mathbb{R}$ and geometrically “rotates” signals between the time (embedding) and frequency domains. Its continuous form is:

$$\operatorname{FrFT}_p(E(u_0)) = \int_{-\infty}^{+\infty} K_p(u_0,u_p)\, E(u_0)\, du_0$$

Here, $K_p(u_0,u_p)$ is defined as:

  • For $\alpha \neq n\pi$: $K_p(u_0,u_p) = A_\alpha \exp\left[j\left(\tfrac{1}{2} u_0^2 \cot\alpha - u_0 u_p \csc\alpha + \tfrac{1}{2} u_p^2 \cot\alpha\right)\right]$ with $A_\alpha = \sqrt{(1 - j\cot\alpha)/(2\pi)}$.
  • For $\alpha = 2n\pi$: $K_p = \delta(u_0 - u_p)$.
  • For $\alpha = (2n+1)\pi$: $K_p = \delta(u_0 + u_p)$.
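The generic case of the kernel can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `frft_kernel` is an illustrative choice, and the delta-function cases ($\alpha$ a multiple of $\pi$) are rejected rather than modelled.

```python
import cmath
import math


def frft_kernel(u0, up, p):
    """Continuous FrFT kernel K_p(u0, up) for alpha = p * pi / 2 not a multiple of pi."""
    alpha = p * math.pi / 2
    s = math.sin(alpha)
    if abs(s) < 1e-12:
        # alpha = n * pi: the kernel degenerates to a Dirac delta (not representable here)
        raise ValueError("alpha is a multiple of pi; kernel is a delta function")
    cot = math.cos(alpha) / s
    csc = 1.0 / s
    amplitude = cmath.sqrt((1 - 1j * cot) / (2 * math.pi))
    phase = 0.5 * u0**2 * cot - u0 * up * csc + 0.5 * up**2 * cot
    return amplitude * cmath.exp(1j * phase)


# At p = 1 (alpha = pi/2) the kernel reduces to the ordinary Fourier kernel.
k = frft_kernel(0.3, 1.2, 1.0)
fourier = cmath.exp(-1j * 0.3 * 1.2) / math.sqrt(2 * math.pi)
print(abs(k - fourier) < 1e-12)
```

At $p = 1$ both cotangent terms vanish and $\csc\alpha = 1$, which is why the check against $e^{-j u_0 u_p}/\sqrt{2\pi}$ holds.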

Discrete FrFT (DFrFT) [Candan et al. 2000] uses Hermite–Gaussian eigenvectors:

$$E(u_p)[m] = \sum_{n} F_p[m,n]\, E(u_0)[n], \quad m,n = 1, \ldots, D$$

where $F_p = V \Lambda^p V^T$, with $V$ containing eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues. Varying $p$ from $0$ to $1$ interpolates smoothly between the input embedding and its Fourier spectrum.
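The idea of raising the DFT matrix to a fractional power can be illustrated with a simpler construction than the Hermite–Gaussian one cited above. The sketch below is an assumption-laden stand-in: it uses the fact that the unitary DFT matrix $F$ satisfies $F^4 = I$, builds the four spectral projectors onto its eigenvalues $(-j)^k$, and weights them by $e^{-j\pi k p/2}$. It agrees with the Hermite–Gaussian DFrFT at integer orders but generally differs at fractional ones. The names `dft_matrix` and `fractional_dft_matrix` are hypothetical.

```python
import numpy as np


def dft_matrix(d):
    """Unitary D x D DFT matrix."""
    n = np.arange(d)
    return np.exp(-2j * np.pi * np.outer(n, n) / d) / np.sqrt(d)


def fractional_dft_matrix(d, p):
    """Fractional power F^p built from the four spectral projectors of the DFT."""
    f = dft_matrix(d)
    powers = [np.linalg.matrix_power(f, m) for m in range(4)]  # I, F, F^2, F^3
    f_p = np.zeros((d, d), dtype=complex)
    for k in range(4):
        lam = (-1j) ** k  # k-th eigenvalue of the DFT
        # Projector onto the eigenspace of lam: (1/4) * sum_m lam^(-m) F^m
        projector = sum(lam ** (-m) * powers[m] for m in range(4)) / 4
        f_p += np.exp(-1j * np.pi * k * p / 2) * projector
    return f_p


d = 8
half = fractional_dft_matrix(d, 0.5)
print(np.allclose(fractional_dft_matrix(d, 0.0), np.eye(d)))  # order 0: identity
print(np.allclose(half @ half, dft_matrix(d)))                 # two half-steps: full DFT
```

The two checks show the interpolation property described above: order 0 leaves the embedding unchanged, and composing two order-0.5 transforms yields the ordinary spectrum.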

2. Multimodal Fractional Fourier Adapter (MFFA): Architecture

MFFA serves as the bridge across modalities, aligning visual, tactile, and language features in a unified embedding–frequency space through fractional domain projections coupled with cross-modal attention.

Inputs:

  • $E^v \in \mathbb{R}^{1\times D}$ (Visual embedding)
  • $E^t \in \mathbb{R}^{1\times D}$ (Tactile embedding)
  • $E^\ell \in \mathbb{R}^{1\times D}$ (Language prompt embedding)

2.1 Language-Guided FrFT Processing

The language embedding is first expanded and projected by the FrFT:

$$F^\ell = U\left(\operatorname{Re}\{\operatorname{FrFT}_p(\theta_{e,\ell} E^\ell)\}\right) + j\, U\left(\operatorname{Im}\{\operatorname{FrFT}_p(\theta_{e,\ell} E^\ell)\}\right)$$

where $\theta_{e,\ell} \in \mathbb{R}^{E \times 1}$ are learned expansion weights and $U(\cdot)$ denotes ReLU activation.
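The expansion, fractional transform, and split ReLU can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the fractional transform is the simple projector-based fractional DFT (not the Hermite–Gaussian DFrFT), it is applied along the embedding axis of each expanded row, and the names `fractional_dft_matrix` and `frft_processing` are hypothetical.

```python
import numpy as np


def fractional_dft_matrix(d, p):
    """Projector-based fractional power of the unitary DFT (assumed stand-in for FrFT_p)."""
    n = np.arange(d)
    f = np.exp(-2j * np.pi * np.outer(n, n) / d) / np.sqrt(d)
    powers = [np.linalg.matrix_power(f, m) for m in range(4)]
    return sum(
        np.exp(-1j * np.pi * k * p / 2)
        * sum((1j ** (k * m)) * powers[m] for m in range(4)) / 4
        for k in range(4)
    )


def frft_processing(embedding, theta_e, p=0.5):
    """embedding: (1, D) real, theta_e: (E, 1) real -> (E, D) complex features."""
    expanded = theta_e @ embedding                  # outer product, shape (E, D)
    d = embedding.shape[-1]
    spectrum = expanded @ fractional_dft_matrix(d, p).T  # transform every row
    # ReLU is applied to the real and imaginary parts separately.
    return np.maximum(spectrum.real, 0) + 1j * np.maximum(spectrum.imag, 0)


rng = np.random.default_rng(0)
e_lang = rng.standard_normal((1, 16))   # toy language embedding, D = 16
theta = rng.standard_normal((4, 1))     # expansion size E = 4
print(frft_processing(e_lang, theta).shape)
```

Because ReLU acts on each part independently, the output always lies in the first quadrant of the complex plane, whatever the phase of the fractional spectrum.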

2.2 Fractional Fourier Attention (FrATT)

A global class token $F^{g\ell} = \operatorname{mean}_{\text{class}}(F^\ell)$ is aggregated over the batch. Fractional attention operates as:

$$\bar{F}^\ell = \operatorname{FrATT}(F^\ell, F^{g\ell} \oplus F^\ell) = \operatorname{FFN}\left(\operatorname{Mean}\left(\operatorname{Softmax}\left(\frac{\theta_Q F^\ell \cdot \left(\theta_K (F^{g\ell} \oplus F^\ell)\right)^T}{\sqrt{D}}\right) \cdot \theta_V (F^{g\ell} \oplus F^\ell)\right)\right)$$

where $\theta_Q, \theta_K, \theta_V \in \mathbb{R}^{D \times D}$ denote learned projections.
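A rough NumPy sketch of the attention step is given below. Several details are assumptions rather than facts from the source: $\oplus$ is read as concatenation along the token axis, the softmax is taken over the real part of the complex scores, tokens are stored as rows (so projections multiply on the right), and the FFN defaults to the identity. The name `fratt` is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def fratt(query, global_token, tokens, theta_q, theta_k, theta_v, ffn=lambda x: x):
    """query: (Nq, D); global_token: (1, D); tokens: (N, D); thetas: (D, D)."""
    context = np.concatenate([global_token, tokens], axis=0)  # assumed meaning of ⊕
    d = query.shape[-1]
    q, k, v = query @ theta_q, context @ theta_k, context @ theta_v
    # Assumption: attention weights come from the real part of the complex scores.
    scores = (q @ k.conj().T).real / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    pooled = (weights @ v).mean(axis=0)  # Mean(.) over query tokens
    return ffn(pooled)


rng = np.random.default_rng(0)
d = 8
f_lang = rng.standard_normal((4, d)) + 1j * rng.standard_normal((4, d))
f_global = f_lang.mean(axis=0, keepdims=True)  # stand-in for the class mean
eye = np.eye(d)
print(fratt(f_lang, f_global, f_lang, eye, eye, eye).shape)
```

Each attention row is a convex combination of the context tokens, so when all context tokens coincide the output equals that token after the value projection.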

2.3 Language-Guided VIS/TAC Fractional Projection

Refined language features $\bar{F}^\ell$ guide the projection of VIS/TAC embeddings:

  • $F^v = \operatorname{FrFTProcessing}(\theta_{e,v} (\bar{F}^\ell + E^v))$
  • $\bar{F}^v = \operatorname{FrATT}(\bar{F}^\ell, F^{gv} \oplus F^v)$
  • $F^t$ and $\bar{F}^t$ are constructed identically, via shared weights

2.4 Modality-Alignment Loss

To align modalities in the fractional space, OmniVaT minimizes a KL divergence loss:

$$L_{mma} = \lambda \left[ \operatorname{KL}(\bar{F}^\ell \,\|\, \bar{F}^v) + \operatorname{KL}(\bar{F}^\ell \,\|\, \bar{F}^t) \right]$$
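The loss can be sketched as follows. The source does not say how the complex-valued features are turned into probability distributions, so this example assumes a softmax over feature magnitudes; that choice, and the names `to_distribution`, `kl`, and `mma_loss`, are illustrative only.

```python
import numpy as np


def to_distribution(f):
    """Assumption: softmax over magnitudes maps complex features to a distribution."""
    mag = np.abs(f)
    e = np.exp(mag - mag.max())
    return e / e.sum()


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def mma_loss(f_lang, f_vis, f_tac, lam=10.0):
    """lam * [KL(lang || vis) + KL(lang || tac)], with language as the reference."""
    p_l, p_v, p_t = (to_distribution(f) for f in (f_lang, f_vis, f_tac))
    return lam * (kl(p_l, p_v) + kl(p_l, p_t))


f_l = np.array([1.0, 0.0, 0.0], dtype=complex)
f_v = np.array([0.0, 1.0, 0.0], dtype=complex)
print(mma_loss(f_l, f_l, f_l))      # identical features: zero loss
print(mma_loss(f_l, f_v, f_l) > 0)  # mismatched visual features are penalized
```

The language distribution is the first argument of both KL terms, so it acts as the anchor toward which the visual and tactile distributions are pulled.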

2.5 Choice of Fractional Order $p$

In all main experiments, $p = 0.5$ (the midpoint between original embedding and full spectrum) delivers optimal results, validated through a sweep over $p \in \{0, 0.25, 0.5, 0.75, 1\}$. The order $p$ is treated as a hyperparameter.

2.6 MFFA Forward Pass Pseudocode

# Language branch
F^l_real, F^l_imag = FrFT_p(θ_e,l · E^l)
F^l = ReLU(F^l_real) + j · ReLU(F^l_imag)
F^g_l = class_mean(F^l)
F̄^l = FrATT(F^l, F^g_l ⊕ F^l)
# Visual branch, guided by the refined language features
tmp^v = θ_e,v · (F̄^l + E^v)
F^v_real, F^v_imag = FrFT_p(tmp^v)
F^v = ReLU(F^v_real) + j · ReLU(F^v_imag)
F^g_v = class_mean(F^v)
F̄^v = FrATT(F̄^l, F^g_v ⊕ F^v)
# Tactile branch: repeat the visual steps on E^t with the shared weights to obtain F̄^t
# Modality-alignment loss
L_mma = λ · [ KL(F̄^l ‖ F̄^v) + KL(F̄^l ‖ F̄^t) ]

3. Framework Integration and Discrete Tree Generation (DTG)

OmniVaT consists of:

  • Frozen CLIP-pretrained visual, tactile, and language encoders
  • MFFA module with shared weights for VIS/TAC
  • DTG module performing domain augmentation by hierarchical tree feature generation
  • Linear classifier with cross-entropy supervision

Forward Pass Overview:

  1. Extract $E^v, E^t, E^\ell$ via CLIP backbones.
  2. Apply MFFA to yield $\bar{F}^v, \bar{F}^t, \bar{F}^\ell$, and compute $L_{mma}$.
  3. Concatenate all (VIS/TAC/lang) into root tree feature $T^{(1)}$.
  4. DTG constructs a binary tree of depth 3, generating augmented features as children nodes: $T^{(r+1)}_{m,n} = G(T^{(r)}_m, W^{(r)}_{m,n})$.
  5. Compute the node-diversity loss $L_{nod} = \sum_r \|A^{(r)} - I\|_F$, where $A^{(r)}_{i,j}$ is the cosine similarity between $T^{(r)}_i$ and $T^{(r)}_j$.
  6. Final feature fusion: $\hat{F}^v = \operatorname{mean}(T^{leaf}) + \bar{F}^v$, and analogously for TAC.
  7. Classification by linear head; CE loss $L_\mathrm{CE}$.

Total loss: $L = L_{mma} + L_{nod} + L_\mathrm{CE}$.
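The tree construction and the node-diversity loss can be sketched as below. The generator $G$ is not specified in this summary, so the example substitutes a hypothetical `tanh(parent @ W)`; it also assumes real-valued node features and that the root counts as level 1, so a depth-3 binary tree has four leaves. All function names are illustrative.

```python
import numpy as np


def generate_children(parent, weights):
    """Hypothetical generator G: one child per weight matrix."""
    return [np.tanh(parent @ w) for w in weights]


def build_tree(root, level_weights):
    """level_weights[r] holds, for each node at level r, a pair of (D, D) matrices."""
    levels = [[root]]
    for per_node in level_weights:
        children = []
        for node, pair in zip(levels[-1], per_node):
            children.extend(generate_children(node, pair))
        levels.append(children)
    return levels


def node_diversity_loss(levels):
    """Sum over levels of ||A - I||_F, with A the pairwise cosine-similarity matrix."""
    total = 0.0
    for nodes in levels:
        t = np.stack(nodes)
        t = t / np.linalg.norm(t, axis=1, keepdims=True)
        a = t @ t.T
        total += np.linalg.norm(a - np.eye(len(nodes)), ord="fro")
    return float(total)


rng = np.random.default_rng(0)
d = 6
root = rng.standard_normal(d)
weights = [
    [[rng.standard_normal((d, d)) for _ in range(2)] for _ in range(n_nodes)]
    for n_nodes in (1, 2)  # one node at level 1, two nodes at level 2
]
levels = build_tree(root, weights)
leaf_mean = np.mean(levels[-1], axis=0)  # mean(T^leaf) used in the fusion step
print([len(level) for level in levels], node_diversity_loss(levels) >= 0)
```

The loss is zero only when the nodes at each level are mutually orthogonal, so minimizing it pushes sibling features apart and discourages the augmented nodes from collapsing onto one another.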

4. Implementation Regimen

  • Backbones: Frozen CLIP ResNet-50, ViT-B/16, ViT-L/14 ($D = 512, 768, 1024$)
  • MFFA hyperparameters: Fractional order $p = 0.5$, expansion size $E = 4$, alignment weight $\lambda = 10$
  • DTG hyperparameters: Tree depth $R = 3$
  • Optimization: SGD, learning rate 0.05, cosine warm-up, momentum 0.9, 20 epochs
  • Batch sizes: 16 paired VIS-TAC samples, 16 prompt texts
  • Objective: Combined loss $L = L_{mma} + L_{nod} + L_\mathrm{CE}$
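For reference, the scalar settings listed above can be collected in a small configuration object. This is only a convenience sketch: the class name `OmniVaTConfig`, the field names, and the helper `total_loss` are hypothetical, and the backbone-to-dimension mapping is deliberately left out because the list does not pair each backbone with its $D$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OmniVaTConfig:
    fractional_order: float = 0.5   # p
    expansion_size: int = 4         # E
    alignment_weight: float = 10.0  # lambda in L_mma
    tree_depth: int = 3             # R
    learning_rate: float = 0.05     # SGD
    momentum: float = 0.9
    epochs: int = 20
    batch_size: int = 16            # paired VIS-TAC samples (and prompt texts)


def total_loss(l_mma, l_nod, l_ce):
    """Unweighted sum; lambda is already folded into l_mma."""
    return l_mma + l_nod + l_ce


cfg = OmniVaTConfig()
print(cfg.fractional_order, cfg.tree_depth, total_loss(1.0, 2.0, 3.0))
```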

5. Experimental Validation

Ablation analysis on the “TAG→X” protocol using ViT-B/16 reveals:

| Model Variant | ACC (%) | Macro-F1 (%) | Macro-F1 gain over baseline (percentage points) |
| --- | --- | --- | --- |
| Baseline (PromptStyler + CE only) | 51.7 | 40.6 | - |
| +MFFA (w/o $L_{mma}$) | 54.5 | 51.2 | +10.6 |
| +MFFA (with $L_{mma}$) | 54.8 | 52.5 | +11.9 |
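Accuracy and Macro-F1 can move by quite different amounts because Macro-F1 averages per-class F1 scores with equal weight, so minority classes count as much as majority ones. The sketch below illustrates this generic effect on made-up labels; the data are not from the paper, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced labels: 8 samples of class 0 and 2 of class 1.
y_true = [0] * 8 + [1] * 2
always_zero = [0] * 10          # ignores the minority class
one_fixed = [0] * 8 + [1, 0]    # recovers one minority sample

print(accuracy(y_true, always_zero), macro_f1(y_true, always_zero))
print(accuracy(y_true, one_fixed), macro_f1(y_true, one_fixed))
```

In this toy case, recovering a single minority-class sample raises accuracy by 10 points but Macro-F1 by roughly 36 points.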

Cosine-margin analysis on unseen domains reports:

| Method | Cosine margin |
| --- | --- |
| LDC | ≈0.04 |
| OmniVaT (MFFA+DTG) | ≈0.17 |

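These experiments show that MFFA with the alignment loss $L_{mma}$ raises Macro-F1 by approximately 12 percentage points over the baseline (40.6% to 52.5%), while accuracy rises by about 3 points (51.7% to 54.8%). The gain is attributed to improved cross-modal alignment in fractional space. On unseen domains, the cosine margin rises from ≈0.04 (LDC) to ≈0.17 (OmniVaT), roughly a fourfold increase (Qiu et al., 1 Jan 2026).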
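The headline figures follow directly from the reported values, as this short arithmetic check shows. The numbers are copied from the ablation and margin results above; the variable names are illustrative, and the margin ratio is only as precise as the approximate margins.

```python
# (ACC %, Macro-F1 %) per model variant, as reported in the ablation.
rows = {
    "Baseline": (51.7, 40.6),
    "+MFFA (w/o L_mma)": (54.5, 51.2),
    "+MFFA (with L_mma)": (54.8, 52.5),
}
base_acc, base_f1 = rows["Baseline"]

for name, (acc, f1) in rows.items():
    # Gains in percentage points relative to the baseline row.
    print(name, round(acc - base_acc, 1), round(f1 - base_f1, 1))

# Cosine margins on unseen domains (approximate values).
margin_ratio = 0.17 / 0.04
print(round(margin_ratio, 2))  # 4.25
```

The Macro-F1 gains come out to 10.6 and 11.9 points, matching the last column of the ablation, while the accuracy gains are 2.8 and 3.1 points.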

6. Limitations, Assumptions, and Prospects

  • Training-only Language Prompts: Language information (class labeling) enters exclusively during training; at test time, MFFA cannot exploit textual cues.
  • Fractional Order Hyperparameterization: The fractional order $p$ is set by cross-validation, not learned end-to-end.
  • Modality Restriction: OmniVaT presently ingests image-based tactile data; vibration or force modalities are not included.
  • Prospective Extensions: Future work intends to incorporate vibrotactile signals and enable end-to-end learning of the fractional order.

A plausible implication is that the MFFA paradigm could generalize to other cross-modal or sensor fusion settings, provided suitable embedding–frequency mappings. OmniVaT marks an initial step for robust single-domain generalization in visual–tactile multimodal learning, integrating advanced spectral methods and structured augmentation.
