
Multimodal Fractional Fourier Adapter

Updated 8 January 2026
  • Multimodal Fractional Fourier Adapter is an architectural module that leverages the fractional Fourier transform to unify visual and tactile sensor embeddings in a joint frequency space.
  • It integrates modality-specific attention mechanisms and learnable projections to align outputs from visual, tactile, and training-only language encoders.
  • Empirical results demonstrate improved single-domain generalization, with macro-F1 scores increasing by up to 18.5 points over conventional fusion approaches.

The Multimodal Fractional Fourier Adapter (MFFA) is an architectural module designed to bridge the modality gap between visual and tactile sensor representations in embodied agents, particularly within the context of single domain generalization for multimodal visual-tactile learning (SDG-VTL). MFFA operates by projecting embeddings from visual (VIS), tactile (TAC), and (during training) language (LANG) encoders into a joint embedding-frequency space using the fractional Fourier transform (FrFT), followed by modality-specific attention and alignment mechanisms. This approach mitigates discrepancies between VIS and TAC images and enhances cross-domain generalization without requiring multi-domain data or complex cross-modal fusion strategies (Qiu et al., 1 Jan 2026).

1. Mathematical Principles: Fractional Fourier Transform Foundations

MFFA relies on the fractional Fourier transform (FrFT), which generalizes the classical Fourier transform to a continuous family of intermediate domains. Given an embedding $E(u_0)\in\mathbb{R}^D$, its $p$-th order FrFT is defined as

$$\mathrm{FrFT}_p\{E\}(u_p) = \int_{-\infty}^{\infty} K_p(u_0, u_p)\;E(u_0)\;du_0,$$

with fractional order $p$ corresponding to the angle $\alpha = p\pi/2$. The kernel $K_p$ is

$$K_p(u_0, u_p) = A_\alpha \exp\Bigl(j\bigl(\frac{1}{2}u_0^2\cot\alpha - u_0 u_p\csc\alpha + \frac{1}{2}u_p^2\cot\alpha\bigr)\Bigr),$$

where $A_\alpha = \sqrt{\frac{1 - j\cot\alpha}{2\pi}}$. When $\alpha$ is an integer multiple of $\pi$, $K_p$ reduces to a Dirac delta function, recovering the identity and parity transforms.
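As a quick check on the kernel, the case $p=1$ can be worked out by hand. With $\alpha=\pi/2$ we have $\cot\alpha=0$ and $\csc\alpha=1$, so both quadratic phase terms vanish and the definition collapses to the ordinary unitary Fourier transform:

$$A_{\pi/2}=\frac{1}{\sqrt{2\pi}},\qquad K_1(u_0,u_1)=\frac{1}{\sqrt{2\pi}}\,e^{-j u_0 u_1},\qquad \mathrm{FrFT}_1\{E\}(u_1)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}E(u_0)\,e^{-j u_0 u_1}\,du_0 .$$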

Practical implementation uses a discrete FrFT (DFrFT) matrix $F_p\in\mathbb{C}^{D\times D}$ (complex-valued in general, which is why its output is later split into real and imaginary parts), calculated via Hermite–Gaussian eigenvectors $v_k$ and eigenvalues $\lambda_k$:

$$F_p[m,n] = \sum_{k=1}^D v_k(m)\,\lambda_k^p\,v_k(n),$$

yielding $\mathrm{FrFT}_p(E) = F_p E$.
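The following NumPy sketch shows one way such a matrix could be built. It is not taken from the paper: it assumes the widely used construction in which discrete Hermite–Gaussian vectors are obtained as eigenvectors of a nearly tridiagonal matrix `S` that commutes with the DFT, with eigenvalues $\lambda_k = e^{-j\pi k/2}$ indexed from $k=0$. The helper name `dfrft_matrix` is hypothetical.

```python
import numpy as np

def dfrft_matrix(D, p):
    """Hypothetical helper: D x D discrete FrFT matrix of order p (assumes D >= 3)."""
    n = np.arange(D)
    eye = np.eye(D)
    # S commutes with the DFT, so its eigenvectors are also DFT eigenvectors.
    S = np.diag(2.0 * np.cos(2.0 * np.pi * n / D))
    S += np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)

    # Orthonormal bases of the even and odd subspaces (v[n] = +/- v[-n mod D]).
    half = (D - 1) // 2
    P_even = np.zeros((D, D - half))
    P_odd = np.zeros((D, half))
    P_even[0, 0] = 1.0
    for i in range(1, half + 1):
        P_even[i, i] = P_even[D - i, i] = 1.0 / np.sqrt(2.0)
        P_odd[i, i - 1] = 1.0 / np.sqrt(2.0)
        P_odd[D - i, i - 1] = -1.0 / np.sqrt(2.0)
    if D % 2 == 0:
        P_even[D // 2, -1] = 1.0

    vecs, orders = [], []
    for P, first in ((P_even, 0), (P_odd, 1)):
        _, V = np.linalg.eigh(P.T @ S @ P)       # eigenvalues come back ascending
        vecs.append(P @ V[:, ::-1])              # reorder to descending eigenvalue
        # Even vectors take Hermite orders 0, 2, 4, ...; odd ones take 1, 3, 5, ...
        orders.append(first + 2 * np.arange(P.shape[1]))
    V = np.hstack(vecs)
    k = np.concatenate(orders)
    lam_p = np.exp(-1j * np.pi / 2.0 * k * p)    # lambda_k ** p on a fixed branch
    return (V * lam_p) @ V.T                     # sum_k v_k lambda_k^p v_k^T

F_half = dfrft_matrix(8, 0.5)                    # order used in the text, toy size
```

Under these assumptions, `dfrft_matrix(D, 0.0)` is the identity and `dfrft_matrix(D, 1.0)` matches the unitary DFT matrix, so `F_half` sits halfway between the two.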

2. MFFA Architecture and Pipeline Integration

MFFA is inserted into the OmniVaT pipeline following frozen modality-specific encoders:

  • Visual encoder (e.g., CLIP vision, outputs $E^v\in\mathbb{R}^D$)
  • Tactile encoder ($E^t\in\mathbb{R}^D$)
  • (Training only) Language encoder ($E^l\in\mathbb{R}^D$)

Each modality branch comprises two operations: FrFT processing and fractional Fourier attention (FrATT), with branch-specific learnable linear projections.

MFFA Processing Steps

| Branch | Linear Expansion | FrFT Application | FrATT Query |
|---|---|---|---|
| Language | $\theta_{e,l}$ | $\mathrm{FrFT}_p$ of $\theta_{e,l}E^l$ | $F^l$, $F^l_g$ |
| Visual | $\theta_{e,v}$ | $\mathrm{FrFT}_p$ of $\theta_{e,v}(E^v + \bar F^l)$ | $\bar F^l$, $F^v_g$ |
| Tactile | $\theta_{e,t}$ | $\mathrm{FrFT}_p$ of $\theta_{e,t}(E^t + \bar F^l)$ | $\bar F^l$, $F^t_g$ |

  • After expansion via $\theta_{e,*}$, the output is passed through the DFrFT and then separated into real and imaginary parts, with ReLU activation applied to both.
  • Attention is performed using modality-specific queries, global-class tokens, and two-layer MLPs.
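The first of these steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the projection `theta_v` is a random stand-in for a learnable matrix, the expansion is assumed to multiply the embedding width by a factor `ext`, and the unitary DFT matrix (the $p=1$ case) is used as a placeholder for the DFrFT matrix. How the two activated parts are recombined before attention is not specified in the text, so the sketch simply returns both.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def branch_frft_features(E, theta, F_p):
    """Hypothetical helper: linear expansion -> DFrFT -> real/imag split -> ReLU."""
    expanded = theta @ E          # theta: (D_out, D), E: (D,)
    spectrum = F_p @ expanded     # complex vector of length D_out
    return relu(spectrum.real), relu(spectrum.imag)

rng = np.random.default_rng(0)
D, ext = 8, 4                                               # toy sizes (assumed)
E_v = rng.standard_normal(D)                                # stand-in for a frozen visual embedding
theta_v = rng.standard_normal((ext * D, D)) / np.sqrt(D)    # stand-in for a learnable projection
F_p = np.fft.fft(np.eye(ext * D), norm="ortho")             # placeholder: p = 1 (unitary DFT)

real_part, imag_part = branch_frft_features(E_v, theta_v, F_p)
```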

The output representations $\bar F^v$ and $\bar F^t$ from VIS and TAC are then processed by the Discrete Tree Generation (DTG) module for tree-based representation diversity.

3. Training Objectives, Losses, and Optimization

Three principal loss components coordinate MFFA training:

  • Classification Loss: $\mathcal{L}_{\rm CE} = -\sum_{i} y_i\log p_i$
  • Multimodal Alignment Loss in FrFT Domain: $\mathcal{L}_{\rm MMA} = \lambda\Bigl[D_{\rm KL}(\bar F^l\,\|\,\bar F^v) + D_{\rm KL}(\bar F^l\,\|\,\bar F^t)\Bigr]$
  • Tree-node loss: $\mathcal{L}_{\rm NOD} = \sum_{r=1}^R \|A^{(r)} - I^{(r)}\|_F$, with $A^{(r)}$ denoting the similarity matrix of nodes at depth $r$.

Total loss is the sum: $\mathcal{L} = \mathcal{L}_{\rm CE} + \mathcal{L}_{\rm MMA} + \mathcal{L}_{\rm NOD}$
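A minimal NumPy sketch of how the three terms might be computed and summed is given below. Several details are assumptions rather than facts from the source: the features are turned into distributions with a softmax before the KL divergence, $A^{(r)}$ is taken to be a cosine-similarity matrix, and $I^{(r)}$ is taken to be the identity of matching size. All function names are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def cross_entropy(y_onehot, probs, eps=1e-12):
    return float(-np.sum(y_onehot * np.log(probs + eps)))

def kl_div(p, q, eps=1e-12):
    # D_KL(p || q) for two discrete distributions
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def mma_loss(F_l, F_v, F_t, lam=10.0):
    # Assumption: features are softmax-normalized before the KL terms.
    p_l, p_v, p_t = softmax(F_l), softmax(F_v), softmax(F_t)
    return lam * (kl_div(p_l, p_v) + kl_div(p_l, p_t))

def nod_loss(nodes_per_depth):
    total = 0.0
    for nodes in nodes_per_depth:                  # nodes: (num_nodes, dim) at one depth
        unit = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
        A = unit @ unit.T                          # assumed cosine-similarity matrix
        total += np.linalg.norm(A - np.eye(len(A)), ord="fro")
    return float(total)

def total_loss(y_onehot, probs, F_l, F_v, F_t, nodes_per_depth, lam=10.0):
    return (cross_entropy(y_onehot, probs)
            + mma_loss(F_l, F_v, F_t, lam)
            + nod_loss(nodes_per_depth))
```

With these definitions the alignment term vanishes when the visual and tactile features match the language anchor, and the node term vanishes when the nodes at every depth are mutually orthogonal.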

Key hyperparameters include FrFT order $p=0.5$, embedding extension $E=4$, alignment weight $\lambda=10$, tree depth $R=3$, initial learning rate 0.05 (cosine-annealed), SGD momentum 0.9, 20 epochs, and batch size 16.
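For reference, a cosine-annealed schedule with these values can be written as a small function. The sketch assumes the rate is updated once per epoch, decays to zero, and uses no warm-up; none of these details is stated in the source, and `cosine_lr` is a hypothetical name.

```python
import math

def cosine_lr(epoch, base_lr=0.05, total_epochs=20, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (assumed per-epoch updates)."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_term

schedule = [cosine_lr(e) for e in range(20)]   # one value per training epoch
```

The schedule starts at 0.05 and reaches half that value at the midpoint of training.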

4. Quantitative Impact and Ablation Analysis

On TAG → X, adding MFFA without $\mathcal{L}_{\rm MMA}$ (that is, without FrFT-domain alignment) raises macro-F1 from 40.6% (baseline) to 51.2%; reintroducing $\mathcal{L}_{\rm MMA}$ raises it further to 52.5%. Compared with embedding-space fusion methods (e.g., LDC), MFFA plus alignment delivers an improvement of up to 18.5 points on OF 2.0 A → X.
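Macro-F1, the metric quoted here, is the unweighted mean of per-class F1 scores. A small NumPy sketch follows; the function name is hypothetical, and scoring a class with no true or predicted samples as 0 is an assumed convention.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); empty classes score 0 by convention here.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

score = macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
```

Because every class counts equally, a gain on a rare class moves macro-F1 as much as the same gain on a frequent one.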

Cosine-margin tests (comparing intra-class vs. inter-class similarity) show that OmniVaT’s MFFA module markedly increases the margin to 0.258, compared to roughly 0.04 for conventional approaches. Hyperparameter optimization indicates optimal performance for $p=0.5$ (interpolating between embedding and frequency representations) and tree depth $R=3$.
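One plausible reading of the cosine-margin test is sketched below: the margin is taken as the mean cosine similarity between same-class pairs minus the mean similarity between different-class pairs. The exact definition used in the evaluation is not given in the source, so this formula and the name `cosine_margin` are assumptions.

```python
import numpy as np

def cosine_margin(features, labels):
    """Mean intra-class cosine similarity minus mean inter-class cosine similarity."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    same = y[:, None] == y[None, :]
    off_diag = ~np.eye(len(y), dtype=bool)     # exclude each sample's self-similarity
    intra = sim[same & off_diag].mean()
    inter = sim[~same].mean()
    return float(intra - inter)

margin = cosine_margin([[1, 0], [1, 0], [0, 1], [0, 1]], [0, 0, 1, 1])
```

A larger margin means same-class features cluster more tightly relative to features from other classes.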

5. Insights into Modality Alignment: Spectral–Spatial Semantic Bridging

The rationale for FrFT-driven alignment is the continuum it provides between the pure embedding domain ($p=0$) and the pure frequency domain ($p=1$). Intermediate fractional orders ($p\approx 0.5$) uncover joint spectral–spatial semantics, facilitating disentanglement of class boundaries and strong congruence between VIS and TAC modalities. This mode of integration significantly reduces the modality gap by allowing visual and tactile features to coincide more naturally in the hybridized domain.
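The sense in which $p=0.5$ lies "halfway" can be made precise from the eigen-decomposition of the discrete transform. Assuming real orthonormal eigenvectors $v_k$ and a fixed branch for the fractional powers (for example $\lambda_k^p = e^{-j\pi k p/2}$), the orders add under composition:

$$F_p F_q = \sum_{k,k'} \lambda_k^{p}\,\lambda_{k'}^{q}\; v_k\,(v_k^\top v_{k'})\,v_{k'}^\top = \sum_{k} \lambda_k^{p+q}\, v_k v_k^\top = F_{p+q}, \qquad F_0 = \sum_k v_k v_k^\top = I .$$

In particular $F_{0.5}F_{0.5} = F_1$, so applying the half-order transform twice gives the full frequency-domain representation.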

The language branch and its “anchor” $\bar F^l$ facilitate class-level consistency across modalities in both feature extraction and alignment, while preserving training-only dependence on textual prompts.

6. Practical Considerations, Limitations, and Performance

MFFA is tunable via its fractional order parameter ($p$); class separation degrades at the extremes ($p=0$ or $p=1$), so $p$ needs careful calibration. While MFFA adds computational overhead (FrFT transformations and attention layers), it still runs in real time: 75 FPS on an RTX 3090 with ViT-B/16 visual backbones.

A notable constraint is the training-stage reliance on a language encoder; class label prompts must be available during learning, although textual input is not required at test time. MFFA’s insertion yields robust single-domain generalization for visual–tactile object recognition.


The Multimodal Fractional Fourier Adapter constitutes a principled, architecture-level solution for multimodal alignment and generalization, leveraging fractional spectral–spatial transformations to unify disparate sensory domains (Qiu et al., 1 Jan 2026).
