Multimodal Fractional Fourier Adapter
- Multimodal Fractional Fourier Adapter is an architectural module that leverages the fractional Fourier transform to unify visual and tactile sensor embeddings in a joint frequency space.
- It integrates modality-specific attention mechanisms and learnable projections to align outputs from visual, tactile, and training-only language encoders.
- Empirical results demonstrate improved single-domain generalization, with macro-F1 scores increasing by up to 18.5 points over conventional fusion approaches.
The Multimodal Fractional Fourier Adapter (MFFA) is an architectural module designed to bridge the modality gap between visual and tactile sensor representations in embodied agents, particularly within the context of single domain generalization for multimodal visual-tactile learning (SDG-VTL). MFFA operates by projecting embeddings from visual (VIS), tactile (TAC), and (during training) language (LANG) encoders into a joint embedding-frequency space using the fractional Fourier transform (FrFT), followed by modality-specific attention and alignment mechanisms. This approach mitigates discrepancies between VIS and TAC images and enhances cross-domain generalization without requiring multi-domain data or complex cross-modal fusion strategies (Qiu et al., 1 Jan 2026).
1. Mathematical Principles: Fractional Fourier Transform Foundations
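MFFA relies fundamentally on the fractional Fourier transform (FrFT), which generalizes the classical Fourier transform to a continuous family of intermediate domains. Given an embedding $x(t)$, its $a$-th order FrFT is defined, in the standard convention, as:

$$X_a(u) = \int_{-\infty}^{\infty} K_a(u, t)\, x(t)\, dt,$$

with fractional order $a$ corresponding to angle $\alpha = a\pi/2$. The kernel is expressed as:

$$K_a(u, t) = A_\alpha \exp\!\left(i\pi\left[(u^2 + t^2)\cot\alpha - 2ut\csc\alpha\right]\right),$$

where $A_\alpha = \sqrt{1 - i\cot\alpha}$. For $\alpha$ an integer multiple of $\pi$, $K_a$ reduces to Dirac delta functions ($\delta(u - t)$ for even multiples, $\delta(u + t)$ for odd multiples), recovering the identity and parity transforms.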
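Practical implementation uses a discrete FrFT (DFrFT) matrix $\mathbf{F}^a$ of order $a$, calculated via the Hermite–Gaussian eigenvectors $\mathbf{V}$ and eigenvalues of the DFT matrix:

$$\mathbf{F}^a = \mathbf{V}\,\boldsymbol{\Lambda}^a\,\mathbf{V}^\top, \qquad \boldsymbol{\Lambda}^a = \mathrm{diag}\!\left(e^{-i a k \pi/2}\right),$$

where $k$ indexes the Hermite–Gaussian order, yielding $\mathbf{X}_a = \mathbf{F}^a \mathbf{x}$.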
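As a minimal sketch of how a fractional-order transform matrix can be built and applied, the Python snippet below constructs a DFrFT from spectral projectors of the unitary DFT matrix. This is the simpler weighted-sum construction, swapped in for brevity; it is not the Hermite–Gaussian eigenvector method described above, and the function names `unitary_dft` and `dfrft_matrix` are illustrative. It does share the properties that matter here: order 0 gives the identity, order 1 gives the ordinary DFT, and orders add under composition.

```python
import numpy as np

def unitary_dft(n):
    """Unitary DFT matrix F with F[j, k] = exp(-2*pi*i*j*k/n) / sqrt(n)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n) / np.sqrt(n)

def dfrft_matrix(n, a):
    """Weighted-sum DFrFT of order `a` (assumption: stand-in for the
    Hermite-Gaussian DFrFT). Because F^4 = I, the eigenvalues of F are
    exp(-i*pi*k/2) for k = 0..3; each eigenvalue is raised to the power a."""
    F = unitary_dft(n)
    powers = [np.eye(n, dtype=complex), F, F @ F, F @ F @ F]
    Fa = np.zeros((n, n), dtype=complex)
    for k in range(4):
        # Projector onto the eigenspace of F with eigenvalue exp(-i*pi*k/2).
        P_k = sum(np.exp(1j * np.pi * k * j / 2) * powers[j] for j in range(4)) / 4
        Fa += np.exp(-1j * np.pi * a * k / 2) * P_k
    return Fa

# Usage: transform a toy 8-dimensional embedding at an intermediate order.
x = np.arange(8, dtype=float)
X_half = dfrft_matrix(8, 0.5) @ x
```

The projector form keeps the matrix unitary for every order, so the transform preserves the norm of the embedding.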
2. MFFA Architecture and Pipeline Integration
MFFA is inserted into the OmniVaT pipeline following frozen modality-specific encoders:
- Visual encoder (e.g., CLIP vision), which outputs the visual embedding
- Tactile encoder, which outputs the tactile embedding
- (Training only) Language encoder, which outputs the language embedding
Each modality branch comprises two operations: FrFT processing and fractional Fourier attention (FrATT), with branch-specific learnable linear projections.
MFFA Processing Steps
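| Branch | Linear Expansion | FrFT Application | FrATT Query |
|---|---|---|---|
| Language | Language-specific learnable projection | FrFT of the expanded language embedding | Language-specific query |
| Visual | Visual-specific learnable projection | FrFT of the expanded visual embedding | Visual-specific query |
| Tactile | Tactile-specific learnable projection | FrFT of the expanded tactile embedding | Tactile-specific query |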
- After expansion via the branch-specific linear projection, the output is passed through the DFrFT and then separated into real and imaginary parts, with a ReLU activation applied to each.
- Attention is performed using modality-specific queries, global-class tokens, and two-layer MLPs.
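The per-branch steps above can be sketched in NumPy as follows. This is a heavily simplified illustration under stated assumptions, not the module's actual implementation: the dimensions, the random weights, the single-query attention pooling, and the names `branch_forward`, `W_vis`, and `q_vis` are all hypothetical, the global-class tokens and two-layer MLPs are omitted, and the DFrFT is the weighted-sum variant rather than the Hermite–Gaussian one.

```python
import numpy as np

def dfrft_matrix(n, a):
    """Weighted-sum DFrFT of order a (assumed stand-in for the real DFrFT)."""
    idx = np.arange(n)
    F = np.exp(-2j * np.pi * np.outer(idx, idx) / n) / np.sqrt(n)
    powers = [np.eye(n, dtype=complex), F, F @ F, F @ F @ F]
    Fa = np.zeros((n, n), dtype=complex)
    for k in range(4):
        P_k = sum(np.exp(1j * np.pi * k * j / 2) * powers[j] for j in range(4)) / 4
        Fa += np.exp(-1j * np.pi * a * k / 2) * P_k
    return Fa

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def branch_forward(z, W, query, order):
    """One modality branch: expand, transform, rectify, attend."""
    h = W @ z                                    # linear expansion, shape (m,)
    spec = dfrft_matrix(h.size, order) @ h       # complex spectrum, shape (m,)
    real = np.maximum(spec.real, 0.0)            # ReLU on the real part
    imag = np.maximum(spec.imag, 0.0)            # ReLU on the imaginary part
    tokens = np.stack([real, imag])              # two tokens, shape (2, m)
    # Hypothetical attention: one query scores the two tokens.
    weights = softmax(tokens @ query / np.sqrt(query.size))
    return weights @ tokens                      # pooled feature, shape (m,)

rng = np.random.default_rng(0)
d, m = 16, 32                                    # hypothetical sizes
z_vis = rng.standard_normal(d)                   # stands in for a frozen encoder output
W_vis = rng.standard_normal((m, d)) / np.sqrt(d) # branch-specific projection
q_vis = rng.standard_normal(m)                   # modality-specific query
out_vis = branch_forward(z_vis, W_vis, q_vis, order=0.5)
```

The tactile and language branches would reuse `branch_forward` with their own projection and query, which is what keeps the adapter lightweight on top of frozen encoders.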
The output representations of the VIS and TAC branches are then processed by the Discrete Tree Generation (DTG) module for tree-based representation diversity.
3. Training Objectives, Losses, and Optimization
Three principal loss components coordinate MFFA training:
- Classification loss
- Multimodal alignment loss in the FrFT domain
- Node diversity loss (DTG): encourages diverse tree representations, based on the similarity matrix of the nodes at a given tree depth.
The total loss is the sum of these three components.
Key hyperparameters include the FrFT order, the embedding extension, the alignment weight, and the tree depth. Training uses SGD with momentum 0.9, an initial learning rate of 0.05 with cosine annealing, 20 epochs, and a batch size of 16.
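To make the learning-rate schedule concrete, the sketch below computes a cosine-annealed rate from the stated initial value of 0.05 over 20 epochs. The per-epoch update, the final rate of zero, and the function name `cosine_annealed_lr` are assumptions; the source specifies only that the rate is cosine-annealed.

```python
import math

def cosine_annealed_lr(epoch, total_epochs=20, base_lr=0.05, min_lr=0.0):
    """Half-cosine decay from base_lr (epoch 0) to min_lr (epoch total_epochs)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate used at the start of each of the 20 training epochs.
schedule = [cosine_annealed_lr(epoch) for epoch in range(20)]
```

Under these assumptions the rate starts at 0.05, passes 0.025 at the midpoint, and decays smoothly toward zero.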
4. Quantitative Impact and Ablation Analysis
With the alignment loss removed (eliminating FrFT-domain alignment but retaining MFFA), macro-F1 on TAG → X rises from 40.6% (baseline) to 51.2%; reintroducing the alignment loss raises macro-F1 further to 52.5%. Compared with embedding-space fusion methods (e.g., LDC), MFFA plus alignment delivers an improvement of up to 18.5 points on OF 2.0 A → X.
Cosine-margin tests (comparing intra-class vs. inter-class similarity) show that OmniVaT's MFFA module markedly increases the margin to 0.258, compared with approximately 0.04 for conventional approaches. Hyperparameter analysis identifies an optimal FrFT order (interpolating between embedding and frequency representations) and an optimal tree depth.
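One plausible way to compute such a margin is sketched below. The exact definition used in the source is not given, so this assumes the margin is the mean pairwise cosine similarity within a class minus the mean across classes; the function name `cosine_margin` and the toy features are illustrative.

```python
import numpy as np

def cosine_margin(features, labels):
    """Mean intra-class minus mean inter-class cosine similarity (assumed definition)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T                                # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]          # True where classes match
    off_diag = ~np.eye(len(labels), dtype=bool)        # exclude self-similarity
    intra = sim[same & off_diag].mean()
    inter = sim[~same].mean()
    return intra - inter

# Toy example: two well-separated classes in 2-D.
features = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
labels = np.array([0, 0, 1, 1])
margin = cosine_margin(features, labels)
```

A larger margin means same-class features cluster more tightly relative to different-class features, which is the property the reported numbers compare.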
5. Insights into Modality Alignment: Spectral–Spatial Semantic Bridging
The rationale for FrFT-driven alignment is the continuum it provides between the pure embedding domain (fractional order 0, where the transform is the identity) and the pure frequency domain (fractional order 1, where it is the ordinary Fourier transform). Intermediate fractional orders (strictly between 0 and 1) uncover joint spectral–spatial semantics, facilitating disentanglement of class boundaries and strong congruence between the VIS and TAC modalities. This integration reduces the modality gap by allowing visual and tactile features to coincide more naturally in the hybrid domain.
The language branch serves as an "anchor" that promotes class-level consistency across modalities in both feature extraction and alignment, while textual prompts remain a training-only dependency.
6. Practical Considerations, Limitations, and Performance
MFFA is tunable via its fractional order parameter; class separation degrades at the extremes (fractional order 0 or 1), so the order needs careful calibration. While MFFA adds computational overhead (FrFT transformations and attention layers), it still runs in real time, at 75 FPS on an RTX 3090 with ViT-B/16 visual backbones.
A notable constraint is the training-stage reliance on a language encoder; class label prompts must be available during learning, although textual input is not required at test time. MFFA’s insertion yields robust single-domain generalization for visual–tactile object recognition.
The Multimodal Fractional Fourier Adapter constitutes a principled, architecture-level solution for multimodal alignment and generalization, leveraging fractional spectral–spatial transformations to unify disparate sensory domains (Qiu et al., 1 Jan 2026).