Phase-Based Fusion Module (PFM)
- A Phase-Based Fusion Module is an architectural component that leverages phase information to integrate complementary features across different modalities.
- It employs frequency-domain operations and attention mechanisms to dynamically align and enhance feature representations.
- PFMs improve performance in tasks like speaker recognition, multi-modality image fusion, and burst image restoration compared to traditional fusion methods.
Phase-Based Fusion Modules (PFMs) are a class of architectural components designed to integrate and align features in signal-processing or computer-vision tasks using properties derived from the phase domain. Unlike traditional fusion mechanisms that primarily emphasize magnitude or direct concatenation, PFMs leverage phase information, frequently through explicit frequency-domain operations or attention mechanisms, to achieve contextually aware inter-feature or inter-modality fusion. Contemporary instantiations of PFMs have been deployed in domains such as speaker recognition, multi-modality image fusion, and burst image restoration, demonstrating improvements over baseline and prior feature-level fusion techniques by selectively attending to cross-domain or cross-frame correlations.
1. Architectural Principles and Roles of PFMs
PFMs are typically situated between parallel feature extraction streams and downstream tasks such as classification, restoration, or embedding aggregation. Their core objective is to produce fused representations that encapsulate both the complementary and redundant information from their respective inputs, with adaptive emphasis dictated by the phase structure or phase correlations present in the data.
- In speaker recognition, PFMs learn, on a per-utterance basis, to dynamically prioritize magnitude or phase-derived signals in the formation of robust embeddings. This is achieved via a cross-domain co-attention block, yielding channel-wise re-weighted feature maps before pooling and classification (Su et al., 17 Oct 2025).
- In multi-modality image fusion, PFMs—explicitly organized as dual-phase modules—sequentially perform a parameter-free channel exchange at a global (shallow) phase, followed by state-space modeling and detailed interaction in a deep phase, to combine and refine modality-specific representations (Li et al., 2024).
- In burst flicker removal, PFM modules align burst-frame features through explicit phase-difference computations in the frequency domain, ensuring that periodic artifacts are correctly registered and suppressed prior to aggregation (Qu et al., 24 Mar 2026).
2. Mathematical Formulations
PFMs are characterized by a common reliance on phase-domain operations, but instantiations differ by modality:
Co-attention Fusion for Speaker Recognition:
Let $F_m, F_p \in \mathbb{R}^{C \times N}$ denote the channel-flattened magnitude and phase feature maps. These are linearly projected into query and key spaces, then a channel-wise correlation matrix $M$ is computed, followed by dual softmax normalizations to produce row- and column-based attention weights $A_r$, $A_c$. Feature maps are then re-weighted and reshaped to original spatial dimensions. The core sequence:
- $Q = F_m W_q$, $K = F_p W_k$
- $M = Q K^\top$, $A_r = \mathrm{softmax}_{\mathrm{row}}(M)$, $A_c = \mathrm{softmax}_{\mathrm{col}}(M)$
- $\tilde{F}_m = A_r F_m$, $\tilde{F}_p = A_c^\top F_p$ (Su et al., 17 Oct 2025)
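The co-attention sequence above can be sketched in a few lines of numpy. This is a minimal illustration under assumed shapes; the projection placement, descriptor dimension `d`, and the exact re-weighting rule are illustrative assumptions, not the published implementation.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coattention_fuse(Fm, Fp, Wq, Wk):
    """Channel-wise co-attention between magnitude and phase feature maps.

    Fm, Fp: (C, N) channel-flattened feature maps.
    Wq, Wk: (N, d) linear projections into query/key spaces (assumed shapes).
    """
    Q = Fm @ Wq                     # (C, d) per-channel query descriptors
    K = Fp @ Wk                     # (C, d) per-channel key descriptors
    M = Q @ K.T                     # (C, C) channel-wise correlation matrix
    A_row = softmax(M, axis=1)      # row-normalized attention weights
    A_col = softmax(M, axis=0)      # column-normalized attention weights
    Fm_rw = A_row @ Fm              # re-weighted magnitude features
    Fp_rw = A_col.T @ Fp            # re-weighted phase features
    return Fm_rw, Fp_rw
```

The re-weighted maps would then be reshaped back to their original spatial dimensions before pooling and classification.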
Frequency-Domain Phase Correlation for Burst Fusion:
Given burst features $X_0, X_1, X_2$ (with $X_1$ the base frame),
- Compute per-channel FFTs $X_i^f = \mathcal{F}(X_i)$ and decompose each into phase and amplitude.
- For each reference–base pair, obtain the phase difference: $S_i = \exp\!\big(j(\angle X_i^f - \angle X_1^f)\big)$, $i \in \{0, 2\}$.
- Project the real/imaginary components of $S_i$ via conv+sigmoid to obtain frequency-domain weight maps $W_i$.
- Apply $W_i$ to filter $X_i^f$, then return to the spatial domain via IFFT.
- Concatenate all spatial features and fuse via a conv+ReLU (Qu et al., 24 Mar 2026).
Dual-phase Channel Exchange and Mamba-based Deep Fusion:
- Shallow: Exchange channels between modalities according to a binary mask, then combine the exchanged features by plain or weighted summation.
- Deep: Refined fusion via multi-modal state-space modeling (M³ block), with gating, residual connections, and selective scan for efficient long-range interaction (Li et al., 2024).
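The shallow phase can be sketched as a parameter-free channel exchange followed by a blend. This is a minimal numpy illustration; the mask construction and the blend weight `alpha` are illustrative assumptions, not the published design.

```python
import numpy as np

def shallow_channel_exchange(Fa, Fb, mask, alpha=0.5):
    """Parameter-free shallow fusion: swap masked channels, then blend.

    Fa, Fb: (C, H, W) modality-specific feature maps.
    mask:   (C,) boolean; True means that channel is exchanged.
    alpha:  blend weight for the two exchanged streams (assumed).
    """
    m = mask[:, None, None]
    Fa_ex = np.where(m, Fb, Fa)        # Fa receives Fb's masked channels
    Fb_ex = np.where(m, Fa, Fb)        # and vice versa
    return alpha * Fa_ex + (1.0 - alpha) * Fb_ex
```

Note that with `alpha=0.5` the blend of the two exchanged streams reduces to a plain average of the inputs, so the mask only matters when the streams are weighted asymmetrically or processed separately downstream (as in the deep phase).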
3. Input Representations and Upstream Processing
PFMs operate over rich, modality-specific feature maps obtained from upstream networks:
- Magnitude (Spectral): Log-Mel filterbank (FBank) coefficients computed over 25 ms windows, with CMVN applied (192 dims/frame).
- Phase: Modified Group Delay (MODGD) features, cepstral smoothing, standardized imaginary parts, similar windowing and CMVN (201 dims/frame) (Su et al., 17 Oct 2025).
- Image patches/tokens: In multi-modality or burst settings, feature tokens derived by CNNs, patch embeddings, or hybrid convolutions (Li et al., 2024, Qu et al., 24 Mar 2026).
Each representation is processed by modality-specific encoders (e.g., Thin-ResNet34, CNN+Mamba blocks) to yield high-level feature tensors for fusion.
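The CMVN step applied to both streams above can be sketched per utterance; this is a standard formulation in numpy, with the epsilon guard an assumption of this sketch.

```python
import numpy as np

def cmvn(feats, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    feats: (T, D) frame-level features (e.g., log-Mel FBank or MODGD).
    Returns features with (approximately) zero mean and unit variance
    per dimension across the utterance.
    """
    mu = feats.mean(axis=0, keepdims=True)
    sigma = feats.std(axis=0, keepdims=True)
    return (feats - mu) / (sigma + eps)
```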
4. Dataflow and Network Implementation
PFMs are implemented as compact, efficient modules:
- Co-attention PFMs use 1×1 convolutions, linear projections, and softmax normalization, acting on channel-flattened maps.
- Frequency-domain PFMs require per-feature channel FFT/IFFT transforms, convolutions, elementwise multiplication for frequency masking, and a final spatial convolution for aggregation (Qu et al., 24 Mar 2026).
- Dual-phase PFMs first execute parameter-free shallow fusion (channel exchange + sum), then apply repeated M³ deep fusion blocks, with each branch composed of LayerNorm, MLP expansion, convolution, state-space scan, gating, modality merging, and residual addition (Li et al., 2024).
A sample pseudo-code for frequency-domain PFM:
```python
def PFM(X0, X1, X2):                       # each X: H x W x C; X1 is the base frame
    X0f, X1f, X2f = FFT2D(X0), FFT2D(X1), FFT2D(X2)
    # phase differences with respect to the base frame
    S0 = exp(1j * (angle(X0f) - angle(X1f)))
    S2 = exp(1j * (angle(X2f) - angle(X1f)))
    # frequency-domain weight maps from real/imag parts of the phase differences
    W0 = sigmoid(conv3x3(cat(Re(S0), Im(S0)), out_channels=C))
    W2 = sigmoid(conv3x3(cat(Re(S2), Im(S2)), out_channels=C))
    # gate unreliable frequency bands, then return to the spatial domain
    X0f_, X2f_ = X0f * W0, X2f * W2
    X0_, X2_ = IFFT2D(X0f_), IFFT2D(X2f_)
    # aggregate the aligned features
    F0 = ReLU(conv3x3(cat(X0_, X1, X2_), out_channels=C))
    return F0
```
5. Optimization and Training Procedures
PFMs are trained end-to-end within their parent networks, often under the same gradient and optimization protocols as the entire architecture. Notable details:
- Losses: Cross-entropy, AAM-Softmax, reconstruction, and perceptual losses, as dictated by the application.
- Optimizers: Adam is standard, with moderate batch sizes (e.g., 64 in speaker recognition), weight decay (e.g., 0.05 per epoch) (Su et al., 17 Oct 2025), and data augmentation (e.g., sliding-window cropping).
- No PFM-specific regularization has been deployed in the surveyed implementations.
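As a concrete reference for the AAM-Softmax objective listed above, the following is a minimal numpy sketch of the standard additive-angular-margin formulation; the scale `s` and margin `m` values here are placeholders, not the settings used in the cited work.

```python
import numpy as np

def aam_softmax_loss(emb, centers, labels, s=30.0, m=0.2):
    """Additive angular margin softmax (AAM-Softmax) cross-entropy.

    emb:     (B, D) embeddings; centers: (K, D) class weight vectors.
    labels:  (B,) integer class labels.
    s, m:    scale and angular margin (placeholder values).
    """
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = e @ w.T                                       # (B, K) cosines
    theta = np.arccos(np.clip(cos, -1 + 1e-7, 1 - 1e-7))
    tgt = np.zeros_like(cos, dtype=bool)
    tgt[np.arange(len(labels)), labels] = True
    # margin is added to the target-class angle only
    logits = s * np.where(tgt, np.cos(theta + m), cos)
    # stable cross-entropy via log-sum-exp
    mx = logits.max(axis=1, keepdims=True)
    lse = mx[:, 0] + np.log(np.exp(logits - mx).sum(axis=1))
    return float(np.mean(lse - logits[np.arange(len(labels)), labels]))
```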
6. Computational Complexity
PFMs exhibit favorable computational characteristics:
- Co-attention-based PFMs: Computational cost is dominated by the projection and correlation matrix multiplications and softmax normalization, scaling linearly with input spatial size and channel dimension (Su et al., 17 Oct 2025).
- Frequency-domain PFMs: Dominant costs are two FFTs + IFFTs per channel ($O(HW \log HW)$ each) and a small number of convolutions, negligible relative to transformer backbones (Qu et al., 24 Mar 2026).
- Dual-phase PFMs: Shallow fusion is $O(HWC)$ and parameter-free; deep fusion per M³ block is $O(NC)$ for $N$ tokens, with overall scaling nearly linear in sequence length and number of blocks—significantly cheaper than canonical Transformer fusion at $O(N^2 C)$ (Li et al., 2024).
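The scaling gap between linear and quadratic fusion can be made concrete with a back-of-the-envelope operation count; the counts below are illustrative proxies, not measured costs.

```python
def linear_fusion_ops(n_tokens, dim):
    # state-space (Mamba-style) scan: one pass over the sequence
    return n_tokens * dim

def quadratic_fusion_ops(n_tokens, dim):
    # canonical self-attention: every token attends to every other token
    return n_tokens ** 2 * dim

N, C = 4096, 256
ratio = quadratic_fusion_ops(N, C) / linear_fusion_ops(N, C)
print(ratio)  # -> 4096.0; the gap grows linearly with sequence length
```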
7. Empirical Performance and Comparative Analysis
PFMs confer measurable gains over baseline or standard fusion strategies:
| Task/Domain | Baseline | PFM Variant | Metric/Gain | Reference |
|---|---|---|---|---|
| Speaker ID (VoxCeleb1) | FBank: 96.48% Top-1 | Co-attn PFM: 97.20% | +0.72% abs.; Top-1 accuracy | (Su et al., 17 Oct 2025) |
| Burst Flicker Removal | 3x3 CNN baseline | Phase-corr. PFM | +0.279 dB PSNR | (Qu et al., 24 Mar 2026) |
| IR-Visible Image Fusion (MSRS) | Best prior SOTA | Dual-phase PFM | MI +0.07, VIF +0.03, [email protected] +0.003 | (Li et al., 2024) |
A critical finding across studies is that explicit modeling and exploitation of phase information—not merely concatenation or linear combination—enables more robust, context- and content-adaptive fusion. In speaker recognition, this leads to both improved Top-1 accuracy and reduced EER, with co-attention fusion outperforming classical decision- and feature-level fusion (Su et al., 17 Oct 2025). In flicker suppression, the ability to gate unreliable frequency bands directly improves PSNR and eliminates non-aligned artifacts (Qu et al., 24 Mar 2026). In image fusion, sequential phase-based operations deliver increases across mutual information (MI), visual information fidelity (VIF), and downstream object detection metrics (Li et al., 2024).
A plausible implication is that future general-purpose fusion architectures may benefit from embedding PFMs or analogous explicitly phase-driven modules, especially in contexts with structured, cross-modality or periodic phenomena.