Convolution-based Mixture of Experts (CMoE)
- Convolution-based Mixture of Experts is an architectural framework that employs a feature extractor, specialized MLP experts, and a gating mechanism to adaptively route inputs.
- The methodology combines cross-entropy loss with a balancing regularization term over the experts, yielding accuracy gains of 5–10% in underwater acoustic target recognition.
- CMoE offers computational savings and enhanced interpretability by enabling per-input expert specialization and dynamic routing across both acoustic and vision tasks.
Convolution-based Mixture of Experts (CMoE) is an architectural paradigm that integrates the Mixture of Experts (MoE) framework with convolutional and MLP-based models, enabling structured, input-adaptive routing and specialization of sub-networks. This approach exploits a collection of independent expert networks and a dynamic routing mechanism, facilitating fine-grained modeling of complex, heterogeneous input spaces. CMoE variants have been proposed for domains such as underwater acoustic target recognition, where conventional approaches struggle with high intra-class diversity and distributional complexity (Xie et al., 2024). Related methods extend the paradigm to per-channel dynamic gating in convolutional networks, yielding computational savings and improvements in accuracy on vision tasks (Wang et al., 2018).
1. Architectural Framework
At its core, CMoE comprises three key components: a backbone feature extractor, a bank of expert MLPs, and a routing/gating mechanism.
- Feature Backbone: For underwater acoustic target recognition, the input (e.g., spectrograms) is mapped to a fixed-dimensional embedding by a customized ResNet-AP backbone (Xie et al., 2024).
- Expert Networks: A set of $K$ experts $\{E_1, \dots, E_K\}$, each implemented as a two-layer MLP with batch normalization and ReLU activation:
$$E_k(z) = W_2^{(k)}\,\mathrm{ReLU}\big(\mathrm{BN}\big(W_1^{(k)} z + b_1^{(k)}\big)\big) + b_2^{(k)},$$
where $W_1^{(k)}$ and $W_2^{(k)}$ are weight matrices, $b_1^{(k)}, b_2^{(k)}$ are bias parameters, $z$ is the backbone embedding, and the output dimension equals the number of classes $C$.
- Gating/Routing Network: A single linear layer parameterized by $W_g$ and $b_g$, which generates expert scores $s = W_g z + b_g \in \mathbb{R}^{K}$. Selection probabilities are computed as $p = \mathrm{softmax}(s)$, with one expert selected by $k^* = \arg\max_k p_k$ for each sample.
An optional residual expert $E_r$ (not gated) can be included, producing logits $E_r(z)$. The final logits are
$$\hat{y} = E_{k^*}(z) + E_r(z).$$
This structure provides independent parameter spaces for experts and enables specialization for highly variable input distributions (Xie et al., 2024).
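A minimal PyTorch sketch of this head-level organization is given below; the module and parameter names (`CMoEHead`, `embed_dim`, `hidden_dim`, `num_experts`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CMoEHead(nn.Module):
    """Mixture-of-experts head over backbone embeddings (illustrative sketch)."""

    def __init__(self, embed_dim, num_classes, num_experts=4,
                 hidden_dim=256, use_residual=True):
        super().__init__()
        # Each expert: two-layer MLP with BatchNorm and ReLU.
        def make_expert():
            return nn.Sequential(
                nn.Linear(embed_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        self.experts = nn.ModuleList([make_expert() for _ in range(num_experts)])
        # Gating network: a single linear layer producing one score per expert.
        self.gate = nn.Linear(embed_dim, num_experts)
        # Optional ungated residual expert with the same structure.
        self.residual = make_expert() if use_residual else None

    def forward(self, z):
        # z: (batch, embed_dim) embeddings from the backbone (e.g., ResNet-AP).
        scores = self.gate(z)                  # (batch, num_experts)
        probs = F.softmax(scores, dim=-1)      # selection probabilities
        top1 = probs.argmax(dim=-1)            # hard top-1 routing per sample

        # For clarity, all experts are evaluated and the selected expert's logits
        # are gathered per sample; a sparse dispatch would run only the chosen expert.
        all_logits = torch.stack([expert(z) for expert in self.experts], dim=1)
        logits = all_logits[torch.arange(z.size(0)), top1]

        if self.residual is not None:
            logits = logits + self.residual(z)  # add ungated residual logits
        return logits, probs, top1
```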
2. Mathematical Formulation and Learning Objective
The CMoE loss function integrates standard cross-entropy loss with a regularization term enforcing balanced utilization among experts. For a batch of $N$ samples:
- Routing statistics:
  - $f_k$: the fraction of samples hard-routed (top-1) to expert $k$
  - $P_k$: the average soft assignment probability for expert $k$ over the batch
- Balancing regularization:
$$\mathcal{L}_{\mathrm{balance}} = \alpha \cdot K \sum_{k=1}^{K} f_k\, P_k,$$
with the balancing coefficient $\alpha$ typically set to a small constant.
The overall loss is $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{balance}}$, where $\mathcal{L}_{\mathrm{CE}}$ denotes the cross-entropy loss.
Optimization proceeds via AdamW with a constant learning rate and weight decay, over a training horizon of 200 epochs, with batch size and scheduling tuned on validation splits (Xie et al., 2024).
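A compact sketch of this objective and training setup is shown below; it assumes the routing statistics are computed per batch as defined above, and the coefficient value and optimizer hyperparameters are placeholders rather than the reported settings.

```python
import torch
import torch.nn.functional as F


def cmoe_loss(logits, targets, gate_probs, top1, alpha=0.01):
    """Cross-entropy plus expert-balancing penalty (illustrative sketch).

    gate_probs: (N, K) soft gate probabilities; top1: (N,) hard expert indices.
    alpha is a placeholder balancing coefficient, not the paper's exact value.
    """
    num_experts = gate_probs.size(1)
    # f_k: fraction of samples hard-routed to expert k.
    f = torch.bincount(top1, minlength=num_experts).float() / top1.numel()
    # P_k: average soft assignment probability for expert k.
    P = gate_probs.mean(dim=0)
    balance = alpha * num_experts * torch.sum(f * P)
    return F.cross_entropy(logits, targets) + balance


# Placeholder optimizer setup mirroring the reported recipe (AdamW, 200 epochs);
# the learning rate and weight decay here are stand-in values.
model = torch.nn.Linear(128, 9)  # stand-in for backbone + CMoE head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
num_epochs = 200
```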
3. Computational Structure and Connections to Other MoE Approaches
The convolution-based mixture of experts described above operates at the level of global embeddings and expert networks. In contrast, DeepMoE (Wang et al., 2018) applies the MoE principle locally to all convolutional layers within a deep model:
- Every convolutional layer is reinterpreted as a sum of “expert” input channels, with a per-layer gating vector computing dynamic selection and scaling of those channels.
- Gating vectors are derived from a small shared embedding network, with per-layer independent heads.
- The approach enables per-example, per-layer dynamic sparsification, yielding computational savings while maintaining or expanding representational capacity.
- Sparse execution is encouraged via an $\ell_1$ penalty on the gates.
- Empirical studies confirm accuracy gains and FLOP reductions in vision benchmarks.
A key distinction: the CMoE (Xie et al., 2024) approach gates at the level of whole-expert MLPs using high-level embeddings, whereas DeepMoE (Wang et al., 2018) gates individual channels or groups at each convolutional layer.
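The contrast can be made concrete with a minimal per-channel gating sketch in PyTorch. Here the gate is computed from a pooled embedding of the layer's input, which is a simplification of DeepMoE's shared shallow embedding network, and all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelGatedConv(nn.Module):
    """Per-channel gated convolution in the spirit of DeepMoE (sketch only)."""

    def __init__(self, in_channels, out_channels, embed_dim=64, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        # Small embedding plus a per-layer gating head: one gate per input channel.
        self.embed = nn.Linear(in_channels, embed_dim)
        self.gate_head = nn.Linear(embed_dim, in_channels)

    def forward(self, x):
        # x: (batch, in_channels, H, W)
        pooled = x.mean(dim=(2, 3))  # global average pooling over spatial dims
        gates = F.relu(self.gate_head(F.relu(self.embed(pooled))))  # non-negative gates
        # Scale (and possibly zero out) input channels before the convolution:
        # each input channel acts as an "expert" weighted per example.
        x = x * gates.unsqueeze(-1).unsqueeze(-1)
        return self.conv(x), gates


# An L1 penalty on the gates (e.g., gates.abs().mean()) encourages sparse,
# per-example channel selection and the associated FLOP savings.
```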
4. Experimental Validation and Quantitative Results
In the context of underwater acoustic target recognition (Xie et al., 2024), CMoE was evaluated on Shipsear (9 classes), DTIL (2 classes, private), and DeepShip (4 classes) datasets. Acoustic features included STFT, Mel spectrogram, Bark spectrogram, and CQT spectrogram. The evaluation metric was segment-level accuracy, with train/test splits separated by audio track to prevent leakage.
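For context, a typical front-end for two of these features (STFT and Mel spectrograms) might look like the torchaudio sketch below; the file name, window, hop, and mel-bin settings are assumptions rather than the paper's configuration, and Bark or CQT features would require additional libraries.

```python
import torch
import torchaudio

# Hypothetical recording; all parameters below are illustrative placeholders.
waveform, sample_rate = torchaudio.load("ship_segment.wav")

stft = torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=512)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=128
)

stft_spec = stft(waveform)                  # (channels, freq_bins, frames)
log_mel = torch.log(mel(waveform) + 1e-6)   # log-compressed Mel spectrogram
```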
Performance summary (segment-level accuracy):
| Dataset/Feature | Baseline (ResNet-AP) | CMoE | CMoE + balance | RCMoE + balance |
|---|---|---|---|---|
| Shipsear (STFT) | 75.24 | 84.91 | 86.21 | 85.34 |
| Shipsear (Mel) | 77.14 | 83.59 | 85.35 | 84.48 |
| Shipsear (Bark) | 72.86 | 81.33 | 84.48 | 83.62 |
| Shipsear (CQT) | 73.33 | 80.48 | 82.76 | 82.76 |
| DTIL (STFT) | 95.93 | 96.61 | 97.89 | 98.17 |
| DeepShip (CQT) | 77.82 | 77.09 | 79.62 | 78.76 |
These results indicate absolute accuracy gains of 5–10% on Shipsear and significant improvements on DTIL and DeepShip. In fine-grained analyses, CMoE achieved close to 100% accuracy on under-represented small classes where baselines underperformed (sub-50%) (Xie et al., 2024).
Ablation studies confirmed that:
- The balancing regularization prevents under-training or collapse of individual experts.
- The optional residual expert aids recovery from routing errors but at times reduces sparsity advantages.
- The optimal number of experts depends on dataset size and feature redundancy; a larger number of experts risks overfitting in small-data regimes.
5. Interpretability and Visualization
Extensive visualization elucidated the internal dynamics of CMoE (Xie et al., 2024):
- Expert clustering: Heatmaps revealed that large-vessel classes (e.g., ocean liner, ro-ro) preferentially route to distinct experts versus small boats, reflecting emergent specialization tied to acoustic characteristics.
- Per-sample routing: Routings for spectrograms exhibiting different within-class motion states (e.g., motorboat start/stop, arrival, passing) showed shared and unique expert routes, demonstrating disentanglement of intra-class diversity.
- Inter-class discrimination: Visually similar spectrograms from different classes (e.g., motorboat vs. passenger-ship) were routed to distinct experts, indicating resilience to inter-class similarity.
This suggests that the gating network learns semantically meaningful and task-aligned partitions, with expert assignment reflecting physically relevant factors such as vessel type or motion.
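A routing heatmap of this kind can be reproduced by tallying hard expert assignments per class at inference time; the helper below is an illustrative sketch, and names such as `expert_routing_heatmap` are not from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt


def expert_routing_heatmap(class_ids, expert_ids, num_classes, num_experts,
                           class_names=None):
    """Plot class-by-expert routing frequencies from per-sample assignments."""
    counts = np.zeros((num_classes, num_experts))
    for c, e in zip(class_ids, expert_ids):
        counts[c, e] += 1
    # Normalize rows so each class shows a routing distribution over experts.
    freqs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

    fig, ax = plt.subplots()
    im = ax.imshow(freqs, aspect="auto", cmap="viridis")
    ax.set_xlabel("Expert index")
    ax.set_ylabel("Class")
    if class_names is not None:
        ax.set_yticks(range(num_classes))
        ax.set_yticklabels(class_names)
    fig.colorbar(im, ax=ax, label="Routing frequency")
    return freqs
```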
6. Comparative Perspective and Extensions
The convolution-based mixture of experts paradigm enables both increased model expressivity and efficient computation by restricting the activated parameter subspace per input. Within deep vision models (Wang et al., 2018), treating channels as experts and leveraging per-layer sparse gating realizes an effective exponential ensemble of subnetworks under dynamic selection. This yields higher accuracy and computational economies compared to both static channel pruning and naive widening strategies.
Empirical evidence indicates that applying MoE gating at multiple layers outperforms single-point gating and, when paired with careful regularization, maintains or reduces computational cost at equal or improved accuracy.
Proposed future directions for CMoE-type models include:
- Incorporation of physically interpretable routing signals (e.g., propeller blade count relevant for vessel classification).
- Exploration of deeper, multi-stage expert/routing hierarchies.
- Extension to structured operators and attention mechanisms by leveraging learned global or hierarchical embeddings for gating (Wang et al., 2018, Xie et al., 2024).
The CMoE approach represents a significant advance in the fine-grained modeling of complex, heterogeneous data regimes, demonstrating improved robustness and accuracy in challenging real-world acoustic environments.