Convolution-based Mixture of Experts (CMoE)

Updated 4 April 2026
  • Convolution-based Mixture of Experts is an architectural framework that employs a feature extractor, specialized MLP experts, and a gating mechanism to adaptively route inputs.
  • The methodology uses balanced regularization combined with cross-entropy loss to achieve significant accuracy improvements, with gains of 5–10% in underwater acoustic target recognition.
  • CMoE offers computational savings and enhanced interpretability by enabling per-input expert specialization and dynamic routing across both acoustic and vision tasks.

Convolution-based Mixture of Experts (CMoE) is an architectural paradigm that integrates the Mixture of Experts (MoE) framework with convolutional and MLP-based models, enabling structured, input-adaptive routing and specialization of sub-networks. This approach exploits a collection of independent expert networks and a dynamic routing mechanism, facilitating fine-grained modeling of complex, heterogeneous input spaces. CMoE variants have been proposed for domains such as underwater acoustic target recognition, where conventional approaches struggle with high intra-class diversity and distributional complexity (Xie et al., 2024). Related methods extend the paradigm to per-channel dynamic gating in convolutional networks, yielding computational savings and improvements in accuracy on vision tasks (Wang et al., 2018).

1. Architectural Framework

At its core, CMoE comprises three key components: a backbone feature extractor, a bank of expert MLPs, and a routing/gating mechanism.

  • Feature Backbone: For underwater acoustic target recognition, the input $x \in \mathbb{R}^{C \times T \times F}$ (e.g., spectrograms) is mapped to a fixed-dimensional embedding $r = F(x) \in \mathbb{R}^{512}$ by a customized ResNet-AP backbone (Xie et al., 2024).
  • Expert Networks: A set of $m$ experts $\{E_1, \ldots, E_m\}$, each implemented as a two-layer MLP with batch normalization and ReLU activation. In mathematical form:

$$E_j(r) = W_j^{(2)} \, \sigma\big( \mathrm{BN}( W_j^{(1)} r + b_j^{(1)} ) \big) + b_j^{(2)}$$

where $W_j^{(1)} \in \mathbb{R}^{128 \times 512}$, $W_j^{(2)} \in \mathbb{R}^{C \times 128}$, $b_j^{(1)}, b_j^{(2)}$ are bias parameters, $\sigma(\cdot) = \operatorname{ReLU}(\cdot)$, and $C$ is the number of classes.

  • Gating/Routing Network: A single linear layer parameterized by $W_g \in \mathbb{R}^{m \times 512}$ and $b_g \in \mathbb{R}^{m}$, which generates expert scores $s = W_g r + b_g$. Selection probabilities are computed as $p = \operatorname{softmax}(s)$, with one expert selected by $k^* = \arg\max_j p_j$ for each sample.

An optional residual expert $E_{\mathrm{res}}$ (not gated) can be included, producing logits $E_{\mathrm{res}}(r)$. The final logits are

$$z = E_{k^*}(r) + E_{\mathrm{res}}(r)$$

This structure provides independent parameter spaces for experts and enables specialization for highly variable input distributions (Xie et al., 2024).
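The pipeline above (embedding, top-1 gating, expert MLPs, optional residual expert) can be sketched concretely. The NumPy snippet below is a minimal illustration under the stated shapes (512-d embedding, 128-d hidden layer); the helper names are illustrative, not the authors' code, batch normalization is approximated by batch standardization, and all experts are evaluated densely rather than dispatching only to the selected one as an efficient implementation would.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes from the paper's description: 512-d embedding, 128-d hidden,
# C classes, m gated experts plus one ungated residual expert.
D, H, C, m = 512, 128, 9, 4

def expert_mlp(r, W1, b1, W2, b2):
    """Two-layer expert: E_j(r) = W2 @ relu(BN(W1 @ r + b1)) + b2.
    BatchNorm is approximated by standardizing over the batch."""
    h = r @ W1.T + b1                          # (N, H)
    h = (h - h.mean(0)) / (h.std(0) + 1e-5)    # stand-in for BatchNorm
    return np.maximum(h, 0.0) @ W2.T + b2      # (N, C)

experts = [
    dict(W1=rng.normal(0, 0.02, (H, D)), b1=np.zeros(H),
         W2=rng.normal(0, 0.02, (C, H)), b2=np.zeros(C))
    for _ in range(m + 1)                      # last entry is the residual expert
]
Wg = rng.normal(0, 0.02, (m, D))               # gating layer scores only the m gated experts
bg = np.zeros(m)

def cmoe_forward(r):
    """r: (N, D) embeddings from the backbone F(x)."""
    scores = r @ Wg.T + bg                     # (N, m) expert scores
    p = np.exp(scores - scores.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)               # softmax selection probabilities
    k = p.argmax(1)                            # top-1 expert index per sample
    all_out = np.stack([expert_mlp(r, **experts[j]) for j in range(m)])  # (m, N, C)
    gated = all_out[k, np.arange(len(r))]      # pick each sample's selected expert
    return gated + expert_mlp(r, **experts[-1]), p, k   # add ungated residual expert

logits, p, k = cmoe_forward(rng.normal(size=(8, D)))
```

In a real implementation, only the selected expert runs per sample, which is where the computational savings of sparse routing come from.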

2. Mathematical Formulation and Learning Objective

The CMoE loss function integrates standard cross-entropy loss with a regularization term enforcing balanced utilization among experts. For a batch of $N$ samples:

  • Routing statistics:
    • $f_j$: the fraction of samples in the batch routed (top-1) to expert $j$
    • $P_j$: the average soft assignment probability given to expert $j$
  • Balancing regularization:

$$\mathcal{L}_{\mathrm{balance}} = \alpha \, m \sum_{j=1}^{m} f_j \, P_j$$

with $\alpha$ a small positive balancing coefficient.

The overall loss is $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{balance}}$, where $\mathcal{L}_{\mathrm{CE}}$ denotes the cross-entropy loss.
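The balancing term can be computed directly from the routing probabilities. The sketch below follows the standard load-balancing formulation described above; the coefficient value `alpha=0.01` is an illustrative assumption, not the paper's setting.

```python
import numpy as np

def balance_loss(p, alpha=0.01):
    """Load-balancing regularizer from routing probabilities.

    p: (N, m) softmax routing probabilities for a batch.
    alpha: balancing coefficient (illustrative value, a tuned hyperparameter).
    """
    N, m = p.shape
    k = p.argmax(1)                            # top-1 routing decisions
    f = np.bincount(k, minlength=m) / N        # fraction of samples per expert
    P = p.mean(0)                              # average soft assignment per expert
    return alpha * m * np.sum(f * P)           # minimized by even expert usage

# Balanced routing: each of 4 experts receives a quarter of 32 samples.
p_balanced = np.tile(np.eye(4), (8, 1))
# Collapsed routing: every sample goes to expert 0.
p_collapsed = np.tile(np.array([[1.0, 0.0, 0.0, 0.0]]), (32, 1))
```

Collapsed routing incurs a strictly larger penalty than balanced routing, which is what prevents the gating network from sending every sample to one expert.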

Optimization proceeds via AdamW with a constant learning rate, weight decay, and a training horizon of 200 epochs, with batch size and scheduling tuned on validation splits (Xie et al., 2024).

3. Computational Structure and Connections to Other MoE Approaches

The convolution-based mixture of experts described above operates at the level of global embeddings and expert networks. In contrast, DeepMoE (Wang et al., 2018) applies the MoE principle locally to all convolutional layers within a deep model:

  • Every convolutional layer is reinterpreted as a sum of “expert” input channels, with a per-layer gating vector computing dynamic selection and scaling.
  • Gating vectors are derived from a small shared embedding network, with per-layer independent heads.
  • The approach enables per-example, per-layer dynamic sparsification, yielding computational savings while maintaining or expanding representational capacity.
  • Sparse execution is encouraged via an $\ell_1$ penalty on the gates.
  • Empirical studies confirm accuracy gains and FLOP reductions in vision benchmarks.

A key distinction: the CMoE (Xie et al., 2024) approach gates at the level of whole-expert MLPs using high-level embeddings, whereas DeepMoE (Wang et al., 2018) gates individual channels or groups at each convolutional layer.
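The channel-level gating on the DeepMoE side of this distinction can be sketched as follows. Shapes, the shared-embedding input, and the gating head are illustrative assumptions; the point is that a ReLU gate zeroes some input channels per example before the convolution, and an $\ell_1$ penalty on the gate pushes toward sparser execution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative shapes: batch of 2, 16 input channels, 8x8 feature maps,
# 32-d shared embedding feeding this layer's gating head.
N, Cin, Hgt, Wdt = 2, 16, 8, 8
x = rng.normal(size=(N, Cin, Hgt, Wdt))   # feature map entering the conv layer
e = rng.normal(size=(N, 32))              # shared embedding (from a small side network)
Wg = rng.normal(0, 0.1, (Cin, 32))        # this layer's independent gating head

g = np.maximum(e @ Wg.T, 0.0)             # ReLU gate: many entries are exactly 0
sparsity = (g == 0.0).mean()              # fraction of channel "experts" skipped
x_gated = x * g[:, :, None, None]         # scale/suppress channels before convolving

# An l1 penalty on the gates encourages even sparser execution:
l1_penalty = np.abs(g).sum(axis=1).mean()
```

Channels whose gate value is zero contribute nothing downstream, so they can be skipped entirely at inference, which is the source of the FLOP reductions reported for vision benchmarks.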

4. Experimental Validation and Quantitative Results

In the context of underwater acoustic target recognition (Xie et al., 2024), CMoE was evaluated on Shipsear (9 classes), DTIL (2 classes, private), and DeepShip (4 classes) datasets. Acoustic features included STFT, Mel spectrogram, Bark spectrogram, and CQT spectrogram. The evaluation metric was segment-level accuracy, with train/test splits separated by audio track to prevent leakage.

Performance summary (segment-level accuracy):

| Dataset/Feature | Baseline (ResNet-AP) | CMoE | CMoE + balance | RCMoE + balance |
|---|---|---|---|---|
| Shipsear (STFT) | 75.24 | 84.91 | 86.21 | 85.34 |
| Shipsear (Mel) | 77.14 | 83.59 | 85.35 | 84.48 |
| Shipsear (Bark) | 72.86 | 81.33 | 84.48 | 83.62 |
| Shipsear (CQT) | 73.33 | 80.48 | 82.76 | 82.76 |
| DTIL (STFT) | 95.93 | 96.61 | 97.89 | 98.17 |
| DeepShip (CQT) | 77.82 | 77.09 | 79.62 | 78.76 |

These results indicate absolute accuracy gains of 5–10% on Shipsear and significant improvements on DTIL and DeepShip. In fine-grained analyses, CMoE achieved close to 100% accuracy on under-represented small classes where baselines underperformed (sub-50%) (Xie et al., 2024).

Ablation studies confirmed that:

  • The balancing regularization prevents under-training or collapse of individual experts.
  • The optional residual expert aids recovery from routing errors, but can at times reduce the advantages of sparsity.
  • The optimal number of experts depends on dataset size and feature redundancy; a higher $m$ risks overfitting in small-data regimes.

5. Interpretability and Visualization

Extensive visualization elucidated the internal dynamics of CMoE (Xie et al., 2024):

  • Expert clustering: Heatmaps revealed that large-vessel classes (e.g., ocean liner, ro-ro) preferentially route to distinct experts versus small boats, reflecting emergent specialization tied to acoustic characteristics.
  • Per-sample routing: Routings for spectrograms exhibiting different within-class motion states (e.g., motorboat start/stop, arrival, passing) showed shared and unique expert routes, demonstrating disentanglement of intra-class diversity.
  • Inter-class discrimination: Visually similar spectrograms from different classes (e.g., motorboat vs. passenger-ship) were routed to distinct experts, indicating resilience to inter-class similarity.

This suggests that the gating network learns semantically meaningful and task-aligned partitions, with expert assignment reflecting physically relevant factors such as vessel type or motion.
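A class-by-expert routing heatmap of the kind described above can be computed directly from top-1 routing decisions. The helper below is a plausible sketch, not the authors' analysis code.

```python
import numpy as np

def routing_heatmap(labels, expert_ids, n_classes, n_experts):
    """Row-normalized frequency with which each class routes to each expert.

    labels: iterable of class indices, one per sample.
    expert_ids: iterable of top-1 expert indices, one per sample.
    """
    H = np.zeros((n_classes, n_experts))
    for y, k in zip(labels, expert_ids):
        H[y, k] += 1
    # Each row sums to 1 (assumes every class appears at least once).
    return H / H.sum(axis=1, keepdims=True)

# Toy example: class 0 splits between experts 0 and 1; class 1 always uses expert 2.
H = routing_heatmap([0, 0, 1, 1], [0, 1, 2, 2], n_classes=2, n_experts=3)
```

Concentrated rows in such a heatmap indicate strong class-to-expert specialization, while diffuse rows indicate shared routing.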

6. Comparative Perspective and Extensions

The convolution-based mixture of experts paradigm enables both increased model expressivity and efficient computation by restricting the activated parameter subspace per input. Within deep vision models (Wang et al., 2018), treating channels as experts and leveraging per-layer sparse gating realizes an effective exponential ensemble of subnetworks under dynamic selection. This yields higher accuracy and computational economies compared to both static channel pruning and naive widening strategies.

Empirical evidence indicates that multi-layer application of MoE outperforms single-point gating and, when paired with careful regularization, maintains or reduces computational cost at fixed or improved accuracy.

The CMoE approach represents a significant advance in the fine-grained modeling of complex, heterogeneous data regimes, demonstrating improved robustness and accuracy in challenging real-world acoustic environments.
