CSI-SMoE: Sparse Experts for MIMO Transmission
- The paper demonstrates that integrating a sparse MoE in both encoder and decoder backbones improves PSNR by up to 1.5 dB over traditional models.
- CSI-SMoE dynamically selects experts using real-time channel state and semantic features, providing tailored processing for MIMO fading channels.
- Empirical results indicate significant LPIPS reductions and robust performance under variable SNR and bandwidth constraints.
The CSI-Sparse Mixture-of-Experts (CSI-SMoE) framework is an adaptive semantic communication system designed for end-to-end wireless image transmission over multiple-input multiple-output (MIMO) fading channels. Its primary innovation is the integration of a sparse Mixture-of-Experts (MoE) architecture within both the semantic encoder and decoder backbones. Expert selection is governed jointly by real-time channel state information (CSI) and semantic features of image patches, so the system adapts its internal processing to both instantaneous channel conditions and content variations. This dual-driven routing strategy addresses the rigidity and limited robustness of conventional single-driven or fixed-model semantic communication methods (Wan et al., 3 Apr 2026).
1. End-to-End System Model and Notation
CSI-SMoE transmits an RGB image over a MIMO fading channel under an average power constraint. The system consists of three key stages:
- Semantic Encoder: Maps the input image to a complex-valued matrix of transmit symbols $\mathbf{X}$ spanning a fixed number of channel uses. The bandwidth ratio is defined from the number of channel uses relative to the source image dimension.
- MIMO Fading Channel: Modeled as $\mathbf{Y} = \mathbf{H}\mathbf{X} + \mathbf{N}$, where $\mathbf{H}$ is the instantaneous CSI and $\mathbf{N}$ is additive complex Gaussian noise.
- Semantic Decoder: Recovers the image from the received signal $\mathbf{Y}$ and, optionally, the CSI $\mathbf{H}$.
The architectural novelty is the integration of an MoE “bottleneck” into the encoder and decoder networks, leveraging both real-time CSI features and semantic features for dynamic expert selection.
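The channel stage above can be sketched in a few lines of standard-library Python. This is only an illustration: the i.i.d. Rayleigh fading model, the SNR convention (average transmit power per symbol divided by noise variance), and the function names `rayleigh_channel`, `normalize_power`, and `mimo_channel` are assumptions made for the example, not details taken from the paper.

```python
import math
import random

def rayleigh_channel(n_rx, n_tx, rng):
    """Draw an n_rx x n_tx matrix with i.i.d. CN(0, 1) entries (assumed fading model)."""
    s = math.sqrt(0.5)
    return [[complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(n_tx)]
            for _ in range(n_rx)]

def normalize_power(x, power=1.0):
    """Scale the symbol matrix so its average per-symbol power equals `power`."""
    count = sum(len(row) for row in x)
    avg = sum(abs(v) ** 2 for row in x for v in row) / count
    scale = math.sqrt(power / avg)
    return [[v * scale for v in row] for row in x]

def mimo_channel(h, x, snr_db, rng, power=1.0):
    """Return Y = H X + N; the noise variance follows from the assumed SNR convention."""
    noise_var = power / (10 ** (snr_db / 10))
    s = math.sqrt(noise_var / 2)
    n_tx, uses = len(x), len(x[0])
    y = []
    for h_row in h:
        out = []
        for t in range(uses):
            signal = sum(h_row[j] * x[j][t] for j in range(n_tx))
            out.append(signal + complex(rng.gauss(0, s), rng.gauss(0, s)))
        y.append(out)
    return y

rng = random.Random(0)
symbols = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(8)]
           for _ in range(2)]
x = normalize_power(symbols)          # 2 transmit antennas, 8 channel uses
h = rayleigh_channel(2, 2, rng)       # one fading realization
y = mimo_channel(h, x, snr_db=10.0, rng=rng)
```

In an end-to-end system the encoder output would take the place of `symbols`, and both `y` and `h` would be passed to the decoder.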
2. Adaptive MoE Swin Transformer Architecture
The core architectural unit is the Adaptive-MoE Swin-Transformer (AD-MoE ST) block. Each block processes patch-level features using two sublayers:
- Window-based Multi-Head Self-Attention (W-MSA/SW-MSA), applied within local (shifted) windows as in a Swin Transformer block.
- Adaptive-MoE MLP: Instead of a monolithic MLP, the design employs a set of always-active shared experts and a set of sparsely routed experts. The outputs of the shared experts are combined with the outputs of the adaptively chosen routed-expert subset, with each routed expert weighted by its routing probability.
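As a rough illustration of how shared and routed experts could be combined, the sketch below sums all shared-expert outputs and adds the probability-weighted outputs of the active routed experts. The function name `moe_mlp`, the toy experts, and the choice not to renormalize probabilities over the active subset are assumptions for the example; the source does not specify these details.

```python
def moe_mlp(x, shared_experts, routed_experts, active, probs):
    """Combine always-on shared experts with a sparse subset of routed experts.

    x: feature vector (list of floats)
    active: indices of the routed experts selected for this input
    probs: routing probability of every routed expert
    """
    out = [0.0] * len(x)
    # Shared experts process every input.
    for expert in shared_experts:
        out = [o + e for o, e in zip(out, expert(x))]
    # Only the selected routed experts are evaluated (assumed unnormalized weights).
    for i in active:
        out = [o + probs[i] * e for o, e in zip(out, routed_experts[i](x))]
    return out

# Toy experts: simple scalings stand in for small MLPs.
shared = [lambda v: list(v)]
routed = [lambda v: [2.0 * a for a in v], lambda v: [3.0 * a for a in v]]
result = moe_mlp([1.0, 2.0], shared, routed, active=[1], probs=[0.5, 0.5])
```

Because inactive routed experts are never called, compute scales with the size of the active subset rather than with the total number of experts.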
Spatial resolution is reduced between stages by "downsamplers" built from layer normalization and a fully connected layer. Patch embedding at the encoder input employs a stride-2 convolution.
3. Dynamic Expert Gating and Routing Mechanism
Expert routing within each AD-MoE MLP is determined by a dynamic gating mechanism that fuses a semantic feature vector with a CSI feature vector:
- Semantic feature: Global average pooling of the patch-level features, followed by a fully connected layer.
- CSI feature: The channel matrix is preprocessed by flattening or by a singular-value summary, then passed through two FC layers.
A lightweight gating network produces one logit per routed expert, and a softmax converts these logits to selection probabilities.
Threshold-based Top-K Routing: Rather than using a fixed K, the active experts are chosen by ranking the scores, starting with the top-scoring expert and adding further experts whose score gap does not exceed a threshold, up to a maximum number of active experts. This makes the number of active experts depend on both the input and the CSI.
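The threshold-based selection rule can be sketched as follows. This is a minimal interpretation under one stated assumption: the "score gap" is measured against the top-scoring expert (measuring it against the previously added expert would be another plausible reading). The names `softmax`, `threshold_top_k`, `gap_threshold`, and `k_max` are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_top_k(probs, gap_threshold, k_max):
    """Select a variable number of experts.

    Always keeps the top-scoring expert, then adds experts in descending
    score order while their gap to the top score is within `gap_threshold`
    (assumed gap definition), stopping at `k_max` experts.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[0]
    active = [top]
    for i in order[1:]:
        if len(active) >= k_max:
            break
        if probs[top] - probs[i] > gap_threshold:
            break  # scores are sorted, so later experts are even further away
        active.append(i)
    return active

confident = threshold_top_k([0.90, 0.05, 0.05], gap_threshold=0.1, k_max=3)
ambiguous = threshold_top_k([0.30, 0.28, 0.27, 0.15], gap_threshold=0.1, k_max=3)
```

With a peaked distribution only one expert is activated, while a flatter distribution activates several, which is how the number of active experts varies with the input and the channel.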
To encourage load balancing and prevent expert collapse, three training-time MoE regularizers are incorporated:
- Load-balance loss
- Entropy regularizer
- Variance regularizer
These terms are expressed using the empirical activation frequency of each expert.
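The exact formulas for these regularizers are not reproduced here, so the sketch below only illustrates commonly used forms of the three quantities: a Switch-Transformer-style load-balance term, the entropy of the mean routing distribution, and the variance of the activation frequencies. All function names and the specific formulas are assumptions for illustration and may differ from the paper's definitions.

```python
import math

def activation_frequency(active_sets, num_experts):
    """Fraction of inputs for which each expert was active."""
    counts = [0] * num_experts
    for active in active_sets:
        for i in active:
            counts[i] += 1
    return [c / len(active_sets) for c in counts]

def mean_probs(prob_rows):
    """Average routing probability of each expert over a batch."""
    n = len(prob_rows)
    return [sum(col) / n for col in zip(*prob_rows)]

def balance_terms(active_sets, prob_rows):
    """Return (load_balance, entropy, variance) under the assumed definitions."""
    num_experts = len(prob_rows[0])
    f = activation_frequency(active_sets, num_experts)
    p = mean_probs(prob_rows)
    # Switch-Transformer-style term: N * sum_i f_i * p_i (assumed form).
    load_balance = num_experts * sum(fi * pi for fi, pi in zip(f, p))
    # Entropy of the batch-averaged routing distribution.
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    # Variance of activation frequencies across experts.
    mean_f = sum(f) / num_experts
    variance = sum((fi - mean_f) ** 2 for fi in f) / num_experts
    return load_balance, entropy, variance

balanced = balance_terms([[0], [1]], [[0.5, 0.5], [0.5, 0.5]])
collapsed = balance_terms([[0], [0]], [[1.0, 0.0], [1.0, 0.0]])
```

Under these definitions a balanced router gives a lower load-balance term, higher entropy, and zero variance, while a collapsed router does the opposite.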
4. Semantic Encoder/Decoder Design and Loss Function
The semantic encoder backbone comprises a convolutional patch embedding, four stages (two Swin Transformer stages and two AD-MoE Swin-Transformer stages), and a final fully connected normalization layer. Its hyperparameters cover:
- Stages: the per-stage configuration
- Channel dimensions: the per-stage feature widths
- Experts: the number of shared experts, the number of routed experts, the maximum number of active experts, and the routing threshold
The decoder is architecturally symmetric and reuses weights for AD-MoE Swin Transformer blocks.
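The main training loss is the mean squared error (MSE) between the source image $\mathbf{S}$ and its reconstruction $\hat{\mathbf{S}}$, averaged over the $n$ pixel values:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n}\,\lVert \mathbf{S} - \hat{\mathbf{S}} \rVert_2^2$$
An optional perceptual/LPIPS loss $\mathcal{L}_{\mathrm{LPIPS}}$ can be included. The composite objective is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda_{\mathrm{p}}\,\mathcal{L}_{\mathrm{LPIPS}} + \lambda_{\mathrm{lb}}\,\mathcal{L}_{\mathrm{lb}} + \lambda_{\mathrm{ent}}\,\mathcal{L}_{\mathrm{ent}} + \lambda_{\mathrm{var}}\,\mathcal{L}_{\mathrm{var}}$$
with the weights $\lambda_{\mathrm{p}}$, $\lambda_{\mathrm{lb}}$, $\lambda_{\mathrm{ent}}$, $\lambda_{\mathrm{var}}$ tuned by cross-validation.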
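Because the evaluation reports PSNR while training minimizes MSE, it may help to see how the two relate. The sketch below uses the standard definition PSNR = 10 log10(peak^2 / MSE); the assumption that pixel values lie in [0, 1] (so `peak=1.0`) and the function names are illustrative choices.

```python
import math

def mse(reference, reconstruction):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)

def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the maximum pixel value (assumed 1.0)."""
    err = mse(reference, reconstruction)
    if err == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / err)

source = [0.0, 0.5, 1.0, 0.25]
decoded = [0.1, 0.6, 0.9, 0.35]      # every pixel off by 0.1
quality_db = psnr(source, decoded)   # MSE of about 0.01 gives about 20 dB
```

Each halving of the MSE raises PSNR by roughly 3 dB, which gives a sense of scale for gains on the order of 1 dB.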
5. Training Protocol and Implementation
Training is performed end-to-end with the Adam optimizer, using 900 images from DIV2K for training and 24 crops from Kodak for evaluation. Fading channel realizations are sampled for each mini-batch. CSI preprocessing entails splitting the channel matrix into its real and imaginary parts, flattening them into a single real-valued vector (twice as many entries as the channel matrix has elements), and extracting features with two FC layers. The Swin window size is fixed throughout.
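The CSI preprocessing step can be sketched as below. The ordering of the flattened vector (all real parts, then all imaginary parts), the ReLU between the two FC layers, and the helper names are assumptions for illustration; the toy weights are hypothetical and stand in for learned parameters.

```python
def csi_to_vector(h):
    """Split a complex channel matrix into real/imaginary parts and flatten.

    For an n_rx x n_tx matrix the result has 2 * n_rx * n_tx real entries.
    Assumed ordering: all real parts first, then all imaginary parts.
    """
    real = [v.real for row in h for v in row]
    imag = [v.imag for row in h for v in row]
    return real + imag

def linear(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def csi_features(h, w1, b1, w2, b2):
    """Two FC layers on the flattened CSI (ReLU in between is an assumption)."""
    return linear(relu(linear(csi_to_vector(h), w1, b1)), w2, b2)

h = [[1 + 2j, 3 - 1j], [0j, -1 + 1j]]
vec = csi_to_vector(h)                      # 8 real values for a 2 x 2 channel
w1, b1 = [[1.0] * 8, [-1.0] * 8], [0.0, 0.0]  # hypothetical weights
w2, b2 = [[2.0, 1.0]], [0.5]
feat = csi_features(h, w1, b1, w2, b2)
```

The resulting feature vector is what the gating network would combine with the semantic feature when scoring the routed experts.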
6. Quantitative Evaluation and Ablation
Performance is benchmarked against DeepJSCC and SwinJSCC baselines for two MIMO antenna configurations. CSI-SMoE demonstrates PSNR gains (1.1 to 1.3 dB in the table below) and LPIPS reductions relative to SwinJSCC at equivalent bandwidth and SNR. For a fixed bandwidth and MIMO configuration on Kodak, key PSNR results (in dB) are:
| SNR (dB) | DeepJSCC | SwinJSCC | CSI-SMoE |
|---|---|---|---|
| 0 | 24.3 | 26.1 | 27.2 |
| 5 | 25.7 | 27.4 | 28.6 |
| 10 | 26.8 | 28.3 | 29.6 |
| 15 | 27.4 | 28.9 | 30.1 |
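
The per-SNR gains implied by the table can be checked with a short calculation. The snippet only restates the table values; the dictionary layout and variable names are illustrative.

```python
# PSNR in dB per SNR: (DeepJSCC, SwinJSCC, CSI-SMoE), copied from the table.
psnr_table = {
    0: (24.3, 26.1, 27.2),
    5: (25.7, 27.4, 28.6),
    10: (26.8, 28.3, 29.6),
    15: (27.4, 28.9, 30.1),
}

# Gain of CSI-SMoE over each baseline, rounded to the table's precision.
gains = {
    snr: (round(ours - deep, 1), round(ours - swin, 1))
    for snr, (deep, swin, ours) in psnr_table.items()
}

for snr, (over_deep, over_swin) in gains.items():
    print(f"SNR {snr:>2} dB: +{over_deep} dB vs DeepJSCC, +{over_swin} dB vs SwinJSCC")
```

In this setting the gain over SwinJSCC stays between 1.1 and 1.3 dB, and the gain over DeepJSCC between 2.7 and 2.9 dB.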
An ablation comparing routing signal sources shows that joint content-CSI gating achieves the best PSNR (the joint value, 29.6 dB, coincides with the 10 dB SNR entry in the table above):
| Routing Signal | PSNR (dB) |
|---|---|
| Content-Only | 28.3 |
| CSI-Only | 28.9 |
| Joint (Ours) | 29.6 |
Analysis of expert activation frequencies reports the average number of routed experts active per patch and shows that all five routed experts are utilized, which supports the effectiveness of the balancing regularizers.
7. Impact and Significance
CSI-SMoE jointly leverages instantaneous CSI and semantic content for sparse expert routing in each Transformer block, dynamically adapting both the number and identity of active experts through a data- and channel-driven Top-K mechanism. This flexible architecture overcomes the rigid coupling and limited adaptation of prior MoE-based or single-driven systems. Empirical results establish substantial improvements in PSNR and perceptual metrics at constant bandwidth and compute budget, positioning CSI-SMoE as an advanced solution for adaptive, efficient, and robust wireless image semantic communication under time-varying channels (Wan et al., 3 Apr 2026).