CSI-SMoE: Sparse Experts for MIMO Transmission

Updated 7 April 2026

The paper demonstrates that integrating a sparse MoE in both encoder and decoder backbones improves PSNR by up to 1.5 dB over traditional models.
CSI-SMoE dynamically selects experts using real-time channel state and semantic features, providing tailored processing for MIMO fading channels.
Empirical results indicate significant LPIPS reductions and robust performance under variable SNR and bandwidth constraints.

The CSI-Sparse Mixture-of-Experts (CSI-SMoE) framework is an adaptive semantic communication system designed for end-to-end wireless image transmission over multiple-input multiple-output (MIMO) fading channels. Its primary innovation is the integration of a sparse Mixture-of-Experts (MoE) architecture within both the semantic encoder and decoder backbones. Expert selection is governed jointly by real-time channel state information (CSI) and semantic features of image patches, so the system adapts its internal processing to both the instantaneous channel and the image content. This dual-driven routing strategy addresses the rigidity and limited robustness of conventional single-driven or fixed-model semantic communication methods (Wan et al., 3 Apr 2026).

1. End-to-End System Model and Notation

CSI-SMoE operates by transmitting an RGB image $\mathbf{x}\in\mathbb{R}^{H\times W\times 3}$ over an $N_t\times N_r$ MIMO fading channel with an average power constraint. The system consists of three key stages:

Semantic Encoder $f_{\rm enc}(\cdot)$: Maps $\mathbf{x}$ to a complex-valued matrix of transmit symbols $\mathbf{y}\in\mathbb{C}^{N_t\times K}$, with $K$ channel uses. The bandwidth ratio is defined as $R=\frac{K}{H\,W\,3}$.
MIMO Fading Channel: Modeled as $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y} + \mathbf{n}$, where $\mathbf{H}\in\mathbb{C}^{N_r\times N_t}$ is the instantaneous CSI and $\mathbf{n}\sim\mathcal{CN}(\mathbf{0},\sigma^2\mathbf{I})$ is additive complex Gaussian noise.
Semantic Decoder $f_{\rm dec}(\cdot)$: Recovers the image $\hat{\mathbf{x}}$ from $\hat{\mathbf{y}}$ and optionally $\mathbf{H}$.
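The channel stage above can be simulated in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes Rayleigh fading (i.i.d. complex Gaussian entries in the channel matrix) and enforces the average power constraint by scaling the symbols to unit mean energy. The function names `bandwidth_ratio`, `normalize_power`, `complex_gaussian`, and `mimo_channel` are hypothetical.

```python
import math
import random

def bandwidth_ratio(k, h, w):
    """R = K / (H * W * 3), as defined in the text."""
    return k / (h * w * 3)

def normalize_power(y, power=1.0):
    """Scale symbols so their mean squared magnitude equals `power` (assumed constraint)."""
    count = sum(len(row) for row in y)
    mean_energy = sum(abs(s) ** 2 for row in y for s in row) / count
    scale = math.sqrt(power / mean_energy)
    return [[s * scale for s in row] for row in y]

def complex_gaussian(rng, variance):
    """One sample of CN(0, variance): variance / 2 per real dimension."""
    std = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))

def mimo_channel(h_mat, y, sigma2, rng):
    """Return y_hat = H y + n for H of shape (N_r x N_t) and y of shape (N_t x K)."""
    n_r, n_t, k = len(h_mat), len(y), len(y[0])
    return [
        [sum(h_mat[r][t] * y[t][c] for t in range(n_t))
         + complex_gaussian(rng, sigma2)
         for c in range(k)]
        for r in range(n_r)
    ]

rng = random.Random(0)
n_t, n_r, k = 2, 2, 4
# Random symbols stand in for the encoder output.
y = normalize_power(
    [[complex_gaussian(rng, 1.0) for _ in range(k)] for _ in range(n_t)]
)
# Rayleigh fading is an assumption; the text only says "MIMO fading channel".
h_mat = [[complex_gaussian(rng, 1.0) for _ in range(n_t)] for _ in range(n_r)]
y_hat = mimo_channel(h_mat, y, sigma2=0.1, rng=rng)
```

With unit transmit power, the noise variance `sigma2` sets the SNR, so drawing a new `h_mat` per mini-batch reproduces the per-batch fading realizations used in training.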

The architectural novelty is the integration of an MoE “bottleneck” into the encoder and decoder networks, leveraging both real-time CSI features and semantic features for dynamic expert selection.

2. Adaptive MoE Swin Transformer Architecture

The core architectural unit is the Adaptive-MoE Swin-Transformer (AD-MoE ST) block. Each block processes patch-level features $\mathbf{z}$ using two sublayers:

Window-based Multi-Head Self-Attention (W-MSA/SW-MSA), in the standard Swin Transformer form with layer normalization (LN) and a residual connection:

$\hat{\mathbf{z}} = \text{(S)W-MSA}(\mathrm{LN}(\mathbf{z})) + \mathbf{z}$

Adaptive-MoE MLP: Instead of a monolithic MLP, the design employs $M_s$ always-active shared experts $E^{\rm s}_i$ and $M_r$ sparsely routed experts $E^{\rm r}_j$:

$\mathrm{MoE}(\mathbf{u}) = \sum_{i=1}^{M_s} E^{\rm s}_i(\mathbf{u}) + \sum_{j\in\mathcal{S}} p_j\, E^{\rm r}_j(\mathbf{u})$

where $\mathbf{u}$ is the input to the MLP sublayer, $\mathcal{S}$ denotes the adaptively chosen routed-expert subset, and $p_j$ are routing probabilities.
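To make the shared-plus-routed structure concrete, here is a minimal sketch of the combination step. It assumes that every shared expert is summed unweighted, that each active routed expert is weighted by its routing probability, and that the probabilities are not renormalized over the active subset. The toy experts and the names `moe_mlp` and `scale_by` are hypothetical stand-ins for small MLPs.

```python
def moe_mlp(u, shared_experts, routed_experts, probs, active):
    """Sum all shared experts plus probability-weighted active routed experts.

    u: input feature vector (list of floats)
    probs: routing probability for each routed expert
    active: indices of the routed experts selected for this patch
    """
    out = [0.0] * len(u)
    for expert in shared_experts:  # shared experts are always active
        out = [o + v for o, v in zip(out, expert(u))]
    for j in active:  # sparse part: only the selected subset is evaluated
        weighted = [probs[j] * v for v in routed_experts[j](u)]
        out = [o + v for o, v in zip(out, weighted)]
    return out

def scale_by(c):
    """Toy expert that multiplies its input by a constant (illustration only)."""
    return lambda u: [c * v for v in u]

shared = [scale_by(1.0)]
routed = [scale_by(2.0), scale_by(3.0), scale_by(4.0)]
result = moe_mlp([1.0, -1.0], shared, routed, probs=[0.5, 0.3, 0.2], active=[0, 1])
```

Only the experts listed in `active` are evaluated, which is where the compute saving of sparse routing comes from; the third routed expert is never called in this example.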

Spatial resolution is reduced between stages by "downsamplers" built from layer normalization and a fully connected layer. Patch embedding at the encoder input uses a stride-2 convolution.

3. Dynamic Expert Gating and Routing Mechanism

Expert routing within each AD-MoE MLP is determined by a dynamic gating mechanism that combines two feature vectors:

Semantic feature: Global average pooling of the patch-level features followed by a fully connected (FC) layer.
CSI feature: $\mathbf{H}$ is preprocessed by flattening or by a singular-value summary, then passed through two FC layers.

A lightweight gating network maps the combined features to one logit per routed expert, and a softmax converts the logits to selection probabilities.
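The gating step can be sketched as follows. How the semantic and CSI features are fused is not specified here, so this example assumes simple concatenation followed by a single linear layer; the weights, feature sizes, and the names `linear`, `softmax`, and `gate` are hypothetical.

```python
import math

def linear(weights, bias, x):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate(sem_feat, csi_feat, weights, bias):
    """Concatenate both feature vectors (assumed fusion) and score each routed expert."""
    fused = list(sem_feat) + list(csi_feat)
    return softmax(linear(weights, bias, fused))

# Hypothetical sizes: 2 semantic dims, 2 CSI dims, 3 routed experts.
weights = [[0.5, -0.2, 0.1, 0.0],
           [0.0, 0.3, -0.4, 0.2],
           [-0.1, 0.1, 0.2, 0.6]]
bias = [0.0, 0.0, 0.0]
probs = gate([1.0, 2.0], [0.5, -1.0], weights, bias)
```

Because the CSI feature is part of the gate input, the same image patch can receive different routing probabilities under different channel realizations.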

Threshold-based Top-K Routing: Rather than activating a fixed number of experts per patch, the active set is built by ranking the scores, starting with the top-scoring expert and adding further experts whose score gap does not exceed a threshold, up to a maximum number of active experts. The number of active experts therefore depends on both the input content and the CSI.
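The selection rule can be sketched as below. One detail is an assumption: the score gap is measured against the top score rather than against the previously added expert. The name `select_experts` and its parameters are hypothetical.

```python
def select_experts(scores, gap_threshold, max_active):
    """Return indices of the active routed experts, best first.

    Starts from the top-scoring expert and keeps adding the next-ranked
    expert while its gap to the top score stays within `gap_threshold`
    (assumed interpretation), up to `max_active` experts.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    active = [ranked[0]]
    for idx in ranked[1:]:
        if len(active) >= max_active:
            break
        if scores[ranked[0]] - scores[idx] > gap_threshold:
            break  # scores are sorted, so later experts are even further away
        active.append(idx)
    return active

# A confident gate activates one expert; a flat distribution activates more.
confident = select_experts([0.70, 0.15, 0.10, 0.05], gap_threshold=0.2, max_active=3)
uncertain = select_experts([0.30, 0.28, 0.25, 0.17], gap_threshold=0.2, max_active=3)
```

Here `confident` contains only expert 0, while `uncertain` reaches the cap of three experts, which illustrates how the number of active experts varies with the gate's certainty.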

To encourage load balancing and prevent expert collapse, three training-time MoE regularizers are incorporated: a load-balance loss, an entropy regularizer, and a variance regularizer. The empirical activation frequency of each routed expert enters these terms.
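The exact forms of the three regularizers are not reproduced here, so the following sketch uses common choices purely as assumptions: a Switch-Transformer-style load-balance term, the Shannon entropy of a routing distribution, and the variance of activation frequencies across experts. All function names are hypothetical.

```python
import math

def activation_frequency(active_sets, num_experts):
    """Fraction of patches that activated each routed expert."""
    counts = [0] * num_experts
    for active in active_sets:
        for j in active:
            counts[j] += 1
    return [c / len(active_sets) for c in counts]

def load_balance_loss(freq, mean_probs):
    """Assumed form: num_experts * sum_j freq_j * mean_prob_j."""
    return len(freq) * sum(f * p for f, p in zip(freq, mean_probs))

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of one routing distribution."""
    return -sum(p * math.log(p + eps) for p in probs)

def variance(freq):
    """Variance of activation frequencies; zero when usage is uniform."""
    mean = sum(freq) / len(freq)
    return sum((f - mean) ** 2 for f in freq) / len(freq)

# Toy routing decisions for 4 patches and 3 routed experts.
active_sets = [[0], [0, 1], [2], [1, 2]]
freq = activation_frequency(active_sets, 3)
balanced_penalty = variance(freq)
collapsed_penalty = variance([1.0, 0.0, 0.0])
```

In this toy case every expert is used by half of the patches, so the variance term vanishes, whereas a collapsed router that always picks the same expert is penalized.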

4. Semantic Encoder/Decoder Design and Loss Function

The semantic encoder backbone comprises a convolutional patch embedding, four stages (two Swin Transformer stages and two AD-MoE Swin Transformer stages), and a final fully connected normalization layer. Its hyperparameters are the stage configuration, the channel dimensions, and the expert settings: number of shared experts, number of routed experts, maximum number of active experts, and routing threshold.

The decoder is architecturally symmetric and reuses weights for AD-MoE Swin Transformer blocks.

The main training loss is the mean squared error (MSE) between the input $\mathbf{x}$ and the reconstruction $\hat{\mathbf{x}}$:

$\mathcal{L}_{\rm MSE} = \frac{1}{3HW}\,\|\mathbf{x}-\hat{\mathbf{x}}\|_2^2$

An optional perceptual (LPIPS) loss can be included. The composite objective is a weighted sum of the MSE, the optional LPIPS term, and the MoE regularizers, with the weighting coefficients tuned by cross-validation.
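As an illustration of how such an objective is typically assembled, the sketch below adds weighted auxiliary terms to the MSE. The additive form, the weight values, and the names `mse` and `composite_loss` are assumptions; the LPIPS term is passed in as a precomputed number because computing it requires a pretrained network.

```python
def mse(x, x_hat):
    """Mean squared error over flattened pixel values."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def composite_loss(mse_term, lpips_term, reg_terms, lpips_weight, reg_weights):
    """Assumed form: MSE + w_p * LPIPS + sum_k w_k * regularizer_k."""
    total = mse_term + lpips_weight * lpips_term
    for weight, term in zip(reg_weights, reg_terms):
        total += weight * term
    return total

x = [0.0, 0.5, 1.0, 0.25]        # toy "image" with 4 pixel values
x_hat = [0.0, 0.25, 1.0, 0.75]   # toy reconstruction
loss = composite_loss(
    mse(x, x_hat),
    lpips_term=0.0,              # perceptual term disabled in this example
    reg_terms=[1.0, 0.5, 0.0],   # placeholder regularizer values
    lpips_weight=0.0,
    reg_weights=[0.01, 0.01, 0.01],  # hypothetical weights
)
```

Keeping the regularizer weights small relative to the MSE term is the usual design choice, so that balancing the experts does not dominate reconstruction quality.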

5. Training Protocol and Implementation

Training is performed end-to-end with the Adam optimizer, using 900 images from DIV2K for training and 24 crops from Kodak for evaluation. Fading channel realizations are sampled for each mini-batch. CSI preprocessing entails splitting $\mathbf{H}$ into real and imaginary parts, flattening them into a $2N_tN_r$-dimensional real vector, and extracting features with two FC layers. The Swin window size is held constant.
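The CSI preprocessing step is simple enough to show directly. This sketch covers only the split-and-flatten part, before the two FC layers; the ordering (all real parts first, then all imaginary parts) is an assumption, and `csi_to_vector` is a hypothetical name.

```python
def csi_to_vector(h_mat):
    """Split a complex channel matrix into real and imaginary parts and flatten.

    For a matrix with N_r rows and N_t columns, the result is a real vector
    of length 2 * N_r * N_t: real parts first, then imaginary parts
    (the ordering is an assumption).
    """
    flat = [entry for row in h_mat for entry in row]
    return [z.real for z in flat] + [z.imag for z in flat]

# Example 2 x 2 channel matrix.
h_mat = [[1 + 2j, 3 - 1j],
         [0.5j, -2 + 0j]]
csi_vec = csi_to_vector(h_mat)
```

For this 2 x 2 example the vector has 8 real entries, which then serve as the input to the CSI feature extractor.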

6. Quantitative Evaluation and Ablation

Performance is benchmarked against DeepJSCC and SwinJSCC baselines for two MIMO antenna configurations. CSI-SMoE improves PSNR and reduces LPIPS relative to SwinJSCC at equivalent bandwidth and SNR. For one bandwidth ratio and MIMO configuration on Kodak, the PSNR results (in dB) are listed below; the gain over SwinJSCC ranges from 1.1 to 1.3 dB:

SNR (dB)	DeepJSCC	SwinJSCC	CSI-SMoE
0	24.3	26.1	27.2
5	25.7	27.4	28.6
10	26.8	28.3	29.6
15	27.4	28.9	30.1
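The per-SNR gains can be read off the table with a short script. The PSNR-to-MSE conversion uses the standard definition PSNR = 10 log10(peak^2 / MSE); the peak value of 1.0 (pixels scaled to [0, 1]) is an assumption, and the helper names are hypothetical.

```python
import math

def psnr_db(mse, peak=1.0):
    """PSNR in dB for pixel values in [0, peak]."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_from_psnr(psnr, peak=1.0):
    """Invert the PSNR definition to recover the MSE."""
    return peak ** 2 / (10.0 ** (psnr / 10.0))

# SNR (dB) -> (DeepJSCC, SwinJSCC, CSI-SMoE) PSNR in dB, copied from the table.
table = {
    0: (24.3, 26.1, 27.2),
    5: (25.7, 27.4, 28.6),
    10: (26.8, 28.3, 29.6),
    15: (27.4, 28.9, 30.1),
}

# PSNR gain of CSI-SMoE over SwinJSCC at each SNR, rounded to 0.1 dB.
gains = {snr: round(ours - swin, 1) for snr, (_, swin, ours) in table.items()}

# Relative MSE reduction implied by the gain at 10 dB SNR.
mse_ratio = mse_from_psnr(29.6) / mse_from_psnr(28.3)
```

The gains come out between 1.1 and 1.3 dB across the four SNR points. Since PSNR is logarithmic in MSE, the 1.3 dB gain at 10 dB SNR corresponds to roughly a quarter less mean squared error.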

An ablation comparing routing signal sources at a fixed SNR and bandwidth ratio shows that joint content-CSI gating achieves the best PSNR (the joint result of 29.6 dB matches the 10 dB row of the preceding table):

Routing Signal	PSNR (dB)
Content-Only	28.3
CSI-Only	28.9
Joint (Ours)	29.6

Analysis of expert activation frequencies, including the average number of routed experts active per patch, shows that all five experts are utilized, which supports the effectiveness of the balancing regularizers.

7. Impact and Significance

CSI-SMoE jointly leverages instantaneous CSI and semantic content for sparse expert routing in each Transformer block, dynamically adapting both the number and identity of active experts through a data- and channel-driven Top-K mechanism. This flexible architecture overcomes the rigid coupling and limited adaptation of prior MoE-based or single-driven systems. Empirical results establish substantial improvements in PSNR and perceptual metrics at constant bandwidth and compute budget, positioning CSI-SMoE as an advanced solution for adaptive, efficient, and robust wireless image semantic communication under time-varying channels (Wan et al., 3 Apr 2026).


References (1)

Adaptive Semantic Communication for Wireless Image Transmission Leveraging Mixture-of-Experts Mechanism (2026)
