
CSI-SMoE: Sparse Experts for MIMO Transmission

Updated 7 April 2026
  • The paper demonstrates that integrating a sparse MoE in both encoder and decoder backbones improves PSNR by up to 1.5 dB over traditional models.
  • CSI-SMoE dynamically selects experts using real-time channel state and semantic features, providing tailored processing for MIMO fading channels.
  • Empirical results indicate significant LPIPS reductions and robust performance under variable SNR and bandwidth constraints.

The CSI-Sparse Mixture-of-Experts (CSI-SMoE) framework is an adaptive semantic communication system designed for end-to-end wireless image transmission over multiple-input multiple-output (MIMO) fading channels. Its primary innovation is the integration of a sparse Mixture-of-Experts (MoE) architecture within both the semantic encoder and decoder backbones. Expert selection is governed jointly by real-time channel state information (CSI) and semantic features of image patches, so the system adapts its internal processing to instantaneous variations in both channel and content. This dual-driven routing strategy addresses the rigidity and limited robustness of conventional single-driven or fixed-model semantic communication methods (Wan et al., 3 Apr 2026).

1. End-to-End System Model and Notation

CSI-SMoE transmits an RGB image $\mathbf{x}\in\mathbb{R}^{H\times W\times 3}$ over an $N_t\times N_r$ MIMO fading channel under an average power constraint. The system consists of three key stages:

  1. Semantic Encoder $f_{\rm enc}(\cdot)$: Maps $\mathbf{x}$ to a complex-valued matrix of transmit symbols $\mathbf{y}\in\mathbb{C}^{N_t\times K}$, with $K$ channel uses. The bandwidth ratio is defined as $R=\frac{K}{H\,W\,3}$.
  2. MIMO Fading Channel: Modeled as $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y} + \mathbf{n}$, where $\mathbf{H}\in\mathbb{C}^{N_r\times N_t}$ is the instantaneous CSI and $\mathbf{n}\sim\mathcal{CN}(\mathbf{0},\sigma^2\mathbf{I})$ is additive complex Gaussian noise.
  3. Semantic Decoder $f_{\rm dec}(\cdot)$: Recovers the image $\hat{\mathbf{x}}$ from the received symbols $\hat{\mathbf{y}}$ and optionally the CSI $\mathbf{H}$.
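The channel stage can be sketched in NumPy as follows. This is a minimal illustration, not the authors' simulator: i.i.d. Rayleigh fading, normalization to unit average symbol power, and the convention SNR = 1/σ² are all assumptions, and the function names `mimo_channel` and `bandwidth_ratio` are hypothetical.

```python
import numpy as np

def mimo_channel(y, snr_db, n_rx, rng):
    """Apply y_hat = H y + n to transmit symbols y of shape (N_t, K)."""
    n_tx, k = y.shape
    # Assumption: enforce the average power constraint as unit power per symbol.
    y = y / np.sqrt(np.mean(np.abs(y) ** 2))
    # Assumption: i.i.d. Rayleigh fading, entries of H drawn from CN(0, 1).
    h = (rng.standard_normal((n_rx, n_tx))
         + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
    # Assumption: with unit signal power, the noise variance is 1 / SNR.
    sigma2 = 10 ** (-snr_db / 10)
    n = np.sqrt(sigma2 / 2) * (rng.standard_normal((n_rx, k))
                               + 1j * rng.standard_normal((n_rx, k)))
    return h @ y + n, h, sigma2

def bandwidth_ratio(k, height, width):
    """R = K / (H * W * 3), as defined for the semantic encoder."""
    return k / (height * width * 3)

rng = np.random.default_rng(0)
y = rng.standard_normal((2, 64)) + 1j * rng.standard_normal((2, 64))
y_hat, h, sigma2 = mimo_channel(y, snr_db=10.0, n_rx=2, rng=rng)
```

Sampling a fresh `h` on every call mirrors the idea of drawing a new fading realization per transmission; the decoder would receive `y_hat` and, optionally, `h`.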

The architectural novelty is the integration of an MoE “bottleneck” into the encoder and decoder networks, which uses both real-time CSI features and semantic features for dynamic expert selection.

2. Adaptive MoE Swin Transformer Architecture

The core architectural unit is the Adaptive-MoE Swin-Transformer (AD-MoE ST) block. Each block processes patch-level features using two sublayers: the windowed self-attention of the Swin Transformer and an Adaptive-MoE MLP that takes the place of the usual feed-forward layer.

  • Adaptive-MoE MLP: Instead of a monolithic MLP, the design employs a set of always-active shared experts and a set of sparsely routed experts. The sublayer output combines the outputs of all shared experts with those of an adaptively chosen subset of routed experts, the latter weighted by their routing probabilities.
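One way to write the block is sketched below. The notation is assumed for illustration and is not taken from the source: $\mathbf{z}$ is the block input, $E^{\rm s}_i$ and $E^{\rm r}_j$ are the shared and routed experts, $N_s$ and $N_e$ are their counts, $\mathcal{S}$ is the active routed subset, and $p_j$ are the routing probabilities. The pre-norm residual layout of a standard Swin Transformer block is also an assumption.

```latex
\begin{aligned}
\mathbf{z}'  &= \mathbf{z} + \text{W-MSA}\big(\mathrm{LN}(\mathbf{z})\big),\\
\mathbf{u}   &= \mathrm{LN}(\mathbf{z}'),\\
\mathbf{z}'' &= \mathbf{z}' + \sum_{i=1}^{N_s} E^{\rm s}_i(\mathbf{u})
              + \sum_{j\in\mathcal{S}} p_j\,E^{\rm r}_j(\mathbf{u}),
\qquad \mathcal{S}\subseteq\{1,\dots,N_e\}.
\end{aligned}
```

Only the experts in $\mathcal{S}$ are evaluated, so the per-patch cost grows with $|\mathcal{S}|$ rather than with the total number of routed experts.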

Spatial resolution is reduced between stages by downsampling layers built from layer normalization and a fully connected layer. Patch embedding at the encoder input uses a stride-2 convolution.

3. Dynamic Expert Gating and Routing Mechanism

Expert routing within each AD-MoE MLP is determined by a dynamic gating mechanism that combines a semantic feature vector with a CSI feature vector:

  • Semantic feature: Global average pooling of the patch-level features followed by a fully connected layer.
  • CSI feature: $\mathbf{H}$ is preprocessed by flattening or by a singular-value summary, then passed through two FC layers.

A lightweight gating network maps the combined features to one logit per routed expert, and a softmax converts the logits to selection probabilities.
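The gating path can be sketched in NumPy as below. Several details are assumptions rather than facts from the source: the two feature vectors are fused by concatenation, the gate is a single linear layer, a ReLU sits between the two CSI layers, and the CSI is flattened rather than summarized by singular values. The function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def gating_probs(patch_feats, h, params):
    """patch_feats: (num_patches, C) real features; h: (N_r, N_t) complex CSI."""
    # Semantic feature: global average pooling over patches, then one FC layer.
    f_sem = patch_feats.mean(axis=0) @ params["w_sem"]
    # CSI feature: split real/imaginary parts, flatten, then two FC layers.
    h_vec = np.concatenate([h.real.ravel(), h.imag.ravel()])
    f_csi = np.maximum(h_vec @ params["w_csi1"], 0.0) @ params["w_csi2"]
    # Assumption: fuse by concatenation and score experts with a linear gate.
    logits = np.concatenate([f_sem, f_csi]) @ params["w_gate"]
    return softmax(logits)

rng = np.random.default_rng(0)
params = {
    "w_sem": rng.standard_normal((32, 8)),
    "w_csi1": rng.standard_normal((8, 16)),   # 8 = 2 * N_r * N_t for a 2x2 channel
    "w_csi2": rng.standard_normal((16, 8)),
    "w_gate": rng.standard_normal((16, 5)),   # one logit per routed expert
}
patch_feats = rng.standard_normal((16, 32))
h = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
probs = gating_probs(patch_feats, h, params)
```

Because `h` enters the gate directly, the same image content can be routed to different experts under different channel realizations.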

Threshold-based Top-K Routing: Rather than activating a fixed number of experts, the active set is chosen by ranking the scores, starting with the top-scoring expert and adding further experts whose score gap does not exceed a threshold, up to a maximum number of experts. This makes the number of active experts depend on both the input content and the CSI.
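A minimal sketch of this selection rule follows. It assumes the score gap is measured against the top score (the text does not say whether the gap is to the top score or to the previously selected expert), and the names `threshold_topk`, `tau`, and `k_max` are hypothetical.

```python
import numpy as np

def threshold_topk(probs, tau, k_max):
    """Return indices of active experts, best first."""
    order = np.argsort(-probs, kind="stable")  # experts ranked by descending score
    active = [int(order[0])]                   # the top expert is always active
    for idx in order[1:]:
        if len(active) >= k_max:
            break                              # cap on the number of active experts
        # Assumption: the gap is taken relative to the top score.
        if probs[order[0]] - probs[idx] > tau:
            break                              # ranked list, so later gaps only grow
        active.append(int(idx))
    return active

close_scores = threshold_topk(np.array([0.40, 0.35, 0.15, 0.07, 0.03]), tau=0.1, k_max=3)
peaked_scores = threshold_topk(np.array([0.90, 0.05, 0.03, 0.01, 0.01]), tau=0.1, k_max=3)
flat_scores = threshold_topk(np.array([0.2, 0.2, 0.2, 0.2, 0.2]), tau=0.1, k_max=3)
# close_scores == [0, 1], peaked_scores == [0], flat_scores == [0, 1, 2]
```

A confident gate activates a single expert, while an ambiguous one spreads the work over several, up to the cap.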

To encourage load balancing and prevent expert collapse, three training-time MoE regularizers are incorporated:

  • Load-balance loss.
  • Entropy regularizer.
  • Variance regularizer, which involves the empirical activation frequency of the experts.
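The exact expressions for these terms are not given here. The sketch below uses common formulations as stand-ins, each of which is an assumption: a Switch-Transformer-style load-balance product, the negative entropy of the average routing distribution, and the variance of activation frequencies across experts. The function name is hypothetical.

```python
import numpy as np

def moe_regularizers(probs, active_mask):
    """probs: (B, N) routing probabilities; active_mask: (B, N) 0/1 selections."""
    n_experts = probs.shape[1]
    freq = active_mask.mean(axis=0)   # empirical activation frequency per expert
    mean_prob = probs.mean(axis=0)    # average routing probability per expert
    # Assumed form: frequency-probability product, scaled by the expert count.
    load_balance = n_experts * np.sum(freq * mean_prob)
    # Assumed form: negative entropy, lowest when average usage is uniform.
    neg_entropy = np.sum(mean_prob * np.log(mean_prob + 1e-12))
    # Assumed form: variance of activation frequencies, zero when usage is even.
    variance = np.var(freq)
    return load_balance, neg_entropy, variance

# Balanced routing: four inputs, each sent to a different expert.
balanced = moe_regularizers(np.full((4, 4), 0.25), np.eye(4))
# Collapsed routing: every input sent to expert 0.
one_hot = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (4, 1))
collapsed = moe_regularizers(one_hot, one_hot)
```

All three terms are smaller for the balanced case than for the collapsed one, which is the behavior a penalty against expert collapse needs.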

4. Semantic Encoder/Decoder Design and Loss Function

The semantic encoder backbone comprises a convolutional patch embedding, four stages (two Swin Transformer stages and two AD-MoE Swin-Transformer stages), and a final fully connected normalization layer. Its hyperparameters are the per-stage configuration, the channel dimensions, and the expert settings: the number of shared experts, the number of routed experts, the maximum number of active experts, and the routing threshold.

The decoder is architecturally symmetric and reuses weights for AD-MoE Swin Transformer blocks.

The main training loss is the mean squared error (MSE) between the original image $\mathbf{x}$ and its reconstruction. An optional perceptual (LPIPS) loss can be included. The composite objective is a weighted combination of loss terms, with four weighting coefficients tuned by cross-validation.
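One form consistent with this description is given below. It assumes that each of the four weights scales one auxiliary term (the perceptual loss and the three MoE regularizers) and that the MSE is averaged over all $3HW$ pixel values; $\hat{\mathbf{x}}$ denotes the reconstructed image. The subscripts and the pairing of weights to terms are illustrative, not taken from the source.

```latex
\mathcal{L} = \mathcal{L}_{\rm MSE}
  + \lambda_{\rm perc}\,\mathcal{L}_{\rm perc}
  + \lambda_{\rm lb}\,\mathcal{L}_{\rm lb}
  + \lambda_{\rm ent}\,\mathcal{L}_{\rm ent}
  + \lambda_{\rm var}\,\mathcal{L}_{\rm var},
\qquad
\mathcal{L}_{\rm MSE} = \frac{1}{3HW}\,\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2 .
```

Setting $\lambda_{\rm perc}=0$ recovers training without the optional perceptual term.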

5. Training Protocol and Implementation

Training is performed end-to-end with the Adam optimizer, using 900 images from DIV2K for training and 24 crops from Kodak for evaluation. Fading channel realizations are sampled for each mini-batch. CSI preprocessing consists of splitting $\mathbf{H}$ into real and imaginary parts, flattening them into a $2N_rN_t$-dimensional vector, and extracting features with two FC layers. A fixed Swin window size is used.

6. Quantitative Evaluation and Ablation

Performance is benchmarked against DeepJSCC and SwinJSCC baselines for two MIMO configurations. CSI-SMoE improves PSNR and reduces LPIPS relative to SwinJSCC at equal bandwidth and SNR. Representative PSNR results (in dB) on Kodak, at a fixed bandwidth ratio $R$ and MIMO configuration, are:

| SNR (dB) | DeepJSCC | SwinJSCC | CSI-SMoE |
| --- | --- | --- | --- |
| 0 | 24.3 | 26.1 | 27.2 |
| 5 | 25.7 | 27.4 | 28.6 |
| 10 | 26.8 | 28.3 | 29.6 |
| 15 | 27.4 | 28.9 | 30.1 |
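The per-SNR gains implied by the table can be checked with a few lines of Python. The values are copied from the table; the variable names are illustrative.

```python
# PSNR in dB, keyed by SNR in dB, copied from the table above.
psnr = {
    0:  {"DeepJSCC": 24.3, "SwinJSCC": 26.1, "CSI-SMoE": 27.2},
    5:  {"DeepJSCC": 25.7, "SwinJSCC": 27.4, "CSI-SMoE": 28.6},
    10: {"DeepJSCC": 26.8, "SwinJSCC": 28.3, "CSI-SMoE": 29.6},
    15: {"DeepJSCC": 27.4, "SwinJSCC": 28.9, "CSI-SMoE": 30.1},
}

# Gain of CSI-SMoE over each baseline, rounded to the table's precision.
gain_vs_swin = {snr: round(row["CSI-SMoE"] - row["SwinJSCC"], 1) for snr, row in psnr.items()}
gain_vs_deep = {snr: round(row["CSI-SMoE"] - row["DeepJSCC"], 1) for snr, row in psnr.items()}

print(gain_vs_swin)  # {0: 1.1, 5: 1.2, 10: 1.3, 15: 1.2}
print(gain_vs_deep)  # {0: 2.9, 5: 2.9, 10: 2.8, 15: 2.7}
```

In this setting the gain over SwinJSCC stays between 1.1 and 1.3 dB across the SNR range, and the gain over DeepJSCC between 2.7 and 2.9 dB.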

An ablation comparing routing signal sources shows that joint content-CSI gating achieves the best PSNR:

| Routing Signal | PSNR (dB) |
| --- | --- |
| Content-Only | 28.3 |
| CSI-Only | 28.9 |
| Joint (Ours) | 29.6 |

Analysis of expert activation frequencies shows that all five experts are utilized, which is consistent with the balancing regularizers working as intended.

7. Impact and Significance

CSI-SMoE jointly leverages instantaneous CSI and semantic content for sparse expert routing in each Transformer block, dynamically adapting both the number and identity of active experts through a data- and channel-driven Top-K mechanism. This flexible architecture overcomes the rigid coupling and limited adaptation of prior MoE-based or single-driven systems. Empirical results establish substantial improvements in PSNR and perceptual metrics at constant bandwidth and compute budget, positioning CSI-SMoE as an advanced solution for adaptive, efficient, and robust wireless image semantic communication under time-varying channels (Wan et al., 3 Apr 2026).
