
Se-HiLo: Robust Semantic Image Transmission

Updated 31 December 2025
  • Se-HiLo is a noise-resilient semantic communication framework that employs finite scalar quantization and transformer-based high–low frequency decomposition to maintain transmission quality under AWGN.
  • It separates image representations into high and low frequencies, enabling tailored quantization that preserves both global structure and fine details.
  • Benchmarking on the Animal-10 dataset shows Se-HiLo outperforms leading baselines, achieving significant improvements in PSNR and SSIM across diverse SNR regimes.

Se-HiLo is a noise-resilient semantic communication framework for image transmission that couples a Finite Scalar Quantization (FSQ) module with a transformer-based high-and-low frequency decomposition. Developed to address the limitations of adversarial training and noise injection in semantic communication, Se-HiLo enforces encoded representations on predefined grids to enable robust transmission under unpredictable additive white Gaussian noise (AWGN), while transformer-driven decomposition preserves representational diversity. The architecture achieves quantifiable noise resilience, outperforms leading baselines across diverse SNR regimes, and obviates the need for complex retraining under new noise profiles (Xi et al., 10 Mar 2025).

1. Motivation and Problem Setting

Semantic communication systems leverage deep learning to encode information at the level of meaning, replacing classical bit-level transmission. In practical deployments, learned semantic vectors are subject to unpredictable semantic noise—arising from device interference, channel jamming, or errant post-processing. Such disturbances can push semantic codes into regions unrecognized by the decoder, sharply degrading reconstruction accuracy or functional reliability. Traditional approaches resort to adversarial training or synthetic noise injection, which results in limited adaptability (real-world noise diverges from synthetic distributions) and considerable computational cost, requiring retraining for each new noise profile. Se-HiLo proposes an alternative that directly enhances the noise resilience of encoded representations without adversarial or noise-injection procedures.

2. Finite Scalar Quantization (FSQ) and Noise Resilience

The FSQ module forms the core of Se-HiLo’s robustness. It quantizes each scalar of the semantic vector $x = [x_1, x_2, \ldots, x_N]$ onto a uniform grid of $m_i$ levels per dimension:

$$v_{i,j} = L + j \cdot \Delta_i, \quad j = 0, 1, \ldots, m_i - 1$$

where $\Delta_i = (U - L)/(m_i - 1)$.
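A quick numeric check of the grid construction above (the bounds $L = -1$, $U = 1$ and $m_i = 5$ are illustrative values, not taken from the paper):

```python
# Illustrative grid for one FSQ dimension; L=-1, U=1, m_i=5 are assumed
# example values, not from the paper.
L_bound, U_bound, m_i = -1.0, 1.0, 5

delta = (U_bound - L_bound) / (m_i - 1)             # grid spacing Δ_i
levels = [L_bound + j * delta for j in range(m_i)]  # levels v_{i,j}

print(delta)   # 0.5
print(levels)  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```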

A raw input $z \in \mathbb{R}$ is transformed via bounded scaling and offset adjustment, followed by quantization:

  • $h = ((L-U)(1+\epsilon))/2$, with small $\epsilon$
  • Offset: $o = 0.5$ if $m_i$ is even, $o = 0$ if odd
  • Shift: $s = \operatorname{arctanh}(o/h)$
  • $z_{\text{bounded}} = \tanh(z + s) \cdot h - o$
  • $z_{\text{scaled}} = \alpha \cdot z_{\text{bounded}}$
  • Quantization: $z_{\text{round}} = \operatorname{round}(z_{\text{scaled}}/\Delta) \cdot \Delta$
  • Normalization for the decoder: $z_Q = z_{\text{round}} / (\alpha \cdot h)$

The resulting quantizer $Q(\cdot)$ is differentiable almost everywhere, using straight-through estimation for gradient propagation.
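The steps above can be sketched in NumPy. This is a minimal single-dimension sketch, not the paper’s implementation: the bounds, level count, and the reading of $h$ as the positive half-range $(U-L)(1+\epsilon)/2$ (assuming $U > L$) are illustrative assumptions, and the straight-through estimator is only indicated in a comment since no autograd framework is used here.

```python
import numpy as np

# Minimal FSQ sketch for one dimension, following the bulleted steps.
# Bounds L=-1, U=1 and m=5 levels are assumed example values; h is taken
# as the positive half-range (U - L)(1 + eps)/2, assuming U > L.
def fsq_quantize(z, L=-1.0, U=1.0, m=5, alpha=1.0, eps=1e-3):
    h = (U - L) * (1 + eps) / 2.0
    o = 0.5 if m % 2 == 0 else 0.0       # offset: 0.5 for even m, 0 for odd
    s = np.arctanh(o / h)                # shift keeping the grid symmetric
    z_bounded = np.tanh(z + s) * h - o   # squash the raw input into bounds
    z_scaled = alpha * z_bounded
    delta = (U - L) / (m - 1)            # grid spacing Δ
    z_round = np.round(z_scaled / delta) * delta  # snap to nearest level
    return z_round / (alpha * h)         # normalize for the decoder

# In training, the straight-through estimator treats round() as identity:
# z_q = z_scaled + stop_gradient(z_round - z_scaled).
z = np.array([-2.0, -0.3, 0.0, 0.4, 3.0])
print(fsq_quantize(z))
```

Inputs that differ by less than half a grid step map to the same code, which is the robustness property quantified below.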

This design sacrifices some representational diversity, since every dimension is limited to $m_i$ levels, but guarantees that any perturbation smaller than $\Delta/2$ does not alter the quantized code. The resilience probability under AWGN with standard deviation $\sigma$ is

$$P_{\text{correct}} = \left[\operatorname{erf}\!\left(\frac{\Delta}{2\sqrt{2}\,\sigma}\right)\right]^N$$

To reduce the exponential sensitivity to $N$, Se-HiLo decomposes the representation into high- and low-frequency branches, each quantized separately.
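The resilience probability can be evaluated directly with `math.erf`; the values of $\Delta$, $\sigma$, and $N$ below are illustrative assumptions:

```python
import math

# P_correct = [erf(Δ / (2·sqrt(2)·σ))]^N, as in the formula above.
def p_correct(delta, sigma, N):
    return math.erf(delta / (2 * math.sqrt(2) * sigma)) ** N

delta, sigma = 0.5, 0.1
p_full = p_correct(delta, sigma, 256)   # one 256-dimensional code
p_half = p_correct(delta, sigma, 128)   # one 128-dimensional branch
print(p_full, p_half)
```

Halving the exponent raises each branch’s survival probability; since the branches are quantized separately, each can also use its own grid spacing.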

3. Transformer-Based High-and-Low Frequency Decomposition

Se-HiLo’s encoder leverages a transformer-based decomposition strategy, termed the HiLo Block, inspired by prior work on accelerated vision transformers. The feature tensor $X \in \mathbb{R}^{H \times W \times D}$ is split along the channel dimension: $D = D_{\text{Hi}} + D_{\text{Lo}}$.

  • Low-frequency branch: downsamples $X_{\text{Lo}}$ spatially, flattens it to tokens, and applies a Lo-ViT block with global cross-attention.
  • High-frequency branch: retains full spatial granularity and processes $X_{\text{Hi}}$ with Hi-ViT blocks implementing local self-attention.
  • The branch outputs are upsampled and concatenated to yield $X_{\text{out}} \in \mathbb{R}^{H \times W \times D}$.

After $K$ successive HiLo Blocks, patch embeddings $P \in \mathbb{R}^{N \times D}$ are mapped into branch-specific FSQ spaces via linear projections, $T_{\text{Hi}} \in \mathbb{R}^{N \times D_F}$ and $T_{\text{Lo}} \in \mathbb{R}^{N \times D_F}$, each quantized individually as $\tilde{T}_{\text{Hi}} = Q_{\text{Hi}}(T_{\text{Hi}})$ and $\tilde{T}_{\text{Lo}} = Q_{\text{Lo}}(T_{\text{Lo}})$. This separation allows distinct quantization settings, preserving capacity for both global structure and fine detail.
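A shape-level sketch of one HiLo Block follows. Attention is omitted entirely: the average-pool downsample, nearest-neighbor upsample, 1:1 channel split, and all tensor sizes are illustrative stand-ins for the Lo-ViT / Hi-ViT processing, not the paper’s implementation.

```python
import numpy as np

# Shape-level sketch of one HiLo Block (attention omitted; the branch
# transforms below are placeholders for Lo-ViT / Hi-ViT blocks).
H, W, D = 16, 16, 64
D_hi, D_lo = D // 2, D // 2                  # channel split D = D_hi + D_lo
X = np.random.randn(H, W, D)
X_hi, X_lo = X[..., :D_hi], X[..., D_hi:]

# Low-frequency branch: 2x spatial downsample (average pooling), where the
# globally attended Lo-ViT tokens would live, then upsample by repetition.
X_lo_small = X_lo.reshape(H // 2, 2, W // 2, 2, D_lo).mean(axis=(1, 3))
X_lo_up = X_lo_small.repeat(2, axis=0).repeat(2, axis=1)

# High-frequency branch keeps full spatial resolution (local self-attention
# would be applied here).
X_out = np.concatenate([X_hi, X_lo_up], axis=-1)
print(X_out.shape)   # (16, 16, 64)
```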

4. Complete Se-HiLo Processing Pipeline

Se-HiLo integrates its components in a multi-stage image processing channel:

  • Encoding: The input image $I \in \mathbb{R}^{H \times W \times C}$ is patchified and embedded. $K$ HiLo Blocks produce $P^K$, which is split into $P_{\text{Hi}}$ and $P_{\text{Lo}}$. Linear projections yield $T_{\text{Hi}}$ and $T_{\text{Lo}}$; FSQ is applied to both branches.
  • Channel Transmission: The serialized codes $\tilde{T}_{\text{Hi}}$ and $\tilde{T}_{\text{Lo}}$ are modulated and transmitted; the received versions are perturbed by AWGN ($\eta_{\text{Hi}}$, $\eta_{\text{Lo}}$).
  • Decoding: Dequantization ($\text{DeQ}$) reconstructs $T_{\text{Hi,rec}}$ and $T_{\text{Lo,rec}}$; the outputs are concatenated, and reverse HiLo Blocks reconstruct the pixel grid $\hat{I}$.
  • Training Loss: End-to-end optimization minimizes

$$L = L_{\text{recon}} + \lambda_1 L_{\text{perceptual}} + \lambda_2 L_{\text{adv}}$$

with $L_{\text{recon}} = \|\hat{I} - I\|_2^2$, $L_{\text{perceptual}} = \|\phi(\hat{I}) - \phi(I)\|_2^2$ (where $\phi$ denotes VGG features), and $L_{\text{adv}}$ the standard GAN loss.
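A minimal sketch of the composite objective, with mean-squared stand-ins for the squared norms, a fixed random projection standing in for the VGG feature extractor $\phi$, a scalar placeholder for the GAN term, and assumed weights $\lambda_1$, $\lambda_2$:

```python
import numpy as np

# Sketch of the training objective. phi is NOT real VGG features, and
# lam1/lam2/l_adv are assumed placeholders, not values from the paper.
rng = np.random.default_rng(0)
proj = rng.standard_normal((64, 16))      # fixed random "feature" map

def phi(img):                             # stand-in feature extractor
    return img.reshape(-1, 64) @ proj

def total_loss(I_hat, I, l_adv=0.0, lam1=0.1, lam2=0.01):
    l_recon = np.mean((I_hat - I) ** 2)           # reconstruction term
    l_perc = np.mean((phi(I_hat) - phi(I)) ** 2)  # perceptual term
    return l_recon + lam1 * l_perc + lam2 * l_adv

I = rng.standard_normal((8, 8, 64))
I_hat = I + 0.1 * rng.standard_normal(I.shape)    # noisy reconstruction
print(total_loss(I_hat, I))
```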

5. Benchmarking and Empirical Results

Experiments employed the Animal-10 dataset (≈28,000 images, 10 classes), optimizing for 50 epochs using Adam ($\text{lr} = 10^{-4}$, batch size 32, dropout 0.1), with quantization levels $m_{\text{Hi}} = m_{\text{Lo}} = [5, 5, 5, 5, 5]$, feature dimensions $D_{\text{Hi}} = D_{\text{Lo}} = 256$, and $K = 8$ HiLo Blocks.
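One consequence of this configuration: with levels $m = [5, 5, 5, 5, 5]$ per branch, each token’s quantized code is drawn from a grid of $\prod_i m_i$ points, a standard property of FSQ grids:

```python
import math

# Effective per-token, per-branch codebook implied by m_Hi = m_Lo = [5,5,5,5,5]:
# the FSQ grid has prod(m_i) distinct codes.
m = [5, 5, 5, 5, 5]
print(math.prod(m))   # 3125 distinct codes
```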

Comparisons include TiTok, VQGAN, ViT-VQGAN, and MoVQGAN, evaluated across SNR values {10, 5, 0, –5} dB using PSNR and SSIM metrics.

Entries are PSNR (dB) / SSIM:

SNR (dB)   TiTok         VQGAN         ViT-VQGAN     MoVQGAN       Se-HiLo
10         18.2 / 0.65   22.5 / 0.78   23.1 / 0.80   23.5 / 0.82   27.8 / 0.89
5          16.0 / 0.58   20.0 / 0.72   20.8 / 0.75   21.1 / 0.76   25.0 / 0.86
0          13.5 / 0.50   17.0 / 0.65   17.8 / 0.69   18.2 / 0.70   22.0 / 0.82
–5         11.0 / 0.42   14.0 / 0.58   14.3 / 0.61   15.0 / 0.63   20.0 / 0.79

Se-HiLo exhibits substantial improvements over all baselines, especially in high-noise scenarios. At 0 dB SNR, competing systems yield heavily blurred or artifact-laden reconstructions, whereas Se-HiLo preserves both global shape and texture even at –5 dB.

6. Component Analysis and Ablation Studies

Analysis of noise impact reveals distinct roles for spectral branches: low-frequency noise induces blurring and loss of global form, while high-frequency disruption degrades color and texture. Balanced quantization and decoding for both branches maximizes perceptual fidelity.

Ablative evaluation indicates that FSQ alone increases robustness over VQGAN but incurs a ~2 dB PSNR penalty in noise-free conditions due to restricted diversity. Incorporating the HiLo decomposition recovers approximately 1.5 dB of clean PSNR while largely retaining noise resilience. A 1:1 channel split between the high- and low-frequency branches yields the best results; deviating from it degrades reconstruction quality.

7. Limitations and Prospects

FSQ, by construction, restricts continuous representation and can clip outliers. The HiLo Block’s frequency separation is implicit; incorporating explicit transforms (DCT, wavelets) could increase interpretability and control. Dynamically scheduled scaling ($\alpha$) or adaptive quantization levels ($m_i$) may further optimize the trade-off between robustness and expressiveness.

A plausible implication is that the modularity of Se-HiLo’s quantization and decomposition can be further generalized to other forms of semantic source data beyond images. The potential for explicit basis separation and adaptive quantization strategies suggests opportunities for future research in semantic resilient transmission under variable real-world noise conditions (Xi et al., 10 Mar 2025).

