DiT-JSCC: Diffusion Transformers in JSCC

Updated 13 January 2026
  • DiT-JSCC denotes two thematically linked deep joint source-channel coding frameworks for robust image transmission: a distributed stereo system and a semantic generative system.
  • They leverage lightweight CNN encoders, SNR-aware cross attention, and diffusion transformer decoders to fuse noisy received features and enhance semantic reconstruction.
  • Empirical results show significant gains in perceptual metrics, including roughly 21% lower LPIPS and roughly halved FID relative to traditional and deep JSCC baselines.

DiT-JSCC refers to two distinct but thematically linked frameworks for joint source-channel coded image transmission built on deep neural architectures, described in "Distributed Image Transmission using Deep Joint Source-Channel Coding" (Wang et al., 2022) and "DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations" (Tan et al., 6 Jan 2026). These systems address distributed and extreme-regime semantic communication problems, for stereo pairs or generic images respectively, by embedding domain-specific priors, cross-attention, and generative diffusion backbones to achieve robust, high-fidelity reconstruction under challenging wireless channel conditions.

1. Technical Foundations and Motivation

Deep joint source-channel coding (D-JSCC) fuses source compression and channel coding into a single, differentiable mapping that is optimized end-to-end for reconstruction quality. This approach overcomes limitations of traditional separation theorems—especially in wireless and low-SNR settings—by adapting representations directly to statistical channel conditions and input semantics.
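As a rough illustration of what such a single differentiable mapping looks like, the sketch below wires a toy encoder, an average-power constraint, an AWGN channel, and a decoder into one end-to-end PyTorch model. The layer sizes, symbol budget k, and normalization details are assumptions for illustration, not either paper's architecture.

```python
import torch
import torch.nn as nn

class DeepJSCC(nn.Module):
    """Toy end-to-end deep JSCC pipeline (illustrative only, not the papers' architectures).

    An image is encoded into 2k real values (k complex channel symbols),
    power-normalized, sent through a differentiable AWGN channel, and decoded;
    the whole chain is trained with a single reconstruction loss.
    """
    def __init__(self, k: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(                       # lightweight CNN encoder (assumed sizes)
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(2 * k),
        )
        self.decoder = nn.Sequential(                       # mirror decoder back to pixel space
            nn.Linear(2 * k, 64 * 32 * 64), nn.ReLU(),
            nn.Unflatten(1, (64, 32, 64)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, snr_db: float) -> torch.Tensor:
        s = self.encoder(x)
        # Average power constraint: scale so the mean symbol power is 1
        s = s * (s.shape[1] ** 0.5) / s.norm(dim=1, keepdim=True)
        noise_var = 10 ** (-snr_db / 10)                    # unit signal power assumed
        y = s + (noise_var ** 0.5) * torch.randn_like(s)    # AWGN channel (differentiable)
        return self.decoder(y)

model = DeepJSCC()
x = torch.rand(2, 3, 128, 256)                              # e.g. KITTI-sized crops
loss = nn.functional.mse_loss(model(x, snr_db=5.0), x)
loss.backward()
```

Because the channel is just additive noise inside the forward pass, gradients flow from the reconstruction loss back to the encoder, which is what lets the representation adapt to channel conditions.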

For distributed stereo imaging (Wang et al., 2022), the focus is on exploiting source correlations between spatially proximate or overlapping camera views, which classical separated coding cannot capitalize on without explicit modeling and added protocol overhead. In the generative paradigm (Tan et al., 6 Jan 2026), joint encoding with semantic and detail decomposition aligns deep JSCC with diffusion-based decoders, addressing a key shortcoming: reconstruction-oriented encoders fail to provide semantic cues required for reliable sampling in stochastic generative models, resulting in "realistic" but semantically distorted outputs under high channel noise or low bandwidth.

2. Model Architectures and Information Flow

Distributed stereo variant (Wang et al., 2022):

  • Edge Encoders: Each correlated camera source, $X, Y \in \mathbb{R}^n$, is mapped by a lightweight CNN encoder $f_{\theta_x}, f_{\theta_y}$ into $k$ complex channel symbols, with power normalization for transmission over two noisy, independent AWGN channels.
  • Noisy Feature Reception: The received vectors $\hat s_x, \hat s_y$ reflect per-channel SNR, known at both ends.
  • CSI-Aware Cross Attention (SCAM): The center decoder $g_\phi$ receives both feature maps and SNRs, applying cross-attention between $F_x, F_y$ via quality tokens (see the sketch after this list). Softmax-normalized cross-attention, guided by CSI, weights spatially overlapping patches in the stereo pair to exploit mutual information, dynamically down-weighting highly noisy sources.
  • Decoder: Features are recalibrated and fused in a U-Net style decoder to reconstruct both $X$ and $Y$, supporting symmetric and asymmetric SNR scenarios without retraining.

Generative semantic variant (Tan et al., 6 Jan 2026):

  • Dual-Branch Encoder: Images $x$ are decomposed into a semantic signal ($y = E_\mathrm{VFM}(x)$ using DINOv2) and a complementary detail signal. The branches are encoded by latent-domain and pixel-domain SwinJSCC encoders, producing bandwidth-controlled complex symbols ($s_s$, $s_d$).
  • Bandwidth Control: Adaptive bandwidth allocation leverages Kolmogorov complexity proxies (BLIP-2 captions: word count, lexical diversity, syntactic complexity) to balance $k_s$ and $k_d$ per instance, without retraining.
  • Diffusion Transformer Decoder: Latent diffusion (LDM) with a DiT backbone is conditioned on the received codes ($c_s$, $c_d$), applying coarse-to-fine conditioning: semantic codes guide early blocks (global semantics), detail codes the late blocks (textures, edges). Classifier-free guidance interpolates with null semantic embeddings.
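A minimal sketch of how the SNR-aware cross-attention fusion could be realized is shown below; the patch grid, feature dimensions, and the way CSI is embedded as a quality token are illustrative assumptions rather than the module published in (Wang et al., 2022).

```python
import torch
import torch.nn as nn

class SNRAwareCrossAttention(nn.Module):
    """Illustrative SNR-aware cross-attention fusion (hypothetical dimensions).

    Patches of the target view attend to patches of the side view; an SNR
    embedding ("quality token") modulates the side-view features so that
    information received over a noisier link is down-weighted by the softmax.
    """
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.snr_embed = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, f_x, f_y, snr_y_db):
        # f_x, f_y: [B, N_patches, dim] noisy feature maps from the two links
        # snr_y_db: [B, 1] channel state information for the side link
        q = f_x
        kv = f_y + self.snr_embed(snr_y_db).unsqueeze(1)    # inject CSI into the side features
        fused, _ = self.attn(q, kv, kv)                     # softmax-normalized cross-attention
        return f_x + self.out(fused)                        # residual fusion of side information

scam = SNRAwareCrossAttention()
f_x, f_y = torch.randn(2, 8 * 8, 64), torch.randn(2, 8 * 8, 64)   # 8x8 patch grid
fused = scam(f_x, f_y, snr_y_db=torch.tensor([[2.0], [10.0]]))
print(fused.shape)                                                 # torch.Size([2, 64, 64])
```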

3. Training Objectives and Optimization

Stereo DiT-JSCC is trained under mean squared error (MSE) or perceptual MS-SSIM losses:

$$\mathcal{L} = \mathbb{E}\left[\|x-\hat x\|_2^2 + \|y-\hat y\|_2^2\right]$$

or

$$\mathcal{L}_{\text{1-SSIM}} = \mathbb{E}\left[(1-\mathrm{MS\!-\!SSIM}(x,\hat x)) + (1-\mathrm{MS\!-\!SSIM}(y,\hat y))\right]$$

using Adam with batchwise randomized SNR.
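A minimal sketch of one such training step with batchwise randomized SNR, using the MSE variant of the loss, is shown below; `StereoJSCC` is a stand-in for the full encoder/SCAM-decoder pipeline, and the learning rate is an assumption (the SNR range follows the evaluation setting).

```python
import random
import torch
import torch.nn as nn

class StereoJSCC(nn.Module):
    """Stand-in for the stereo encoder / SCAM decoder pipeline described above."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)              # placeholder for encode/channel/decode
    def forward(self, x, y, snr_db):
        sigma = 10 ** (-snr_db / 20)                          # noise std for an assumed unit-power signal
        return (self.net(x + sigma * torch.randn_like(x)),
                self.net(y + sigma * torch.randn_like(y)))

model = StereoJSCC()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # learning rate is an assumption

for _ in range(10):                                           # toy loop with random data
    x, y = torch.rand(4, 3, 128, 256), torch.rand(4, 3, 128, 256)
    snr_db = random.uniform(-3.0, 14.0)                       # fresh SNR drawn per batch
    x_hat, y_hat = model(x, y, snr_db)
    loss = nn.functional.mse_loss(x_hat, x) + nn.functional.mse_loss(y_hat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```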

Diffusion DiT-JSCC is trained solely with the latent diffusion objective:

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{\mathbf z_0,\mathbf c,t,\boldsymbol\epsilon}\left\| \boldsymbol\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,\mathbf z_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\ \mathbf c_s, \mathbf c_d, t \right)\right\|_2^2$$

with no explicit reconstruction or semantic loss terms in the presented training configuration.
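This is the standard epsilon-prediction loss over noised latents, conditioned on the received codes, as in the sketch below; `eps_model` stands in for the conditioned DiT denoiser and the noise schedule is a placeholder, so this reflects the generic LDM training step rather than the paper's exact implementation.

```python
import torch

def diffusion_loss(eps_model, z0, c_s, c_d, alphas_cumprod):
    """Epsilon-prediction loss conditioned on received semantic/detail codes.

    `eps_model` stands in for the conditioned DiT denoiser; `alphas_cumprod`
    is a placeholder cumulative noise schedule (both assumed for illustration).
    """
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)  # random timestep per sample
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps        # forward-noised latent
    eps_hat = eps_model(z_t, c_s, c_d, t)                     # denoiser predicts the injected noise
    return torch.mean((eps - eps_hat) ** 2)                   # no reconstruction or semantic terms

# Toy usage with a stand-in denoiser
z0 = torch.randn(2, 4, 32, 32)
c_s, c_d = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
schedule = torch.linspace(0.999, 0.01, 1000)
loss = diffusion_loss(lambda z, cs, cd, t: torch.zeros_like(z), z0, c_s, c_d, schedule)
```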

4. Empirical Results and Metrics

Distributed stereo variant (Wang et al., 2022):

  • Datasets: KITTI Stereo 2012/2015; $128\times256$ downscaled stereo pairs.
  • Channels: AWGN (-3 to 14 dB SNR) and Rayleigh fading (perfect receiver CSI).
  • Metrics: PSNR, MS-SSIM.
  • Performance: Outperforms separation-based baselines (JPEG2000/BPG + LDPC) and parallel D-JSCC at all SNRs in MS-SSIM, holds a PSNR advantage below 10 dB SNR, and degrades gracefully where the baselines exhibit a cliff effect. Handles asymmetric SNRs and maintains its lead under Rayleigh fading.
  • Visuals: Reconstructed pairs preserve finer detail, edges, and structure compared to baselines.

Generative semantic variant (Tan et al., 6 Jan 2026):

  • Datasets: ImageNet-1K (cropped to $256\times256$ / $512\times512$).
  • Channels: AWGN/Rayleigh, SNR $\in [-5, 5]$ dB, CBR $1/384$ to $1/48$.
  • Metrics: LPIPS, DISTS, CLIP similarity, DreamSim, DINOv2, FID (perceptual realism); a minimal evaluation sketch follows at the end of this section.
  • Comparisons: BPG/LDPC, VTM/LDPC, PerCo, DiffEIC, SwinJSCC, DiffCom, DiffJSCC.
  • Key Quantitative Summary:
| Method | LPIPS↓ | DISTS↓ | CLIP↑ | DreamSim↓ | DINOv2↑ | FID↓ |
|---|---|---|---|---|---|---|
| NTSCC+DiffCom | 0.211 | 0.191 | 0.936 | 0.143 | 0.900 | 107 |
| SwinJSCC | 0.261 | 0.233 | 0.915 | 0.178 | 0.878 | 138 |
| DiT-JSCC | 0.166 | 0.151 | 0.963 | 0.097 | 0.958 | 60 |

Relative to the strongest listed baseline (NTSCC+DiffCom), DiT-JSCC achieves roughly a 21% LPIPS reduction, roughly halves FID, and delivers notable improvements in semantic similarity scores.

  • Qualitative Outcomes: Only DiT-JSCC maintains object shape and semantic content at SNR = 0 dB and CBR = 1/96; other diffusion-based generative JSCC methods lose key semantic fidelity under identical conditions.
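For reference, the snippet below shows how a perceptual metric such as LPIPS is typically computed with the public `lpips` package; this is a generic evaluation sketch with placeholder images, not the papers' evaluation code.

```python
import torch
import lpips  # pip install lpips; generic evaluation utility, not the papers' code

loss_fn = lpips.LPIPS(net='alex')                           # AlexNet-based perceptual distance

ref = torch.rand(1, 3, 256, 256) * 2 - 1                    # placeholder reference image in [-1, 1]
recon = (ref + 0.05 * torch.randn_like(ref)).clamp(-1, 1)   # placeholder reconstruction

with torch.no_grad():
    d = loss_fn(ref, recon)                                 # expects [N, 3, H, W] tensors in [-1, 1]
print(f"LPIPS: {d.item():.3f}")                             # lower is better
```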

5. Innovations and Mechanistic Insights

  • Distributed Stereo JSCC: The SNR-aware cross attention module (SCAM) fuses noisy representations, leveraging patch-level statistical dependencies and channel state information without extra transmission overhead. Patchwise fusion enables side-information utilization from correlated sources, generalized to independently noisy links.
  • Generative Semantic JSCC: The dual-branch VFM-driven encoder bridges the semantic gap between JSCC encoders and diffusion decoders. Coarse-to-fine DiT conditioning orders information flow by semantic importance, while classifier-free guidance stabilizes generative sampling.
  • Kolmogorov Complexity-Inspired Bandwidth Allocation: Adaptive, training-free bandwidth allocation realizes an efficient semantic-detail tradeoff per image, guided by computational proxies for instance complexity rather than model retraining or ad hoc heuristics (a minimal proxy computation is sketched below).
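A minimal sketch of such caption-based complexity proxies and a resulting bandwidth split is given below; the specific proxy formulas, their weights, and the mapping from the score to ($k_s$, $k_d$) are illustrative assumptions, and the caption would come from BLIP-2 in the actual system.

```python
# Minimal sketch of caption-based complexity proxies and a bandwidth split.
# The proxy weights and the score-to-budget mapping are illustrative assumptions.

def complexity_proxies(caption: str) -> float:
    words = caption.lower().split()
    word_count = len(words)
    lexical_diversity = len(set(words)) / max(word_count, 1)          # type-token ratio
    syntactic_complexity = caption.count(",") + caption.count(" and ")  # crude clause count
    # Combine proxies into a single score in [0, 1] (weights are assumptions)
    return min(1.0, 0.02 * word_count + 0.5 * lexical_diversity + 0.1 * syntactic_complexity)

def allocate_bandwidth(caption: str, k_total: int = 512, k_s_min: int = 64) -> tuple[int, int]:
    """Split the total symbol budget between semantic (k_s) and detail (k_d) branches."""
    score = complexity_proxies(caption)
    k_s = int(k_s_min + score * (k_total // 2 - k_s_min))   # assumed: more complex scene -> more semantic symbols
    return k_s, k_total - k_s

print(allocate_bandwidth("a red car parked on a busy street, with pedestrians and shops"))
```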

6. Limitations, Open Questions, and Future Directions

  • Coarse patch alignment (an $8\times8$ grid) underlies distributed stereo fusion; misaligned or unrectified views may require new alignment modules.
  • Decoder complexity (cross-attention, generative diffusion) is nontrivial compared to classical or vanilla D-JSCC systems; lighter fusion and decoding strategies remain an open area.
  • Extension to multi-source settings (e.g., sensor networks, multi-view, multimodal) is plausible, but not yet empirically demonstrated.
  • Principled semantic-value estimation (moving beyond text-based Kolmogorov-complexity surrogates) and integration of emergent VFMs (e.g., DINOv3) represent productive future research trajectories.
  • Application to video and multimodal channels is anticipated as generative semantic communication frameworks mature.

7. Broader Implications and Positioning

DiT-JSCC—spanning distributed and semantic-generative variants—establishes a unified direction for deep JSCC methodologies emphasizing robust, perception-aligned transmission under extreme wireless conditions. By co-designing encoders with explicit semantic priors and employing advanced cross-attention and generative architectures (transformers, diffusion models), these frameworks define contemporary best practices for high-fidelity image transmission, with clear superiority over separation-based and pixel-loss-trained deep coding baselines across semantic and perceptual metrics (Wang et al., 2022, Tan et al., 6 Jan 2026). A plausible implication is that DiT-JSCC methodologies can be generalized to broader communication contexts where semantic fidelity and efficiency are critical.

References

  • Wang et al. (2022). "Distributed Image Transmission using Deep Joint Source-Channel Coding."
  • Tan et al. (6 Jan 2026). "DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations."
