Yuan-TecSwin: Swin-Transformer Diffusion Model

Updated 25 December 2025
  • Yuan-TecSwin is a text-conditioned diffusion model that integrates Swin-transformer blocks within a U-Net architecture to enhance long-range semantic modeling in text-to-image synthesis.
  • It replaces all convolutional blocks with Swin-transformers, enabling non-local feature extraction and effective fusion of text and image features through cross-attention mechanisms.
  • Adaptive inference scheduling yields an FID improvement of approximately 12% (1.56 → 1.37), and human evaluation finds the generated images nearly indistinguishable from human artwork.

Yuan-TecSwin is a text-conditioned diffusion model that incorporates Swin-transformer blocks within a U-Net-style encoder–decoder architecture, targeting improved long-range semantic modeling in text-to-image synthesis. Unlike prior paradigms relying on convolutional networks, Yuan-TecSwin directly substitutes all convolutional blocks in the encoder and decoder with Swin-transformer modules, facilitating non-local feature extraction while maintaining a strong inductive bias for vision. The model introduces a hybrid text-embedding–image feature fusion mechanism and an adaptive sampling schedule for inference, achieving state-of-the-art performance on major benchmarks and yielding images that are difficult to distinguish from human artwork (Wu et al., 18 Dec 2025).

1. Architectural Framework

Yuan-TecSwin’s architecture centers on a U-shaped encoder–bottleneck–decoder structure, replacing standard convolutional blocks with Swin-transformer blocks at every stage. The encoder comprises four hierarchical stages, each performing patch merging followed by stacked Swin blocks for feature compounding. The bottleneck employs Swin blocks configured with a global window size, increasing receptive field and global representation. The decoder mirrors the encoder with four patch-expanding Swin-based stages and employs skip connections across symmetric layers.

Downsampling in the encoder is accomplished via a 1×1 convolution followed by tensor rearrangement and layer normalization, which outperformed alternative strategies such as Swin PatchMerging. Upsampling in the decoder uses a stack of 1×1 convolution, SiLU activation, PixelShuffle, rearrangement, and layer normalization, with PixelShuffle demonstrating superior performance over PatchExpand. Text and time-step conditioning is integrated into every Swin block through three mechanisms: scale-shift modulation inside residual branches, concatenation of text/time embeddings into the key/value inputs of windowed self-attention, and dedicated cross-attention layers applied after SW-MSA (shifted-window multi-head self-attention). Model size totals approximately 341 million parameters.
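
A minimal PyTorch sketch of these down/up-sampling stacks is given below. The paper specifies only the operator order; the channel scaling factors, tensor layouts, and module names here are assumptions, and `einops` is used for the rearrangement step.

```python
import torch
import torch.nn as nn
from einops import rearrange

class ConvDownsample(nn.Module):
    """1x1 conv -> 2x2 space-to-depth rearrange -> LayerNorm (halves H, W; doubles C)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim // 2, kernel_size=1)  # assumed channel split so the
        self.norm = nn.LayerNorm(dim * 2)                    # 4x spatial fold yields 2x channels

    def forward(self, x):                                    # x: (B, C, H, W)
        x = self.proj(x)
        x = rearrange(x, "b c (h p1) (w p2) -> b h w (p1 p2 c)", p1=2, p2=2)
        return self.norm(x)                                  # (B, H/2, W/2, 2C), channels-last tokens

class PixelShuffleUpsample(nn.Module):
    """1x1 conv -> SiLU -> PixelShuffle(2) -> rearrange to tokens -> LayerNorm."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim * 2, kernel_size=1)   # 2*dim channels -> dim/2 after shuffle
        self.act = nn.SiLU()
        self.shuffle = nn.PixelShuffle(2)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x):                                    # x: (B, C, H, W)
        x = self.shuffle(self.act(self.proj(x)))             # (B, C/2, 2H, 2W)
        x = rearrange(x, "b c h w -> b h w c")
        return self.norm(x)                                  # (B, 2H, 2W, C/2)
```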

2. Diffusion Process and Training Objective

Yuan-TecSwin adopts the standard Denoising Diffusion Probabilistic Model (DDPM) formulation, modeling the forward and reverse diffusion process as follows:

Forward Process

For each time step $t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

which yields the marginal

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right)$$

where $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.
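
In code, the marginal allows jumping directly to any noise level. A minimal PyTorch sketch, assuming image tensors of shape (B, C, H, W) and a standard linear beta schedule (the paper's exact schedule is not specified here):

```python
import torch

# Assumed standard DDPM linear schedule; the paper's exact beta schedule is not given here.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, alpha_bar, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)            # per-sample cumulative alpha, broadcast over C, H, W
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise, noise
```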

Reverse Process

The denoising (generation) step is parameterized conditionally on the text $y$:

$$p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, y),\ \Sigma_\theta(t)\right)$$
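
A sketch of one reverse step, reusing `betas` and `alpha_bar` from above; the epsilon-parameterized mean and the fixed variance $\Sigma_\theta(t) = \beta_t I$ follow the standard DDPM recipe and are assumptions about the implementation, as is the denoiser signature.

```python
import torch

def p_sample(model, x_t, t, text_emb, betas, alpha_bar):
    """One reverse step x_{t-1} ~ p_theta(x_{t-1} | x_t, y), using the usual DDPM
    epsilon parameterization of the mean and Sigma_theta(t) = beta_t * I (a common fixed choice)."""
    t_batch = torch.full((x_t.size(0),), t, device=x_t.device, dtype=torch.long)
    eps = model(x_t, t_batch, text_emb)                      # assumed denoiser signature
    alpha_t, beta_t, abar_t = 1.0 - betas[t], betas[t], alpha_bar[t]
    mean = (x_t - beta_t / (1.0 - abar_t).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                                          # no noise added at the final step
    return mean + beta_t.sqrt() * torch.randn_like(x_t)
```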

Training Objective

The model is trained to estimate the Gaussian noise $\epsilon$ injected at each step, using a mean-squared error:

$$L(\theta) = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\left[\left\|\epsilon - \epsilon_\theta(x_t, t, y)\right\|^2\right]$$
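
A sketch of this objective, reusing `q_sample` from above and assuming a denoiser with signature `model(x_t, t, text_emb)`:

```python
import torch

def ddpm_loss(model, x0, text_emb, alpha_bar):
    """MSE between the injected noise and the model's prediction eps_theta(x_t, t, y)."""
    t = torch.randint(0, alpha_bar.numel(), (x0.size(0),), device=x0.device)
    x_t, noise = q_sample(x0, t, alpha_bar.to(x0.device))
    eps_pred = model(x_t, t, text_emb)                       # assumed denoiser signature
    return torch.nn.functional.mse_loss(eps_pred, noise)
```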

Classifier-Free Guidance

During inference, classifier-free guidance interpolates between conditional and unconditional outputs:

$$\epsilon_\theta^{\mathrm{guided}}(x_t) = (1+w)\,\epsilon_\theta(x_t \mid y) - w\,\epsilon_\theta(x_t \mid \varnothing)$$

where $w$ is the guidance scale.
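
A minimal sketch of the guided noise prediction, where `null_emb` stands for the embedding of the masked/empty caption used for the unconditional branch (the names and signature are assumptions):

```python
def guided_eps(model, x_t, t, text_emb, null_emb, w):
    """Classifier-free guidance: (1 + w) * eps(x_t | y) - w * eps(x_t | null caption)."""
    eps_cond = model(x_t, t, text_emb)
    eps_uncond = model(x_t, t, null_emb)                     # embedding of the empty/masked caption
    return (1.0 + w) * eps_cond - w * eps_uncond
```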

3. Text-Conditioning and Feature Integration

The text encoding backbone is a pre-trained Multilingual CLIP model (XLM-RoBERTa Large), handling inputs of up to 512 tokens. Text embeddings are generated by averaging layers $\{1, 23, 24\}$ of the CLIP text encoder; this selection yielded the best FID in ablations. Each batch of embeddings has shape $\mathrm{batch} \times 512 \times 256$. During training, 20% of captions are randomly masked (classifier-free masking).
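
A sketch of the layer averaging and classifier-free caption masking, assuming the encoder exposes per-layer hidden states as a tuple (the indexing convention, tensor shapes, and function names are assumptions, not the paper's implementation):

```python
import torch

def pooled_text_embedding(hidden_states, layers=(1, 23, 24)):
    """Average the selected encoder layers. Indexing convention is an assumption:
    hidden_states[0] = embedding output, hidden_states[i] = output of transformer layer i."""
    return torch.stack([hidden_states[i] for i in layers], dim=0).mean(dim=0)

def mask_captions(text_emb, null_emb, p_drop=0.2):
    """Classifier-free masking: replace ~20% of caption embeddings with the null embedding.
    text_emb: (B, 512, 256); null_emb: (1, 512, 256) or (512, 256) -- shapes are assumptions."""
    drop = torch.rand(text_emb.size(0), device=text_emb.device) < p_drop
    return torch.where(drop.view(-1, 1, 1), null_emb, text_emb)
```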

Fusion of text and image features occurs within each Swin-transformer block through several mechanisms:

  • Key/Value Concatenation: Image queries compute windowed self-attention logits $A_{\mathrm{img}}$, while the text context is projected to the matching dimension for keys/values; cross-attention logits $A_c$ are computed, and $[A_{\mathrm{img}} \parallel A_c]$ and $[V_{\mathrm{img}} \parallel V_c]$ are concatenated per window.
  • Cross-Attention Layer: After each SW-MSA, a cross-attention layer operates over the full text embedding:

$$\mathrm{CrossAttn}(X, Q_c, K_c, V_c) = \mathrm{softmax}\!\left(Q_c K_c^{\top}/\sqrt{d}\right) V_c$$

This is followed by a multi-layer perceptron (MLP) with expansion factor 2.

  • Scale-Shift: In the residual branch of every Swin block, normalized image features $u$ are modulated (see the sketch after this list):
    1. $u = \operatorname{LayerNorm}(x)$
    2. $u = u \odot (\mathrm{scale} + 1) + \mathrm{shift}$, where scale/shift derive from a linear projection of the concatenated text and time embedding
    3. $u = \mathrm{GELU}(u)$
    4. $x \leftarrow x + \mathrm{WindowAttn}(u)$
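
The sketch below illustrates the scale-shift-conditioned residual branch (steps 1–4 above); `nn.MultiheadAttention` stands in for the actual (shifted-)window attention, and the conditioning dimensionality and class name are assumptions.

```python
import torch
import torch.nn as nn

class ConditionedWindowBranch(nn.Module):
    """Scale-shift-conditioned residual branch of a Swin block (steps 1-4 above).
    nn.MultiheadAttention stands in for the actual (shifted-)window attention."""
    def __init__(self, dim, cond_dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)   # from concat(text, time) embedding
        self.act = nn.GELU()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, cond):
        # x: (batch*windows, tokens, dim); cond: (batch*windows, cond_dim)
        u = self.norm(x)                                     # 1. LayerNorm
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim=-1)
        u = u * (scale + 1.0) + shift                        # 2. scale-shift modulation
        u = self.act(u)                                      # 3. GELU
        attn_out, _ = self.attn(u, u, u)                     # 4. window attention on modulated features
        return x + attn_out                                  #    added back to the residual stream
```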

4. Adapted Inference and Time-Step Scheduling

Yuan-TecSwin introduces an adapted time-step sampling policy for inference, motivated by mixture-of-experts strategies. A standard global search identified $T^* = 190$ as optimal for unconditional ImageNet 64×64 performance. Instead of allocating these steps uniformly, the method partitions them into 19 stages of 10 substeps each and, for each stage $i$, locally searches for the optimal substep count $n_i \in [1, 10]$. Early and late diffusion stages employ finer sampling, while the mid-stages use coarser steps.

This adaptive schedule yielded an FID improvement of approximately 12% (from 1.56 to 1.37). The final inference algorithm subsamples at a stage-wise variable rate according to the optimized $n_i$ schedule; the paper presents pseudocode for this procedure.
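
A hedged sketch of such a stage-wise schedule is shown below; the paper's optimized per-stage counts $n_i$ are not reproduced, so the defaults are placeholders.

```python
import torch

def adapted_timesteps(total_steps=190, stage_size=10, substeps_per_stage=None):
    """Stage-wise variable-rate subsampling of the T* = 190 inference steps.
    The optimized per-stage counts n_i from the paper's local search are not reproduced;
    the default placeholder simply keeps every step."""
    num_stages = total_steps // stage_size                   # 19 stages of 10 steps each
    if substeps_per_stage is None:
        substeps_per_stage = [stage_size] * num_stages       # placeholder n_i in [1, 10]
    kept = set()
    for i, n_i in enumerate(substeps_per_stage):
        lo, hi = i * stage_size, (i + 1) * stage_size - 1
        # keep n_i roughly evenly spaced time steps inside stage i
        kept.update(torch.linspace(lo, hi, steps=n_i).round().long().tolist())
    return sorted(kept, reverse=True)                        # reverse diffusion runs from high t to low t
```

For instance, `adapted_timesteps(substeps_per_stage=[10]*3 + [5]*13 + [10]*3)` samples the mid-stages more coarsely while keeping the earliest and latest stages fine-grained; the specific counts here are illustrative, not the paper's.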

5. Experimental Setup and Benchmark Results

Datasets and Preprocessing

  • Pre-training utilized approximately 1.5 billion image–text pairs (sources: LAION-zh, CC12M, Wukong, Zero, and ImageNet captions, after filtering).
  • Fine-tuning was performed on ~92,000 human-written art prompts, comprising Chinese and Western artistic styles.
  • Evaluation used ImageNet 64×64 (1.2 million training images; 50,000 generated samples for FID) and MS-COCO 2014 zero-shot FID-30k (30,000 validation prompts).

Training Hyperparameters

  • Batch size: 1,024 (global)
  • Optimizer: Adam
  • Learning rate schedule: cosine decay from 1.5e-4 to 1.5e-5 with 0.5% warm-up
  • Fine-tuning: 5 epochs at 1e-6
  • Swin blocks per encoder/decoder stage: [2, 2, 18, 2]
  • Stage 1 hidden channels: 128
  • Query dim per head: 32
  • MLP expansion: 4 (self-attn), 2 (cross-attn)
  • Patch/window size: 8 (no initial patch embedding; input is 64×64)
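
For reference, these settings can be collected into a single configuration sketch; the values are copied from the list above, but the field names are assumptions rather than the authors' code.

```python
# Hypothetical configuration summary of the reported training setup (field names assumed).
YUAN_TECSWIN_CONFIG = {
    "global_batch_size": 1024,
    "optimizer": "Adam",
    "lr_schedule": {"type": "cosine", "lr_max": 1.5e-4, "lr_min": 1.5e-5, "warmup_frac": 0.005},
    "finetune": {"epochs": 5, "lr": 1e-6},
    "swin_blocks_per_stage": [2, 2, 18, 2],   # encoder/decoder stages
    "stage1_hidden_channels": 128,
    "query_dim_per_head": 32,
    "mlp_expansion": {"self_attn": 4, "cross_attn": 2},
    "patch_window_size": 8,                   # no initial patch embedding
    "input_resolution": 64,                   # 64x64 inputs
}
```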

Quantitative Outcomes

Model                  ImageNet 64×64 FID   MS-COCO FID-30k
Yuan-TecSwin           1.37                 6.201
CDM (cascade)          1.48                 —
ADM                    2.07                 —
Improved DDPM          2.92                 —
BigGAN-deep            4.06                 —
U-Net (CNN baseline)   7.18                 —

Guidance scale search found an optimal value at 1.14, and time-step search confirmed best performance at 190 adapted steps.

Human Evaluation

In a side-by-side evaluation, 46 images (23 model-generated, 23 human-painted) at 256×256 were assessed by 124 participants in a Turing-style test. The accuracy of identifying model-created images was 51.4%, indicating that images produced by Yuan-TecSwin are effectively indistinguishable from real human works.

6. Ablation Studies and Design Insights

A series of ablation studies investigated architecture and hyperparameter choices:

  • MLP Expansion Ratios: The best early FID was obtained with expansion ratios of 4 (self-attn) and 2 (cross-attn); larger cross-attn expansion worsened FID.
  • Down/Upsampling Modules: 1×1 Conv2D with normalization outperformed Swin PatchMerging for downsampling (FID: 26.03 vs. 44.16); PixelShuffle with normalization outperformed PatchExpand for upsampling (26.03 vs. 33.15).
  • Scale-Shift Placement: Best performance when scale-shift occurs inside the residual path after normalization, shift, GELU, and window attention (FID: 28.82); moving this elsewhere degrades performance (up to FID 77.6).
  • Text Embedding Layer Selection: Averaging layers $\{1, 22, 24\}$ outperformed other selections (FID: 26.03 vs. 29.03 or 31.23).
  • Cross-Attention Frequency: One cross-attn after each SW-MSA yielded optimal results; adding it after every W-MSA gave no further gain.

Replacing CNNs with Swin-transformer blocks confers markedly improved long-range feature modeling and better or comparable performance with only one-fifth the parameter count of CNN-based U-Nets. The architecture also supports more natural and effective fusion of text conditioning signals via explicit key/value concatenation and cross-attention integration. These outcomes were systematically validated through quantitative and qualitative evaluation on standard benchmarks (Wu et al., 18 Dec 2025).
