Yuan-TecSwin: Swin-Transformer Diffusion Model
- Yuan-TecSwin is a text-conditioned diffusion model that integrates Swin-transformer blocks within a U-Net architecture to enhance long-range semantic modeling in text-to-image synthesis.
- It replaces all convolutional blocks with Swin-transformers, enabling non-local feature extraction and effective fusion of text and image features through cross-attention mechanisms.
- Adaptive inference scheduling and optimized down/up-sampling strategies yield a 12% FID improvement, resulting in image outputs almost indistinguishable from human artwork.
Yuan-TecSwin is a text-conditioned diffusion model that incorporates Swin-transformer blocks within a U-Net-style encoder–decoder architecture, targeting improved long-range semantic modeling in text-to-image synthesis. Unlike prior paradigms relying on convolutional networks, Yuan-TecSwin directly substitutes all convolutional blocks in the encoder and decoder with Swin-transformer modules, facilitating non-local feature extraction while maintaining a strong inductive bias for vision. The model introduces a hybrid text-embedding–image feature fusion mechanism and an adaptive sampling schedule for inference, achieving state-of-the-art performance on major benchmarks and yielding images that are difficult to distinguish from human artwork (Wu et al., 18 Dec 2025).
1. Architectural Framework
Yuan-TecSwin’s architecture centers on a U-shaped encoder–bottleneck–decoder structure, replacing standard convolutional blocks with Swin-transformer blocks at every stage. The encoder comprises four hierarchical stages, each performing patch merging followed by stacked Swin blocks for feature compounding. The bottleneck employs Swin blocks configured with a global window size, increasing receptive field and global representation. The decoder mirrors the encoder with four patch-expanding Swin-based stages and employs skip connections across symmetric layers.
Downsampling in the encoder is accomplished via a 1×1 convolution followed by tensor rearrangement and layer normalization, which outperformed alternative strategies such as Swin PatchMerging. Upsampling in the decoder uses a stack of 1×1 convolution, SiLU activation, PixelShuffle, rearrangement, and layer normalization, with PixelShuffle demonstrating superior performance over PatchExpand. Text and time-step conditioning is integrated into every Swin block through three mechanisms: scale-shift modulation inside residual branches, concatenation of text/time embeddings into the key/value inputs of windowed self-attention, and dedicated cross-attention layers applied after SW-MSA (shifted-window multi-head self-attention). The model totals approximately 341 million parameters.
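The following PyTorch sketch illustrates the down/up-sampling modules as described above (1×1 convolution, rearrangement, and layer normalization for downsampling; 1×1 convolution, SiLU, PixelShuffle, rearrangement, and layer normalization for upsampling). Channel widths and the use of `pixel_unshuffle` as the rearrangement step are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ConvDownsample(nn.Module):
    """Encoder downsampling: 1x1 conv, tensor rearrangement, layer norm (sketch)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch // 4, kernel_size=1)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> project channels, then fold 2x2 spatial blocks into channels
        x = self.proj(x)
        x = nn.functional.pixel_unshuffle(x, downscale_factor=2)  # (B, out_ch, H/2, W/2)
        x = x.permute(0, 2, 3, 1)                                 # channels-last for LayerNorm
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)

class PixelShuffleUpsample(nn.Module):
    """Decoder upsampling: 1x1 conv, SiLU, PixelShuffle, rearrangement, layer norm (sketch)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch * 4, kernel_size=1)
        self.act = nn.SiLU()
        self.shuffle = nn.PixelShuffle(upscale_factor=2)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.shuffle(self.act(self.proj(x)))                  # (B, out_ch, 2H, 2W)
        x = x.permute(0, 2, 3, 1)
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)
```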
2. Diffusion Process and Training Objective
Yuan-TecSwin adopts the standard Denoising Diffusion Probabilistic Model (DDPM) formulation, modeling the forward and reverse diffusion process as follows:
Forward Process
For each time step $t$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right),$$
which yields the marginal
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),$$
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
Reverse Process
The denoising (generation) step is parameterized conditionally on the text embedding $c$:
$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, c),\ \Sigma_\theta(x_t, t, c)\right).$$
Training Objective
The model is trained to estimate the Gaussian noise injected at each step, using mean squared error:
$$\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, \mathbf{I}),\, t,\, c}\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert^2\,\right].$$
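A minimal PyTorch sketch of this objective, assuming a noise-prediction network with signature `model(x_t, t, text_emb)` (a placeholder, not the released interface):

```python
import torch

def ddpm_training_loss(model, x0, text_emb, alphas_cumprod):
    """One epsilon-prediction training step (sketch).

    model(x_t, t, text_emb) is assumed to return the predicted noise;
    alphas_cumprod holds the precomputed cumulative products alpha-bar_t.
    """
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    alpha_bar = alphas_cumprod[t].view(b, 1, 1, 1)

    noise = torch.randn_like(x0)
    # Sample x_t directly from the marginal q(x_t | x_0).
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise

    pred = model(x_t, t, text_emb)            # epsilon_theta(x_t, t, c)
    return torch.nn.functional.mse_loss(pred, noise)
```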
Classifier-Free Guidance
During inference, classifier-free guidance interpolates between conditional and unconditional outputs:
$$\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + s \left(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\right),$$
where $s$ is the guidance scale.
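A hedged sketch of the guided prediction at a single sampling step; the null-caption embedding `null_emb` and the model signature are assumptions:

```python
import torch

@torch.no_grad()
def guided_noise(model, x_t, t, text_emb, null_emb, guidance_scale=1.14):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    eps_cond = model(x_t, t, text_emb)     # epsilon_theta(x_t, t, c)
    eps_uncond = model(x_t, t, null_emb)   # epsilon_theta(x_t, t, empty caption)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```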
3. Text-Conditioning and Feature Integration
The text encoding backbone is a pre-trained Multilingual CLIP model (XLM-RoBERTa Large), handling inputs of up to 512 tokens. Text embeddings are generated by averaging selected hidden layers of the CLIP text encoder; this selection yielded the best FID in ablations. During training, 20% of captions are randomly masked to enable classifier-free guidance.
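A sketch of this text-encoding path under stated assumptions: the Hugging Face `xlm-roberta-large` checkpoint stands in for the actual Multilingual CLIP text tower, and the set of averaged layers is a placeholder, since the paper's exact layer choice is not reproduced here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint: substitute the actual Multilingual CLIP text weights in practice.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
encoder = AutoModel.from_pretrained("xlm-roberta-large")

def encode_captions(captions, layers_to_average=(-4, -3, -2, -1), drop_prob=0.2, max_len=512):
    """Average selected hidden layers; randomly blank captions for classifier-free training.

    layers_to_average is an illustrative placeholder for the layer set chosen in ablations.
    """
    # Classifier-free masking: replace ~20% of captions with the empty string.
    captions = ["" if torch.rand(1).item() < drop_prob else c for c in captions]
    batch = tokenizer(captions, padding=True, truncation=True,
                      max_length=max_len, return_tensors="pt")
    out = encoder(**batch, output_hidden_states=True)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape (B, T, D).
    stacked = torch.stack([out.hidden_states[i] for i in layers_to_average])
    return stacked.mean(dim=0)   # (B, T, D) averaged text embedding
```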
Fusion of text and image features occurs within each Swin-transformer block through several mechanisms:
- Key/Value Concatenation: Image queries attend via windowed self-attention while the text context is projected to the matching dimension and concatenated to the image keys and values within each window, so that cross-attention logits are computed alongside the self-attention logits.
- Cross-Attention Layer: After each SW-MSA, a cross-attention layer operates over the full text embedding:
  $$\mathrm{CrossAttn}(x, c) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \quad Q = x W_Q,\; K = c\, W_K,\; V = c\, W_V,$$
  where $x$ denotes the windowed image features and $c$ the text embedding.
This is followed by a multi-layer perceptron (MLP) with expansion factor 2.
- Scale-Shift: In the residual branch of every Swin block, normalized image features are modulated as
  $$h \leftarrow \gamma \odot \mathrm{LN}(h) + \beta,$$
  where the scale $\gamma$ and shift $\beta$ derive from a linear projection of the concatenated text and time embedding (a combined sketch of these three mechanisms follows this list).
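The following simplified PyTorch block gathers the three fusion mechanisms in one place. Window partitioning and shifting are omitted for brevity, and all module and parameter names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class ConditionedAttentionBlock(nn.Module):
    """Sketch of scale-shift modulation, K/V concatenation, and post-attention cross-attention.

    dim: image token width, txt_dim: text-embedding width,
    cond_dim: width of the concatenated text + time embedding.
    """
    def __init__(self, dim, txt_dim, cond_dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)   # scale/shift from text + time embedding
        self.txt_to_kv = nn.Linear(txt_dim, dim)             # project text into the K/V space
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=txt_dim,
                                                vdim=txt_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 2 * dim),
                                 nn.GELU(), nn.Linear(2 * dim, dim))  # expansion factor 2

    def forward(self, x, txt, cond):
        # x: (B, N, dim) image tokens, txt: (B, T, txt_dim), cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # scale-shift modulation

        # Key/value concatenation: text tokens joined to image tokens as extra K/V.
        kv = torch.cat([h, self.txt_to_kv(txt)], dim=1)
        h, _ = self.self_attn(h, kv, kv)
        x = x + h

        # Dedicated cross-attention over the full text embedding after (S)W-MSA.
        h, _ = self.cross_attn(self.norm2(x), txt, txt)
        x = x + h
        return x + self.mlp(x)
```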
4. Adapted Inference and Time-Step Scheduling
Yuan-TecSwin introduces an adapted time-step sampling policy for inference, motivated by mixture-of-experts strategies. A standard global search identified 190 sampling steps as optimal for unconditional ImageNet 64×64 performance. Instead of allocating these steps uniformly, the method partitions them into 19 stages of 10 substeps each and, for each stage $i$, locally searches for the optimal substep count $n_i$. Early and late diffusion stages employ finer sampling, while the mid-stages use coarser steps.
This adaptive schedule yielded an FID improvement of approximately 12% (from 1.56 to 1.37). The final inference algorithm subsamples the time steps at a stage-wise variable rate according to the optimized schedule; the paper's pseudocode illustrates this process.
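A minimal sketch of such stage-wise time-step subsampling; the training-time diffusion length and the per-stage counts below are illustrative placeholders for the values found by the local search.

```python
def build_adapted_schedule(total_steps=1000, stage_substeps=None):
    """Select sampling time steps stage by stage (sketch).

    total_steps is the training-time diffusion length (assumed here);
    stage_substeps[i] is the number of steps drawn from stage i. The paper
    uses 19 stages; the defaults below are uniform placeholders summing to 190.
    """
    if stage_substeps is None:
        stage_substeps = [10] * 19
    n_stages = len(stage_substeps)
    stage_len = total_steps // n_stages
    schedule = []
    for i, n in enumerate(stage_substeps):
        start, stop = i * stage_len, (i + 1) * stage_len
        stride = max(1, stage_len // max(1, n))
        schedule.extend(range(start, stop, stride)[:n])  # ~n evenly spaced steps in this stage
    return sorted(set(schedule), reverse=True)           # sample from high t down to low t

# Illustrative non-uniform allocation: finer early/late, coarser in the middle (sums to 190).
steps = build_adapted_schedule(
    1000, [14, 13, 12, 11, 10, 9, 8, 7, 7, 8, 7, 7, 8, 9, 10, 11, 12, 13, 14])
```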
5. Experimental Setup and Benchmark Results
Datasets and Preprocessing
- Pre-training utilized approximately 1.5 billion image–text pairs drawn, after filtering, from LAION-zh, CC12M, Wukong, Zero, and ImageNet captions.
- Fine-tuning was performed on ~92,000 human-written art prompts, comprising Chinese and Western artistic styles.
- Evaluation used ImageNet 64×64 (1.2 million training images; 50,000 generated images for FID) and MS-COCO 2014 zero-shot FID-30k (30,000 validation prompts).
Training Hyperparameters
- Batch size: 1,024 (global)
- Optimizer: Adam
- Learning rate schedule: cosine decay from 1.5e-4 to 1.5e-5 with 0.5% warm-up
- Fine-tuning: 5 epochs at 1e-6
- Swin blocks per encoder/decoder stage: [2, 2, 18, 2]
- Stage 1 hidden channels: 128
- Query dim per head: 32
- MLP expansion: 4 (self-attn), 2 (cross-attn)
- Patch/window size: 8 (no initial patch embedding; the model operates directly on the input image)
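For reference, the listed hyperparameters can be gathered into a single configuration object; the key names below are illustrative and not taken from the released code.

```python
# Illustrative configuration mirroring the reported hyperparameters.
yuan_tecswin_config = {
    "global_batch_size": 1024,
    "optimizer": "adam",
    "lr_schedule": {"type": "cosine", "max": 1.5e-4, "min": 1.5e-5, "warmup_fraction": 0.005},
    "finetune": {"epochs": 5, "lr": 1e-6},
    "swin_blocks_per_stage": [2, 2, 18, 2],
    "stage1_hidden_channels": 128,
    "query_dim_per_head": 32,
    "mlp_expansion": {"self_attn": 4, "cross_attn": 2},
    "patch_window_size": 8,
    "caption_drop_prob": 0.2,
    "guidance_scale": 1.14,
    "adapted_inference_steps": 190,
}
```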
Quantitative Outcomes
| Model | ImageNet 64×64 FID | MS-COCO FID-30k |
|---|---|---|
| Yuan-TecSwin | 1.37 | 6.201 |
| CDM (cascade) | 1.48 | — |
| ADM | 2.07 | — |
| Improved DDPM | 2.92 | — |
| BigGAN-deep | 4.06 | — |
| U-Net (CNN baseline) | — | 7.18 |
Guidance scale search found an optimal value of 1.14, and the time-step search confirmed best performance at 190 adapted steps.
Human Evaluation
In a side-by-side evaluation, 46 images (23 model-generated, 23 human-painted) at 256×256 were assessed by 124 participants in a Turing-style test. Participants identified model-created images with 51.4% accuracy, barely above chance, indicating that images produced by Yuan-TecSwin are effectively indistinguishable from human-painted works.
6. Ablation Studies and Design Insights
A series of ablation studies investigated architecture and hyperparameter choices:
- MLP Expansion Ratios: Best early FID was obtained with (self-attn, cross-attn) expansion ratios of (4, 2); larger cross-attn expansion worsened FID.
- Down/Upsampling Modules: 1×1 Conv2D with normalization outperformed Swin PatchMerging for downsampling (FID: 26.03 vs. 44.16); PixelShuffle with normalization outperformed PatchExpand for upsampling (26.03 vs. 33.15).
- Scale-Shift Placement: Best performance when scale-shift occurs inside the residual path after normalization, shift, GELU, and window attention (FID: 28.82); moving this elsewhere degrades performance (up to FID 77.6).
- Text Embedding Layer Selection: Averaging the chosen encoder layers outperformed the other layer selections tested (FID: 26.03 vs. 29.03 and 31.23).
- Cross-Attention Frequency: One cross-attn after each SW-MSA yielded optimal results; adding it after every W-MSA gave no further gain.
Replacing CNNs with Swin-transformer blocks confers markedly improved long-range feature modeling and comparable or better performance with only one-fifth the parameter count of CNN-based U-Nets. The architecture also supports more natural and effective fusion of text-conditioning signals via explicit key/value concatenation and cross-attention integration. These outcomes were systematically validated through quantitative and qualitative evaluation on standard benchmarks (Wu et al., 18 Dec 2025).