Yuan-TecSwin: Swin-Transformer Diffusion Model

Updated 25 December 2025
  • Yuan-TecSwin is a text-conditioned diffusion model that integrates Swin-transformer blocks within a U-Net architecture to enhance long-range semantic modeling in text-to-image synthesis.
  • It replaces all convolutional blocks with Swin-transformers, enabling non-local feature extraction and effective fusion of text and image features through cross-attention mechanisms.
  • Adaptive inference scheduling yields an FID improvement of approximately 12% (1.56 → 1.37), and human evaluation finds the generated images nearly indistinguishable from human artwork.

Yuan-TecSwin is a text-conditioned diffusion model that incorporates Swin-transformer blocks within a U-Net-style encoder–decoder architecture, targeting improved long-range semantic modeling in text-to-image synthesis. Unlike prior paradigms relying on convolutional networks, Yuan-TecSwin directly substitutes all convolutional blocks in the encoder and decoder with Swin-transformer modules, facilitating non-local feature extraction while maintaining a strong inductive bias for vision. The model introduces a hybrid text-embedding–image feature fusion mechanism and an adaptive sampling schedule for inference, achieving state-of-the-art performance on major benchmarks and yielding images that are difficult to distinguish from human artwork (Wu et al., 18 Dec 2025).

1. Architectural Framework

Yuan-TecSwin’s architecture centers on a U-shaped encoder–bottleneck–decoder structure, replacing standard convolutional blocks with Swin-transformer blocks at every stage. The encoder comprises four hierarchical stages, each performing patch merging followed by stacked Swin blocks for feature compounding. The bottleneck employs Swin blocks configured with a global window size, increasing receptive field and global representation. The decoder mirrors the encoder with four patch-expanding Swin-based stages and employs skip connections across symmetric layers.

Downsampling in the encoder is accomplished via a 1×1 convolution followed by tensor rearrangement and layer normalization, which outperformed alternative strategies such as Swin PatchMerging. Upsampling in the decoder uses a stack of 1×1 convolution, SiLU activation, PixelShuffle, rearrangement, and layer normalization, with PixelShuffle demonstrating superior performance over PatchExpand. Text and time-step conditioning is integrated into every Swin block through three mechanisms: scale-shift modulation inside residual branches, concatenation of text/time embeddings into the key/value inputs of windowed self-attention, and dedicated cross-attention layers applied after SW-MSA (shifted-window multi-head self-attention). Model size totals approximately 341 million parameters.
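
A minimal PyTorch sketch of these down/up-sampling stacks is given below. The paper specifies only the operator order; the channel scaling factors, tensor layouts, and module names here are assumptions, and `einops` is used for the rearrangement step.

```python
import torch
import torch.nn as nn
from einops import rearrange

class ConvDownsample(nn.Module):
    """1x1 conv -> 2x2 space-to-depth rearrange -> LayerNorm (halves H, W; doubles C)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim // 2, kernel_size=1)  # assumed channel split so the
        self.norm = nn.LayerNorm(dim * 2)                    # 4x spatial fold yields 2x channels

    def forward(self, x):                                    # x: (B, C, H, W)
        x = self.proj(x)
        x = rearrange(x, "b c (h p1) (w p2) -> b h w (p1 p2 c)", p1=2, p2=2)
        return self.norm(x)                                  # (B, H/2, W/2, 2C), channels-last tokens

class PixelShuffleUpsample(nn.Module):
    """1x1 conv -> SiLU -> PixelShuffle(2) -> rearrange to tokens -> LayerNorm."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim * 2, kernel_size=1)   # 2*dim channels -> dim/2 after shuffle
        self.act = nn.SiLU()
        self.shuffle = nn.PixelShuffle(2)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x):                                    # x: (B, C, H, W)
        x = self.shuffle(self.act(self.proj(x)))             # (B, C/2, 2H, 2W)
        x = rearrange(x, "b c h w -> b h w c")
        return self.norm(x)                                  # (B, 2H, 2W, C/2)
```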

2. Diffusion Process and Training Objective

Yuan-TecSwin adopts the standard Denoising Diffusion Probabilistic Model (DDPM) formulation, modeling the forward and reverse diffusion process as follows:

Forward Process

For each time step $t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

which yields the marginal

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right)$$

where $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.
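
In code, the marginal allows jumping directly to any noise level. A minimal PyTorch sketch, assuming image tensors of shape (B, C, H, W) and a standard linear beta schedule (the paper's exact schedule is not specified here):

```python
import torch

# Assumed standard DDPM linear schedule; the paper's exact beta schedule is not given here.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, alpha_bar, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)            # per-sample cumulative alpha, broadcast over C, H, W
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise, noise
```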

Reverse Process

The denoising (generation) step is parameterized conditionally on the text $y$:

$$p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, y),\ \Sigma_\theta(t)\right)$$
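
A sketch of one reverse step, reusing `betas` and `alpha_bar` from above; the epsilon-parameterized mean and the fixed variance $\Sigma_\theta(t) = \beta_t I$ follow the standard DDPM recipe and are assumptions about the implementation, as is the denoiser signature.

```python
import torch

def p_sample(model, x_t, t, text_emb, betas, alpha_bar):
    """One reverse step x_{t-1} ~ p_theta(x_{t-1} | x_t, y), using the usual DDPM
    epsilon parameterization of the mean and Sigma_theta(t) = beta_t * I (a common fixed choice)."""
    t_batch = torch.full((x_t.size(0),), t, device=x_t.device, dtype=torch.long)
    eps = model(x_t, t_batch, text_emb)                      # assumed denoiser signature
    alpha_t, beta_t, abar_t = 1.0 - betas[t], betas[t], alpha_bar[t]
    mean = (x_t - beta_t / (1.0 - abar_t).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                                          # no noise added at the final step
    return mean + beta_t.sqrt() * torch.randn_like(x_t)
```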

Training Objective

The model is trained to estimate the Gaussian noise $\epsilon$ injected at each step, using a mean-squared error:

$$L(\theta) = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\left[\left\|\epsilon - \epsilon_\theta(x_t, t, y)\right\|^2\right]$$
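
A sketch of this objective, reusing `q_sample` from above and assuming a denoiser with signature `model(x_t, t, text_emb)`:

```python
import torch

def ddpm_loss(model, x0, text_emb, alpha_bar):
    """MSE between the injected noise and the model's prediction eps_theta(x_t, t, y)."""
    t = torch.randint(0, alpha_bar.numel(), (x0.size(0),), device=x0.device)
    x_t, noise = q_sample(x0, t, alpha_bar.to(x0.device))
    eps_pred = model(x_t, t, text_emb)                       # assumed denoiser signature
    return torch.nn.functional.mse_loss(eps_pred, noise)
```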

Classifier-Free Guidance

During inference, classifier-free guidance interpolates between conditional and unconditional outputs:

$$\epsilon_\theta^{\mathrm{guided}}(x_t) = (1+w)\,\epsilon_\theta(x_t \mid y) - w\,\epsilon_\theta(x_t \mid \varnothing)$$

where $w$ is the guidance scale.
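
A minimal sketch of the guided noise prediction, where `null_emb` stands for the embedding of the masked/empty caption used for the unconditional branch (the names and signature are assumptions):

```python
def guided_eps(model, x_t, t, text_emb, null_emb, w):
    """Classifier-free guidance: (1 + w) * eps(x_t | y) - w * eps(x_t | null caption)."""
    eps_cond = model(x_t, t, text_emb)
    eps_uncond = model(x_t, t, null_emb)                     # embedding of the empty/masked caption
    return (1.0 + w) * eps_cond - w * eps_uncond
```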

3. Text-Conditioning and Feature Integration

The text encoding backbone is a pre-trained Multilingual CLIP model (XLM-RoBERTa Large), handling inputs of up to 512 tokens. Text embeddings are generated by averaging layers $\{1, 23, 24\}$ of the CLIP text encoder; this selection yielded the best FID in ablations. Each batch of embeddings has shape $\mathrm{batch} \times 512 \times 256$. During training, 20% of captions are randomly masked (classifier-free masking).
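
A sketch of the layer averaging and classifier-free caption masking, assuming the encoder exposes per-layer hidden states as a tuple (the indexing convention, tensor shapes, and function names are assumptions, not the paper's implementation):

```python
import torch

def pooled_text_embedding(hidden_states, layers=(1, 23, 24)):
    """Average the selected encoder layers. Indexing convention is an assumption:
    hidden_states[0] = embedding output, hidden_states[i] = output of transformer layer i."""
    return torch.stack([hidden_states[i] for i in layers], dim=0).mean(dim=0)

def mask_captions(text_emb, null_emb, p_drop=0.2):
    """Classifier-free masking: replace ~20% of caption embeddings with the null embedding.
    text_emb: (B, 512, 256); null_emb: (1, 512, 256) or (512, 256) -- shapes are assumptions."""
    drop = torch.rand(text_emb.size(0), device=text_emb.device) < p_drop
    return torch.where(drop.view(-1, 1, 1), null_emb, text_emb)
```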

Fusion of text and image features occurs within each Swin-transformer block through several mechanisms:

  • Key/Value Concatenation: Image queries compute windowed self-attention logits $A_{\mathrm{img}}$, while the text context is projected to the matching dimension for keys/values; cross-attention logits $A_c$ are computed, and $[A_{\mathrm{img}} \parallel A_c]$ and $[V_{\mathrm{img}} \parallel V_c]$ are concatenated per window.
  • Cross-Attention Layer: After each SW-MSA, a cross-attention layer operates over the full text embedding:

$$\mathrm{CrossAttn}(X, Q_c, K_c, V_c) = \mathrm{softmax}\!\left(Q_c K_c^{\top}/\sqrt{d}\right) V_c$$

This is followed by a multi-layer perceptron (MLP) with expansion factor 2.

  • Scale-Shift: In the residual branch of every Swin block, normalized image features $u$ are modulated (see the sketch after this list):
    1. $u = \operatorname{LayerNorm}(x)$
    2. $u = u \odot (\mathrm{scale} + 1) + \mathrm{shift}$, where scale/shift derive from a linear projection of the concatenated text and time embedding
    3. $u = \mathrm{GELU}(u)$
    4. $x \leftarrow x + \mathrm{WindowAttn}(u)$
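
The sketch below illustrates the scale-shift-conditioned residual branch (steps 1–4 above); `nn.MultiheadAttention` stands in for the actual (shifted-)window attention, and the conditioning dimensionality and class name are assumptions.

```python
import torch
import torch.nn as nn

class ConditionedWindowBranch(nn.Module):
    """Scale-shift-conditioned residual branch of a Swin block (steps 1-4 above).
    nn.MultiheadAttention stands in for the actual (shifted-)window attention."""
    def __init__(self, dim, cond_dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)   # from concat(text, time) embedding
        self.act = nn.GELU()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, cond):
        # x: (batch*windows, tokens, dim); cond: (batch*windows, cond_dim)
        u = self.norm(x)                                     # 1. LayerNorm
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim=-1)
        u = u * (scale + 1.0) + shift                        # 2. scale-shift modulation
        u = self.act(u)                                      # 3. GELU
        attn_out, _ = self.attn(u, u, u)                     # 4. window attention on modulated features
        return x + attn_out                                  #    added back to the residual stream
```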

4. Adapted Inference and Time-Step Scheduling

Yuan-TecSwin introduces an adapted time-step sampling policy for inference, motivated by mixture-of-experts strategies. A standard global search identified $T^* = 190$ as optimal for unconditional ImageNet 64×64 performance. Instead of allocating these steps uniformly, the method partitions them into 19 stages of 10 substeps each and, for each stage $i$, locally searches for the optimal substep count $n_i \in [1, 10]$. Early and late diffusion stages employ finer sampling, while the mid-stages use coarser steps.

This adaptive schedule yielded an FID improvement of approximately 12% (from 1.56 to 1.37). The final inference algorithm subsamples at a stage-wise variable rate according to the optimized $n_i$ schedule; the paper presents pseudocode for this procedure.
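
A hedged sketch of such a stage-wise schedule is shown below; the paper's optimized per-stage counts $n_i$ are not reproduced, so the defaults are placeholders.

```python
import torch

def adapted_timesteps(total_steps=190, stage_size=10, substeps_per_stage=None):
    """Stage-wise variable-rate subsampling of the T* = 190 inference steps.
    The optimized per-stage counts n_i from the paper's local search are not reproduced;
    the default placeholder simply keeps every step."""
    num_stages = total_steps // stage_size                   # 19 stages of 10 steps each
    if substeps_per_stage is None:
        substeps_per_stage = [stage_size] * num_stages       # placeholder n_i in [1, 10]
    kept = set()
    for i, n_i in enumerate(substeps_per_stage):
        lo, hi = i * stage_size, (i + 1) * stage_size - 1
        # keep n_i roughly evenly spaced time steps inside stage i
        kept.update(torch.linspace(lo, hi, steps=n_i).round().long().tolist())
    return sorted(kept, reverse=True)                        # reverse diffusion runs from high t to low t
```

For instance, `adapted_timesteps(substeps_per_stage=[10]*3 + [5]*13 + [10]*3)` samples the mid-stages more coarsely while keeping the earliest and latest stages fine-grained; the specific counts here are illustrative, not the paper's.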

5. Experimental Setup and Benchmark Results

Datasets and Preprocessing

  • Pre-training utilized approximately 1.5 billion image–text pairs (sources: LAION-zh, CC12M, Wukong, Zero, and ImageNet captions, after filtering).
  • Fine-tuning was performed on ~92,000 human-written art prompts, comprising Chinese and Western artistic styles.
  • Evaluation used ImageNet 64×64 (1.2 million training images; 50,000 generated samples for FID) and MS-COCO 2014 zero-shot FID-30k (30,000 validation prompts).

Training Hyperparameters

  • Batch size: 1,024 (global)
  • Optimizer: Adam
  • Learning rate schedule: cosine decay from 1.5e-4 to 1.5e-5 with 0.5% warm-up
  • Fine-tuning: 5 epochs at 1e-6
  • Swin blocks per encoder/decoder stage: [2, 2, 18, 2]
  • Stage 1 hidden channels: 128
  • Query dim per head: 32
  • MLP expansion: 4 (self-attn), 2 (cross-attn)
  • Patch/window size: 8 (no initial patch embedding; input is 64×64)
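
For reference, these settings can be collected into a single configuration sketch; the values are copied from the list above, but the field names are assumptions rather than the authors' code.

```python
# Hypothetical configuration summary of the reported training setup (field names assumed).
YUAN_TECSWIN_CONFIG = {
    "global_batch_size": 1024,
    "optimizer": "Adam",
    "lr_schedule": {"type": "cosine", "lr_max": 1.5e-4, "lr_min": 1.5e-5, "warmup_frac": 0.005},
    "finetune": {"epochs": 5, "lr": 1e-6},
    "swin_blocks_per_stage": [2, 2, 18, 2],   # encoder/decoder stages
    "stage1_hidden_channels": 128,
    "query_dim_per_head": 32,
    "mlp_expansion": {"self_attn": 4, "cross_attn": 2},
    "patch_window_size": 8,                   # no initial patch embedding
    "input_resolution": 64,                   # 64x64 inputs
}
```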

Quantitative Outcomes

Model                  ImageNet 64×64 FID   MS-COCO FID-30k
Yuan-TecSwin           1.37                 6.201
CDM (cascade)          1.48                 —
ADM                    2.07                 —
Improved DDPM          2.92                 —
BigGAN-deep            4.06                 —
U-Net (CNN baseline)   7.18                 —

Guidance scale search found an optimal value at 1.14, and time-step search confirmed best performance at 190 adapted steps.

Human Evaluation

In a side-by-side evaluation, 46 images (23 model-generated, 23 human-painted) at 256×256 were assessed by 124 participants in a Turing-style test. The accuracy of identifying model-created images was 51.4%, indicating that images produced by Yuan-TecSwin are effectively indistinguishable from real human works.

6. Ablation Studies and Design Insights

A series of ablation studies investigated architecture and hyperparameter choices:

  • MLP Expansion Ratios: The best early FID was obtained with expansion ratios of 4 (self-attn) and 2 (cross-attn); larger cross-attn expansion worsened FID.
  • Down/Upsampling Modules: 1×1 Conv2D with normalization outperformed Swin PatchMerging for downsampling (FID: 26.03 vs. 44.16); PixelShuffle with normalization outperformed PatchExpand for upsampling (26.03 vs. 33.15).
  • Scale-Shift Placement: Best performance when scale-shift occurs inside the residual path after normalization, shift, GELU, and window attention (FID: 28.82); moving this elsewhere degrades performance (up to FID 77.6).
  • Text Embedding Layer Selection: Averaging layers $\{1, 22, 24\}$ outperformed other selections (FID: 26.03 vs. 29.03 or 31.23).
  • Cross-Attention Frequency: One cross-attn after each SW-MSA yielded optimal results; adding it after every W-MSA gave no further gain.

Replacing CNNs with Swin-transformer blocks confers markedly improved long-range feature modeling and better or comparable performance with only one-fifth the parameter count of CNN-based U-Nets. The architecture also supports more natural and effective fusion of text conditioning signals via explicit key/value concatenation and cross-attention integration. These outcomes were systematically validated through quantitative and qualitative evaluation on standard benchmarks (Wu et al., 18 Dec 2025).
