
Cloth Interactive Transformer (CIT)

Updated 11 March 2026
  • The paper introduces CIT, a two-stage architecture that uses transformer modules to capture long-range cloth-person interactions for improved image alignment.
  • It integrates a geometric matching stage with TPS regression and a try-on synthesis stage using U-Net refinement to achieve photorealistic results.
  • Empirical evaluations on the VITON dataset demonstrate competitive performance in SSIM, LPIPS, and PSNR, enhancing visual fidelity.

The Cloth Interactive Transformer (CIT) is a two-stage deep learning architecture designed for 2D image-based virtual try-on applications. Addressing limitations in prior convolutional neural network (CNN)-based approaches—particularly their inability to model long-range mutual dependencies between clothing and body—the CIT framework utilizes transformer-based modules to explicitly capture interactive relationships between a cloth-agnostic person representation and the target in-shop garment. Empirical results on the VITON dataset demonstrate that this design achieves competitive or superior visual fidelity and realism compared to state-of-the-art baselines (Ren et al., 2021).

1. Overall Pipeline and Data Flow

CIT operates through a structured two-stage process:

Inputs:

  • Person image $I \in \mathbb{R}^{3 \times H \times W}$
  • In-shop clothing image $c \in \mathbb{R}^{3 \times H \times W}$
  • Cloth-agnostic person representation $p \in \mathbb{R}^{22 \times H \times W}$ (18-channel pose, 1-channel body shape, 3-channel RGB preserved regions)

Stage I: Geometric Matching

  • Feature extraction: CNN downsamplers produce $X_p, X_c \in \mathbb{R}^{B \times C \times H \times W}$.
  • Reshaping and local context integration: Sequences $F_p, F_c \in \mathbb{R}^{B \times C \times S}$ ($S = H \cdot W$).
  • Interactive Transformer (CIT Matching Block): Establishes long-range cloth–person correlations, outputting $X_{\text{out-I}}$.
  • Thin-plate-spline (TPS) regression: $X_{\text{out-I}}$ regresses warp parameters $\theta$, producing warped clothing $\hat{c}$ and mask $\hat{c}_m$.

Stage II: Try-On Synthesis

  • Patch embedding and CNN encoding: Produces features $X_p$, $X_{\hat{c}}$, $X_{\hat{c}_m}$.
  • Interactive Transformer (CIT Reasoning Block): Extracts mutual interactive dependencies among person, warped cloth, and mask, yielding $X_{\text{out-II}}$.
  • U-Net decoder predicts composition mask $M_o$ and rendered person $I^R$.
  • Output refinement: $I^R_{\text{glob}} = \sigma(X_{\text{out-II}}) \odot I^R$, then $I_o = M_o \odot \hat{c} + (1 - M_o) \odot I^R_{\text{glob}}$.
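The output-refinement step above is a simple element-wise composition. The following sketch shows it in PyTorch, assuming the transformer output has already been projected to image resolution (function and variable names are illustrative, not from the paper):

```python
import torch

def compose_output(x_out_ii, i_rendered, m_o, c_warped):
    """Stage II composition, following the formulas in the text.
    x_out_ii, i_rendered, c_warped: (B, 3, H, W); m_o: (B, 1, H, W)."""
    # Globally strengthened render: I_glob = sigmoid(X_out-II) ⊙ I^R
    i_glob = torch.sigmoid(x_out_ii) * i_rendered
    # Blend warped cloth and render via the composition mask:
    # I_o = M_o ⊙ c_hat + (1 - M_o) ⊙ I_glob
    return m_o * c_warped + (1.0 - m_o) * i_glob
```

Note that where $M_o = 1$ the warped cloth $\hat{c}$ passes through unchanged, which is what lets the composition preserve garment texture.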

2. CIT Matching Block (Stage I)

The CIT Matching Block encodes both self-attention and cross-attention mechanisms:

  • Inputs: $X_p$, $X_c$ reshaped into sequences $F_p$, $F_c$.
  • Self-attention: Each modality computes $Q_p = W_Q F_p$, $K_p = W_K F_p$, $V_p = W_V F_p$ (and analogously for $F_c$).
  • Cross-attention, in two directions:
    • $A_{p \leftarrow c} = \text{softmax}\left(\frac{Q_p K_c^T}{\sqrt{d}}\right) V_c$
    • $A_{c \leftarrow p} = \text{softmax}\left(\frac{Q_c K_p^T}{\sqrt{d}}\right) V_p$
  • Transformer encoders: Apply selfTrans and crossTrans operators to obtain $F_p'$, $F_c'$ and cross-modal features $X_{\text{cross}}^1$.
  • Global strengthened attention: $X_{(\cdot)}^{\text{glob}} = X_{(\cdot)} + X_{(\cdot)} \odot \sigma(\text{Linear}(X_{\text{cross}}^1))$ for $(\cdot) = p$ or $c$.
  • Final output: $X_{\text{out-I}} = \text{reshape}\left[(X_c^{\text{glob}})^T \cdot X_p^{\text{glob}}\right]$.

This design allows the model to align garments onto the person by learning pixel-level context and mutual dependencies.
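A single direction of the cross-attention above can be sketched as follows. This is a minimal illustration of the scaled-dot-product form given in the equations, with plain weight matrices standing in for the learned linear projections:

```python
import torch
import torch.nn.functional as F

def cross_attention(f_q, f_kv, w_q, w_k, w_v):
    """One cross-attention direction: queries from one modality,
    keys/values from the other. f_q: (B, S_q, d); f_kv: (B, S_kv, d);
    w_q, w_k, w_v: (d, d) projection matrices (illustrative stand-ins)."""
    q, k, v = f_q @ w_q, f_kv @ w_k, f_kv @ w_v
    d = q.shape[-1]
    # softmax(Q K^T / sqrt(d)) V, as in A_{p<-c} and A_{c<-p}
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v
```

Calling it once with person queries against cloth keys/values gives $A_{p \leftarrow c}$; swapping the arguments gives $A_{c \leftarrow p}$.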

3. CIT Reasoning Block (Stage II)

The CIT Reasoning Block models holistic interactions necessary for photo-realistic synthesis:

  • Inputs: Sequences $X_p$, $X_{\hat{c}}$, $X_{\hat{c}_m}$ from CNN and patch embedding.
  • Transformer II architecture:
    • Three self-attention legs: Independently process person, warped cloth, and mask.
    • Six cross-modal legs: For all ordered pairs $(m_1, m_2)$ in {person, cloth, mask}, compute $\text{crossTrans}(X_{m_1}', X_{m_2}')$.
    • Concatenation: For each modality, combine the outputs of both of its cross-modal attentions.
    • Final output: $X_{\text{cross}}^2 = \text{concat}(X_p^{\text{cross}}, X_{\hat{c}}^{\text{cross}}, X_{\hat{c}_m}^{\text{cross}})$.

$X_{\text{cross}}^2$ is linearly decoded and guides the mask refinement and composition via the U-Net. This reasoning block corrects for global dependencies and ensures plausible spatial coherence among person, garment, and mask.
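The pairing-and-concatenation structure of the Reasoning Block can be sketched as below; `cross` is a placeholder for any cross-attention operator that returns a tensor shaped like its first (query) argument, and the function name is illustrative:

```python
import torch

def reasoning_fusion(x_p, x_c, x_m, cross):
    """Six cross-modal legs over the three modalities (person, warped
    cloth, mask), then per-modality concatenation, then a final concat
    forming X_cross^2. Each input: (B, S, d)."""
    legs = {
        "p": [cross(x_p, x_c), cross(x_p, x_m)],  # person attends to cloth, mask
        "c": [cross(x_c, x_p), cross(x_c, x_m)],  # cloth attends to person, mask
        "m": [cross(x_m, x_p), cross(x_m, x_c)],  # mask attends to person, cloth
    }
    per_modality = [torch.cat(v, dim=-1) for v in legs.values()]
    return torch.cat(per_modality, dim=-1)  # X_cross^2
```

With feature dimension $d$ per modality, the fused output has dimension $6d$, which is then linearly decoded before entering the U-Net.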

4. Optimization, Losses, and Training Regimen

Stage I Loss:

$L_{\text{Match}} = L_1(\hat{c}, c_t) + \frac{1}{2} L_{\text{reg}}$

where $L_{\text{reg}}$ penalizes local TPS grid gradients.
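One common way to penalize local grid gradients is to punish second differences of the TPS sampling grid so that neighboring offsets stay locally smooth. The exact form used in the paper may differ; this is a hedged sketch of the idea:

```python
import torch

def tps_grid_regularizer(grid):
    """Smoothness penalty on a TPS sampling grid of shape (B, H, W, 2).
    Penalizes second differences along both spatial axes; an affine
    (uniformly spaced) grid incurs zero penalty."""
    dx = grid[:, :, 1:, :] - grid[:, :, :-1, :]   # horizontal first differences
    dy = grid[:, 1:, :, :] - grid[:, :-1, :, :]   # vertical first differences
    ddx = (dx[:, :, 1:, :] - dx[:, :, :-1, :]).abs().mean()
    ddy = (dy[:, 1:, :, :] - dy[:, :-1, :, :]).abs().mean()
    return ddx + ddy
```

Because an undistorted grid has constant first differences, the penalty only activates where the warp bends locally, which discourages implausible, high-frequency deformations.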

Stage II Loss:

$L_{\text{Try}} = \|I_o - I_{GT}\|_1 + L_{\text{VGG}} + \|M_o - c_{tm}\|_1$

with $L_{\text{VGG}}$ denoting a perceptual loss computed at intermediate layers of VGG19.
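The Stage II objective is a straightforward sum of three terms. In the sketch below, `perceptual_fn` is a hypothetical hook standing in for the VGG19 feature-matching loss (e.g. a torchvision-based extractor), not part of the paper's interface:

```python
import torch
import torch.nn.functional as F

def try_on_loss(i_o, i_gt, m_o, c_tm, perceptual_fn):
    """L_Try = |I_o - I_GT|_1 + L_VGG + |M_o - c_tm|_1.
    i_o, i_gt: synthesized and ground-truth images; m_o: predicted
    composition mask; c_tm: target cloth mask."""
    return (F.l1_loss(i_o, i_gt)          # pixel-level reconstruction
            + perceptual_fn(i_o, i_gt)    # VGG19 perceptual term
            + F.l1_loss(m_o, c_tm))       # mask supervision
```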

Optimization:

  • Both stages use Adam ($\beta_1 = 0.5$, $\beta_2 = 0.999$), initial learning rate $10^{-4}$ (linear decay after 100K steps), batch size 4, trained for up to 200K steps, at input/output resolution $256 \times 192$.
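This schedule maps directly onto standard PyTorch tooling. A minimal sketch, assuming the decay runs linearly from step 100K to zero at step 200K (`model` is a placeholder for either stage's network):

```python
import torch

# Adam with beta1 = 0.5, beta2 = 0.999 and initial lr 1e-4, as described.
model = torch.nn.Linear(8, 8)  # placeholder network
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))

def lr_lambda(step):
    """Multiplier on the base lr: 1.0 for the first 100K steps,
    then linear decay to 0 at 200K."""
    if step < 100_000:
        return 1.0
    return max(0.0, (200_000 - step) / 100_000)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```

Each training iteration would call `opt.step()` followed by `sched.step()`.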

This multi-objective loss structure is tailored to preserve structural and perceptual fidelity between inputs and synthesized try-on outputs.

5. Dataset and Quantitative Evaluation

Dataset: The cleaned VITON dataset (Han et al., 2018), comprising 16,253 person–top pairs (front-view women), with 14,221 for training and 2,032 for validation.

Evaluation Protocols:

  • Retry-on: Same garment as worn, enabling ground truth comparison.
  • Try-on: Different in-shop cloth, ground truth unavailable.

Metrics:

| Stage | Task | Metrics | Description |
| --- | --- | --- | --- |
| Stage I | Paired | Jaccard score | IoU of warped mask vs. ground-truth mask |
| Stage II | Paired | SSIM, LPIPS, PSNR | Structural similarity, perceptual similarity, peak signal-to-noise ratio |
| Stage II | Unpaired | IS, FID, KID | Inception Score, Fréchet Inception Distance, Kernel Inception Distance |
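The Stage I Jaccard score is a plain intersection-over-union between binary masks. A minimal sketch:

```python
import torch

def jaccard_score(pred_mask, gt_mask, eps=1e-6):
    """IoU between a warped-cloth mask and the ground-truth cloth mask,
    both binary tensors of identical shape."""
    pred = pred_mask.bool()
    gt = gt_mask.bool()
    inter = (pred & gt).float().sum()
    union = (pred | gt).float().sum()
    return (inter / (union + eps)).item()
```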

Results (Table 3):

| Method | JS | SSIM | LPIPS | PSNR | IS | FID | KID |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CP-VTON | 0.759 | 0.800 | 0.126 | 14.54 | 2.832 | 35.16 | 2.245 |
| CP-VTON+ | 0.812 | 0.817 | 0.117 | 21.79 | 3.074 | 25.19 | 1.586 |
| ACGPN | — | 0.846 | 0.121 | 23.08 | 2.924 | 13.79 | 0.818 |
| CIT | 0.800 | 0.827 | 0.115 | 23.46 | 3.060 | 13.97 | 0.761 |

User Study: In a controlled survey (120 cases, 30 users), CIT was rated as producing the “most photo-realistic” and “best preserves cloth detail” images in 32.1% and 35.4% of cases, respectively, surpassing the roughly 20–25% scored by other methods.

6. Qualitative Behavior and Ablation

CIT demonstrates superior geometric fidelity and visual realism. Stage I generates sharper, logo-preserving, and pattern-aligned warps, while Stage II minimizes artifacts at sleeve boundaries and preserves consistent alignment between clothing and body. Ablation studies confirm that combining the CIT Matching and Reasoning blocks yields the highest visual fidelity, with each component contributing distinct improvements in geometric alignment and photorealism.

7. Analysis, Limitations, and Future Research Directions

Strengths:

  • Cross-modal attention mechanisms between cloth and person representation explicitly capture long-range and global dependencies, driving improved texture alignment and detail preservation.
  • The two-stage framework achieves significantly sharper warps and more natural, artifact-free images than approaches relying solely on CNN modules.

Limitations:

  • Performance degrades when reference and in-shop garments exhibit substantial mismatches, particularly in silhouette or texture, which can disrupt mask alignment.
  • Severe self-occlusions or extreme pose/shape deviations can produce visible blurring or synthesis artifacts.

Potential Extensions:

  • Incorporation of finer segmentation maps or explicit body-part priors to better resolve coverage ambiguities.
  • Leveraging 3D body meshes or additional cloth priors to address limitations arising from extreme deformations not handled by 2D projections.

These directions target the core challenges of occlusion handling, cross-instance generalization, and geometric robustness inherent in 2D image-based virtual try-on systems (Ren et al., 2021).
