Cloth Interactive Transformer (CIT)
- The paper introduces CIT, a two-stage architecture that uses transformer modules to capture long-range cloth-person interactions for improved image alignment.
- It integrates a geometric matching stage with TPS regression and a try-on synthesis stage using U-Net refinement to achieve photorealistic results.
- Empirical evaluations on the VITON dataset demonstrate competitive performance in SSIM, LPIPS, and PSNR, enhancing visual fidelity.
The Cloth Interactive Transformer (CIT) is a two-stage deep learning architecture designed for 2D image-based virtual try-on applications. Addressing limitations in prior convolutional neural network (CNN)-based approaches—particularly their inability to model long-range mutual dependencies between clothing and body—the CIT framework utilizes transformer-based modules to explicitly capture interactive relationships between a cloth-agnostic person representation and the target in-shop garment. Empirical results on the VITON dataset demonstrate that this design achieves competitive or superior visual fidelity and realism compared to state-of-the-art baselines (Ren et al., 2021).
1. Overall Pipeline and Data Flow
CIT operates through a structured two-stage process:
Inputs:
- Person image
- In-shop clothing image
- Cloth-agnostic person representation (18-channel pose, 1-channel body shape, 3-channel RGB preserved regions)
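The cloth-agnostic representation above is simply a channel-wise stack of the three maps listed; a minimal sketch (shapes and variable names are illustrative, not from the paper):

```python
import numpy as np

# Sketch: assembling the 22-channel cloth-agnostic person representation by
# channel-wise concatenation. Shapes are illustrative.
H, W = 256, 192
pose_heatmaps = np.zeros((18, H, W), dtype=np.float32)  # one heatmap per pose keypoint
body_shape    = np.zeros((1, H, W), dtype=np.float32)   # coarse body silhouette
preserved_rgb = np.zeros((3, H, W), dtype=np.float32)   # RGB regions to keep (e.g. face, hair)

person_repr = np.concatenate([pose_heatmaps, body_shape, preserved_rgb], axis=0)
assert person_repr.shape == (22, H, W)
```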
Stage I: Geometric Matching
- Feature extraction: CNN downsamplers extract feature maps from the cloth-agnostic person representation and the in-shop clothing image.
- Reshaping and local context integration: the feature maps are flattened into token sequences and enriched with local context.
- Interactive Transformer (CIT Matching Block): establishes long-range cloth-person correlations between the two sequences.
- Thin-plate-spline (TPS) regression: regresses the warp parameters from the correlated features, producing the warped clothing image and its warped mask.
Stage II: Try-On Synthesis
- Patch embedding and CNN encoding: produces feature sequences for the person representation, the warped clothing, and the warped mask.
- Interactive Transformer (CIT Reasoning Block): extracts mutual interactive dependencies among person, warped cloth, and mask, yielding fused interactive features.
- U-Net decoder: predicts a composition mask and a rendered person image.
- Output refinement: the composition mask blends the warped clothing with the rendered person to produce the final try-on image.
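The output refinement step follows the standard mask-blending form; a minimal sketch with illustrative shapes and names:

```python
import numpy as np

# Sketch of the final composition: the predicted composition mask blends the
# warped cloth with the rendered person. Variable names are illustrative.
rng = np.random.default_rng(0)
warped_cloth    = rng.random((3, 4, 4)).astype(np.float32)
rendered_person = rng.random((3, 4, 4)).astype(np.float32)
comp_mask       = rng.random((1, 4, 4)).astype(np.float32)  # values in [0, 1]

# Where the mask is 1 the warped cloth shows through; where it is 0, the rendering does.
try_on = comp_mask * warped_cloth + (1.0 - comp_mask) * rendered_person
```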
2. CIT Matching Block (Stage I)
The CIT Matching Block encodes both self-attention and cross-attention mechanisms:
- Inputs: the person and cloth feature maps, each reshaped into a 1-D token sequence.
- Self-attention: each modality projects its sequence into queries, keys, and values and attends over itself.
- Cross-attention: computed in both directions, with queries from one modality and keys/values from the other.
- Transformer encoders: apply selfTrans and crossTrans operators to obtain self-enhanced and cross-modal features.
- Global strengthened attention: the self- and cross-attention outputs of each modality are fused to strengthen global correlations.
- Final output: the strengthened person and cloth features, which drive the subsequent correlation and TPS regression.
This design allows the model to align garments onto the person by learning pixel-level context and mutual dependencies.
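The self- and cross-attention directions above can be illustrated with plain scaled dot-product attention; learned query/key/value projections and multi-head structure are omitted, and all names and shapes are illustrative:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over token sequences of shape (seq_len, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
person_tokens = rng.standard_normal((16, 32))  # flattened person features
cloth_tokens  = rng.standard_normal((16, 32))  # flattened cloth features

# Self-attention: each modality attends over itself.
person_self = attention(person_tokens, person_tokens, person_tokens)
# Cross-attention, both directions: queries from one modality, keys/values from the other.
person_from_cloth = attention(person_tokens, cloth_tokens, cloth_tokens)
cloth_from_person = attention(cloth_tokens, person_tokens, person_tokens)
```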
3. CIT Reasoning Block (Stage II)
The CIT Reasoning Block models holistic interactions necessary for photo-realistic synthesis:
- Inputs: token sequences for the person representation, the warped cloth, and the warped mask, produced by the CNN encoders and patch embedding.
- Transformer II architecture:
- Three self-attention legs: Independently process person, warped cloth, and mask.
- Six cross-modal legs: for each ordered pair among person, cloth, and mask, compute cross-attention with queries from one modality and keys/values from the other.
- Concatenation: For each modality, combine outputs from both cross-modal attentions.
- Final output: the concatenated, interaction-enhanced feature sequence.
This output is linearly decoded and guides the mask refinement and composition via the U-Net. The reasoning block accounts for global dependencies and ensures plausible spatial coherence among person, garment, and mask.
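The leg structure described above — one self-attention leg per modality plus one cross-modal leg per ordered pair — can be enumerated directly:

```python
from itertools import permutations

# Sketch: enumerating the attention "legs" of the CIT Reasoning Block.
# Three self-attention legs (one per modality) plus six cross-modal legs,
# one for each ordered pair of modalities.
modalities = ["person", "warped_cloth", "mask"]

self_legs = [(m, m) for m in modalities]
cross_legs = list(permutations(modalities, 2))

# e.g. ("person", "warped_cloth") means queries from person, keys/values from cloth.
assert len(self_legs) == 3 and len(cross_legs) == 6
```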
4. Optimization, Losses, and Training Regimen
Stage I Loss:
A pixel-wise warping loss between the warped clothing and the target clothing region, plus a regularization term that penalizes local TPS grid gradients to discourage excessive distortion.
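One common way to penalize local TPS grid gradients is a second-order smoothness term on the sampling grid; a sketch under that assumption (the paper's exact formulation and weighting may differ):

```python
import numpy as np

# Sketch of a TPS grid regularizer: penalize variation between neighboring
# sampling-grid offsets so the warp stays locally smooth.
def grid_smoothness(grid):
    """grid: (H, W, 2) sampling coordinates. Returns a scalar penalty."""
    dx = np.diff(grid, axis=1)  # horizontal neighbor differences
    dy = np.diff(grid, axis=0)  # vertical neighbor differences
    # Penalize variation of those differences (second-order smoothness).
    return np.abs(np.diff(dx, axis=1)).sum() + np.abs(np.diff(dy, axis=0)).sum()

# A uniform (affine) grid has constant neighbor differences, so near-zero penalty;
# locally bending the grid makes the penalty positive.
ys, xs = np.meshgrid(np.linspace(-1, 1, 8), np.linspace(-1, 1, 6), indexing="ij")
uniform_grid = np.stack([xs, ys], axis=-1)
```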
Stage II Loss:
A reconstruction loss between the synthesized and ground-truth try-on images, combined with a perceptual loss computed at intermediate layers of VGG19.
Optimization:
- Both stages are trained with the Adam optimizer at batch size 4 for up to 200K steps, with the initial learning rate decayed linearly after 100K steps.
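The linear decay after 100K steps can be expressed as a small schedule function; `base_lr` here is illustrative, not a value taken from the paper:

```python
# Sketch of the learning-rate schedule described above: constant for the first
# 100K steps, then linear decay to zero at 200K. base_lr is illustrative.
def lr_at(step, base_lr=1e-4, keep_steps=100_000, total_steps=200_000):
    if step <= keep_steps:
        return base_lr
    frac = (total_steps - step) / (total_steps - keep_steps)
    return base_lr * max(frac, 0.0)
```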
This multi-objective loss structure is tailored to preserve structural and perceptual fidelity between inputs and synthesized try-on outputs.
5. Dataset and Quantitative Evaluation
Dataset: The cleaned VITON dataset (Han et al., 2018), comprising 16,253 front-view woman/top image pairs, with 14,221 for training and 2,032 for validation.
Evaluation Protocols:
- Retry-on (paired): the person wears the same garment as the in-shop cloth, enabling ground-truth comparison.
- Try-on (unpaired): a different in-shop cloth is applied, so no ground truth is available.
Metrics:
| Stage | Task | Metric | Description |
|---|---|---|---|
| Stage I | Paired | Jaccard Score | IoU of warped mask vs. GT mask |
| Stage II | Paired | SSIM, LPIPS, PSNR | Structural similarity, perceptual similarity, peak signal-to-noise ratio |
| Stage II | Unpaired | IS, FID, KID | Inception Score, Fréchet Inception Distance, Kernel Inception Distance |
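Of the metrics above, PSNR and the Jaccard score are simple enough to sketch directly (SSIM, LPIPS, and the unpaired metrics require dedicated models or libraries):

```python
import numpy as np

# Minimal sketches of two paired metrics: PSNR on images and Jaccard score (IoU)
# on binary masks.
def psnr(a, b, max_val=1.0):
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def jaccard(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 1.0

a = np.array([[1, 1], [0, 0]], dtype=bool)
b = np.array([[1, 0], [0, 0]], dtype=bool)
# jaccard(a, b) == 0.5: one pixel in the intersection, two in the union.
```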
Results (Table 3):
| Method | JS | SSIM | LPIPS | PSNR | IS | FID | KID |
|---|---|---|---|---|---|---|---|
| CP-VTON | 0.759 | 0.800 | 0.126 | 14.54 | 2.832 | 35.16 | 2.245 |
| CP-VTON+ | 0.812 | 0.817 | 0.117 | 21.79 | 3.074 | 25.19 | 1.586 |
| ACGPN | — | 0.846 | 0.121 | 23.08 | 2.924 | 13.79 | 0.818 |
| CIT | 0.800 | 0.827 | 0.115 | 23.46 | 3.060 | 13.97 | 0.761 |
User Study: In a controlled survey (120 cases, 30 users), CIT was rated as producing the “most photo-realistic” and “best preserves cloth detail” images in 32.1% and 35.4% of cases, respectively, surpassing the 20–25% scored by other methods.
6. Qualitative Behavior and Ablation
CIT demonstrates superior geometric fidelity and visual realism. Stage I generates sharper, logo-preserving, and pattern-aligned warps, while Stage II minimizes artifacts at sleeve boundaries and preserves consistent alignment between clothing and body. Ablation studies confirm that combining the CIT Matching and Reasoning blocks yields the highest visual fidelity, with each component contributing distinct improvements in geometric alignment and photorealism.
7. Analysis, Limitations, and Future Research Directions
Strengths:
- Cross-modal attention mechanisms between cloth and person representation explicitly capture long-range and global dependencies, driving improved texture alignment and detail preservation.
- The two-stage framework achieves significantly sharper warps and more natural, artifact-free images than approaches relying solely on CNN modules.
Limitations:
- Performance degrades when reference and in-shop garments exhibit substantial mismatches, particularly in silhouette or texture, which can disrupt mask alignment.
- Severe self-occlusions or extreme pose/shape deviations can produce visible blurring or synthesis artifacts.
Potential Extensions:
- Incorporation of finer segmentation maps or explicit body-part priors to better resolve coverage ambiguities.
- Leveraging 3D body meshes or additional cloth priors to address limitations arising from extreme deformations not handled by 2D projections.
These directions target the core challenges of occlusion handling, cross-instance generalization, and geometric robustness inherent in 2D image-based virtual try-on systems (Ren et al., 2021).