Cloth Interactive Transformer (CIT)
- The paper introduces CIT, a two-stage architecture that uses transformer modules to capture long-range cloth-person interactions for improved image alignment.
- It integrates a geometric matching stage with TPS regression and a try-on synthesis stage using U-Net refinement to achieve photorealistic results.
- Empirical evaluations on the VITON dataset demonstrate competitive performance in SSIM, LPIPS, and PSNR, enhancing visual fidelity.
The Cloth Interactive Transformer (CIT) is a two-stage deep learning architecture designed for 2D image-based virtual try-on applications. Addressing limitations in prior convolutional neural network (CNN)-based approaches—particularly their inability to model long-range mutual dependencies between clothing and body—the CIT framework utilizes transformer-based modules to explicitly capture interactive relationships between a cloth-agnostic person representation and the target in-shop garment. Empirical results on the VITON dataset demonstrate that this design achieves competitive or superior visual fidelity and realism compared to state-of-the-art baselines (Ren et al., 2021).
1. Overall Pipeline and Data Flow
CIT operates through a structured two-stage process:
Inputs:
- Person image
- In-shop clothing image
- Cloth-agnostic person representation (18-channel pose, 1-channel body shape, 3-channel RGB preserved regions)
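The cloth-agnostic representation above is simply a channel-wise stack of the three maps listed; a minimal sketch (shapes and variable names are illustrative, not from the paper):

```python
import numpy as np

# Sketch: assembling the 22-channel cloth-agnostic person representation by
# channel-wise concatenation. Shapes are illustrative.
H, W = 256, 192
pose_heatmaps = np.zeros((18, H, W), dtype=np.float32)  # one heatmap per pose keypoint
body_shape    = np.zeros((1, H, W), dtype=np.float32)   # coarse body silhouette
preserved_rgb = np.zeros((3, H, W), dtype=np.float32)   # RGB regions to keep (e.g. face, hair)

person_repr = np.concatenate([pose_heatmaps, body_shape, preserved_rgb], axis=0)
assert person_repr.shape == (22, H, W)
```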
Stage I: Geometric Matching
- Feature extraction: CNN downsamplers extract feature maps from the cloth-agnostic person representation and the in-shop clothing image.
- Reshaping and local context integration: the feature maps are flattened into token sequences and enriched with local context.
- Interactive Transformer (CIT Matching Block): establishes long-range cloth-person correlations between the two sequences.
- Thin-plate-spline (TPS) regression: regresses the warp parameters from the correlated features, producing the warped clothing image and its warped mask.
Stage II: Try-On Synthesis
- Patch embedding and CNN encoding: produces feature sequences for the person representation, the warped clothing, and the warped mask.
- Interactive Transformer (CIT Reasoning Block): extracts mutual interactive dependencies among person, warped cloth, and mask, yielding fused interactive features.
- U-Net decoder: predicts a composition mask and a rendered person image.
- Output refinement: the composition mask blends the warped clothing with the rendered person to produce the final try-on image.
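The output refinement step follows the standard mask-blending form; a minimal sketch with illustrative shapes and names:

```python
import numpy as np

# Sketch of the final composition: the predicted composition mask blends the
# warped cloth with the rendered person. Variable names are illustrative.
rng = np.random.default_rng(0)
warped_cloth    = rng.random((3, 4, 4)).astype(np.float32)
rendered_person = rng.random((3, 4, 4)).astype(np.float32)
comp_mask       = rng.random((1, 4, 4)).astype(np.float32)  # values in [0, 1]

# Where the mask is 1 the warped cloth shows through; where it is 0, the rendering does.
try_on = comp_mask * warped_cloth + (1.0 - comp_mask) * rendered_person
```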
2. CIT Matching Block (Stage I)
The CIT Matching Block encodes both self-attention and cross-attention mechanisms:
- Inputs: the person and cloth feature maps, each reshaped into a 1-D token sequence.
- Self-attention: each modality projects its sequence into queries, keys, and values and attends over itself.
- Cross-attention: computed in both directions, with queries from one modality and keys/values from the other.
- Transformer encoders: apply selfTrans and crossTrans operators to obtain self-enhanced and cross-modal features.
- Global strengthened attention: the self- and cross-attention outputs of each modality are fused to strengthen global correlations.
- Final output: the strengthened person and cloth features, which drive the subsequent correlation and TPS regression.
This design allows the model to align garments onto the person by learning pixel-level context and mutual dependencies.
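The self- and cross-attention directions above can be illustrated with plain scaled dot-product attention; learned query/key/value projections and multi-head structure are omitted, and all names and shapes are illustrative:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over token sequences of shape (seq_len, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
person_tokens = rng.standard_normal((16, 32))  # flattened person features
cloth_tokens  = rng.standard_normal((16, 32))  # flattened cloth features

# Self-attention: each modality attends over itself.
person_self = attention(person_tokens, person_tokens, person_tokens)
# Cross-attention, both directions: queries from one modality, keys/values from the other.
person_from_cloth = attention(person_tokens, cloth_tokens, cloth_tokens)
cloth_from_person = attention(cloth_tokens, person_tokens, person_tokens)
```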
3. CIT Reasoning Block (Stage II)
The CIT Reasoning Block models holistic interactions necessary for photo-realistic synthesis:
- Inputs: token sequences for the person representation, the warped cloth, and the warped mask, produced by the CNN encoders and patch embedding.
- Transformer II architecture:
- Three self-attention legs: Independently process person, warped cloth, and mask.
- Six cross-modal legs: for each ordered pair among person, cloth, and mask, compute cross-attention with queries from one modality and keys/values from the other.
- Concatenation: For each modality, combine outputs from both cross-modal attentions.
- Final output: the concatenated, interaction-enhanced feature sequence.
This output is linearly decoded and guides the mask refinement and composition via the U-Net. The reasoning block accounts for global dependencies and ensures plausible spatial coherence among person, garment, and mask.
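The leg structure described above — one self-attention leg per modality plus one cross-modal leg per ordered pair — can be enumerated directly:

```python
from itertools import permutations

# Sketch: enumerating the attention "legs" of the CIT Reasoning Block.
# Three self-attention legs (one per modality) plus six cross-modal legs,
# one for each ordered pair of modalities.
modalities = ["person", "warped_cloth", "mask"]

self_legs = [(m, m) for m in modalities]
cross_legs = list(permutations(modalities, 2))

# e.g. ("person", "warped_cloth") means queries from person, keys/values from cloth.
assert len(self_legs) == 3 and len(cross_legs) == 6
```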
4. Optimization, Losses, and Training Regimen
Stage I Loss:
A pixel-wise warping loss between the warped clothing and the target clothing region, plus a regularization term that penalizes local TPS grid gradients to discourage excessive distortion.
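One common way to penalize local TPS grid gradients is a second-order smoothness term on the sampling grid; a sketch under that assumption (the paper's exact formulation and weighting may differ):

```python
import numpy as np

# Sketch of a TPS grid regularizer: penalize variation between neighboring
# sampling-grid offsets so the warp stays locally smooth.
def grid_smoothness(grid):
    """grid: (H, W, 2) sampling coordinates. Returns a scalar penalty."""
    dx = np.diff(grid, axis=1)  # horizontal neighbor differences
    dy = np.diff(grid, axis=0)  # vertical neighbor differences
    # Penalize variation of those differences (second-order smoothness).
    return np.abs(np.diff(dx, axis=1)).sum() + np.abs(np.diff(dy, axis=0)).sum()

# A uniform (affine) grid has constant neighbor differences, so near-zero penalty;
# locally bending the grid makes the penalty positive.
ys, xs = np.meshgrid(np.linspace(-1, 1, 8), np.linspace(-1, 1, 6), indexing="ij")
uniform_grid = np.stack([xs, ys], axis=-1)
```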
Stage II Loss:
A reconstruction loss between the synthesized and ground-truth try-on images, combined with a perceptual loss computed at intermediate layers of VGG19.
Optimization:
- Both stages are trained with the Adam optimizer at batch size 4 for up to 200K steps, with the initial learning rate decayed linearly after 100K steps.
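The linear decay after 100K steps can be expressed as a small schedule function; `base_lr` here is illustrative, not a value taken from the paper:

```python
# Sketch of the learning-rate schedule described above: constant for the first
# 100K steps, then linear decay to zero at 200K. base_lr is illustrative.
def lr_at(step, base_lr=1e-4, keep_steps=100_000, total_steps=200_000):
    if step <= keep_steps:
        return base_lr
    frac = (total_steps - step) / (total_steps - keep_steps)
    return base_lr * max(frac, 0.0)
```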
This multi-objective loss structure is tailored to preserve structural and perceptual fidelity between inputs and synthesized try-on outputs.
5. Dataset and Quantitative Evaluation
Dataset: The cleaned VITON dataset (Han et al., 2018), comprising 16,253 front-view woman/top image pairs, with 14,221 for training and 2,032 for validation.
Evaluation Protocols:
- Retry-on (paired): the person wears the same garment as the in-shop cloth, enabling ground-truth comparison.
- Try-on (unpaired): a different in-shop cloth is applied, so no ground truth is available.
Metrics:
| Stage | Task | Metric | Description |
|---|---|---|---|
| Stage I | Paired | Jaccard Score | IoU of warped mask vs. GT mask |
| Stage II | Paired | SSIM, LPIPS, PSNR | Structural similarity, perceptual similarity, peak signal-to-noise ratio |
| Stage II | Unpaired | IS, FID, KID | Inception Score, Fréchet Inception Distance, Kernel Inception Distance |
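Of the metrics above, PSNR and the Jaccard score are simple enough to sketch directly (SSIM, LPIPS, and the unpaired metrics require dedicated models or libraries):

```python
import numpy as np

# Minimal sketches of two paired metrics: PSNR on images and Jaccard score (IoU)
# on binary masks.
def psnr(a, b, max_val=1.0):
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def jaccard(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 1.0

a = np.array([[1, 1], [0, 0]], dtype=bool)
b = np.array([[1, 0], [0, 0]], dtype=bool)
# jaccard(a, b) == 0.5: one pixel in the intersection, two in the union.
```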
Results (Table 3):
| Method | JS | SSIM | LPIPS | PSNR | IS | FID | KID |
|---|---|---|---|---|---|---|---|
| CP-VTON | 0.759 | 0.800 | 0.126 | 14.54 | 2.832 | 35.16 | 2.245 |
| CP-VTON+ | 0.812 | 0.817 | 0.117 | 21.79 | 3.074 | 25.19 | 1.586 |
| ACGPN | — | 0.846 | 0.121 | 23.08 | 2.924 | 13.79 | 0.818 |
| CIT | 0.800 | 0.827 | 0.115 | 23.46 | 3.060 | 13.97 | 0.761 |
User Study: In a controlled survey (120 cases, 30 users), CIT was rated as producing the “most photo-realistic” and “best preserves cloth detail” images in 32.1% and 35.4% of cases, respectively, surpassing the 20–25% scored by other methods.
6. Qualitative Behavior and Ablation
CIT demonstrates superior geometric fidelity and visual realism. Stage I generates sharper, logo-preserving, and pattern-aligned warps, while Stage II minimizes artifacts at sleeve boundaries and preserves consistent alignment between clothing and body. Ablation studies confirm that combining the CIT Matching and Reasoning blocks yields the highest visual fidelity, with each component contributing distinct improvements in geometric alignment and photorealism.
7. Analysis, Limitations, and Future Research Directions
Strengths:
- Cross-modal attention mechanisms between cloth and person representation explicitly capture long-range and global dependencies, driving improved texture alignment and detail preservation.
- The two-stage framework achieves significantly sharper warps and more natural, artifact-free images than approaches relying solely on CNN modules.
Limitations:
- Performance degrades when reference and in-shop garments exhibit substantial mismatches, particularly in silhouette or texture, which can disrupt mask alignment.
- Severe self-occlusions or extreme pose/shape deviations can produce visible blurring or synthesis artifacts.
Potential Extensions:
- Incorporation of finer segmentation maps or explicit body-part priors to better resolve coverage ambiguities.
- Leveraging 3D body meshes or additional cloth priors to address limitations arising from extreme deformations not handled by 2D projections.
These directions target the core challenges of occlusion handling, cross-instance generalization, and geometric robustness inherent in 2D image-based virtual try-on systems (Ren et al., 2021).