SwiftTailor Framework

Updated 25 March 2026
  • SwiftTailor is a two-stage framework integrating sewing-pattern reasoning and geometry-based mesh synthesis for efficient 3D garment generation.
  • It employs a multimodal PatternMaker and a high-performance GarmentSewer module to deliver scalable, interpretable digital garment pipelines.
  • Empirical evaluations demonstrate state-of-the-art accuracy and significantly reduced inference time compared to prior vision-language and physics-based approaches.

SwiftTailor is a two-stage framework for efficient and realistic 3D garment generation in computer vision and digital fashion. By integrating sewing-pattern reasoning with geometry-based mesh synthesis via a compact geometry image representation, SwiftTailor offers a scalable and interpretable basis for digital garment pipelines. The framework accelerates inference while yielding state-of-the-art accuracy and visual fidelity compared to prior methods that pair large vision-language models with physics-based simulation (Pham et al., 19 Mar 2026).

1. Framework Architecture and Pipeline

SwiftTailor employs a two-stage pipeline unifying multimodal pattern inference and mesh synthesis:

  1. PatternMaker: This stage ingests one or more inputs—including garment images, sketches, and/or text descriptions—and outputs a structured sewing pattern. The sewing pattern comprises:
    • A set of panels $P = \{P_i = (V_i, E_i, R_i)\}_{i=1}^N$, where $V_i$ are 2D panel vertices, $E_i$ are panel edges, and $R_i$ denotes the 3D rigid placement of panel $i$.
    • A stitching set $S = \{s_k = (e_a, e_b)\}_{k=1}^M$ indicating edge pairs to be stitched.
  2. GarmentSewer + Garment Geometry Image (GGI): This stage uses the predicted sewing pattern to generate a UV-packed geometry image, consisting of:
    • A semantic map: encodes panel types via color in UV space.
    • A stitching map: colors boundary edges by seam IDs.
    • A geometry map: stores 3D coordinates $(x, y, z)$ for each UV pixel.

The GarmentSewer dense prediction transformer predicts the complete geometry image from the semantic and stitching maps. The result is then post-processed into a watertight 3D mesh through inverse mapping, remeshing, and dynamic stitching (Pham et al., 19 Mar 2026).
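The sewing-pattern structure described above can be sketched as a small data model. This is only an illustration: the class names, the Euler-angle/translation encoding of $R_i$, and the (panel index, edge index) addressing of stitched edges are assumptions, not the representation used by the authors.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]
EdgeRef = Tuple[int, int]  # (panel index, edge index): hypothetical addressing


@dataclass
class Panel:
    """One panel P_i = (V_i, E_i, R_i)."""
    vertices: List[Vec2]          # V_i: 2D panel vertices
    edges: List[Tuple[int, int]]  # E_i: edges as pairs of vertex indices
    rotation: Vec3                # R_i, rotation part (assumed Euler angles)
    translation: Vec3             # R_i, translation part


@dataclass
class SewingPattern:
    panels: List[Panel] = field(default_factory=list)
    stitches: List[Tuple[EdgeRef, EdgeRef]] = field(default_factory=list)  # S

    def validate(self) -> bool:
        """Check that every edge and every stitch refers to something that exists."""
        for p in self.panels:
            n = len(p.vertices)
            if any(not (0 <= a < n and 0 <= b < n) for a, b in p.edges):
                return False
        for stitch in self.stitches:
            for panel_idx, edge_idx in stitch:
                if not 0 <= panel_idx < len(self.panels):
                    return False
                if not 0 <= edge_idx < len(self.panels[panel_idx].edges):
                    return False
        return True


def square_panel(tx: float) -> Panel:
    """Unit-square panel translated by tx along x (toy example)."""
    verts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    return Panel(verts, edges, (0.0, 0.0, 0.0), (tx, 0.0, 0.0))


# Two panels with the right edge of panel 0 stitched to the left edge of panel 1.
pattern = SewingPattern(
    panels=[square_panel(0.0), square_panel(2.0)],
    stitches=[((0, 1), (1, 3))],
)
```

Keeping the stitches as references into the panel list, rather than as copies of edge geometry, makes it easy to check a predicted pattern for dangling references before it is passed to the second stage.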

2. PatternMaker Module: Vision-Language Multimodal Pattern Synthesis

PatternMaker is a lightweight multimodal vision-language model based on the InternVL-3-2B architecture (~2B parameters). Its backbone combines InternVL's visual transformer and language branches:

  • Tokenization: Panel layouts, edge types, and stitching tags are serialized as discrete tokens following the AIpparel tokenization scheme.
  • Prediction heads:
    • Discrete: Next-token classification (e.g., panel count, edge topology, seam tags).
    • Continuous: Multilayer perceptron regressors predict vertex locations ($V_i$) and rigid transformation parameters ($R_i$).
  • Training objectives: Joint cross-entropy for token sequence prediction and L2 regression for continuous outputs.
  • Multimodal input handling: Visual content (images or sketches) is encoded via InternVL’s visual transformer; text prompts are encoded via its LLM branch, and the fused representation is autoregressively decoded into a sewing-pattern program (Pham et al., 19 Mar 2026).
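The joint training objective listed above can be illustrated with a framework-free sketch. The mean-squared form of the L2 term, the averaging over tokens, and the `weight` balancing factor are assumptions made for illustration; the source only states that cross-entropy and L2 regression are trained jointly.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def l2_loss(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean squared error over continuous outputs (vertices, placements)."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)


def joint_loss(token_logits, token_targets, cont_pred, cont_truth, weight=1.0):
    """Token cross-entropy averaged over the sequence plus weighted L2 regression.

    `weight` is a hypothetical balancing factor, not a value from the paper.
    """
    ce = sum(
        cross_entropy(logits, target)
        for logits, target in zip(token_logits, token_targets)
    ) / len(token_targets)
    return ce + weight * l2_loss(cont_pred, cont_truth)


# One token with uniform two-way logits and a perfect continuous prediction:
# the loss reduces to log(2) from the classification term alone.
example = joint_loss([[0.0, 0.0]], [0], [1.0, 2.0], [1.0, 2.0])
```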

3. GarmentSewer Module and Garment Geometry Image Representation

The second stage employs the GarmentSewer module, a Dense Prediction Transformer (DPT):

  • Encoder: ViT-L backbone with 24 transformer layers, pre-trained on ImageNet, processes semantic UV maps and optional stitching maps.
  • Positional encoding: Learnable 2D embeddings added to patch embeddings.
  • Decoder: Multiscale convolutional upsampling fuses hierarchical token features to reconstruct a geometry image.
  • Mapping: The network learns $M: [0,1]^2 \to \mathbb{R}^3$ such that $\hat{G}(u,v) \approx G(u,v) = (x(u,v), y(u,v), z(u,v))$ at each UV pixel.
  • Losses:
    • Edge-aware regression: $L_\text{reg} = \|G - \hat{G}\|_1$ over interiors plus $\alpha\|\cdot\|_1$ over edge bands.
    • Stitching loss (Chamfer): $L_\text{stitch} = (1/|S|)\sum_{(e_a, e_b)\in S} \text{CD}(\hat{G}_\text{edge}(e_a), \hat{G}_\text{edge}(e_b))$.
    • Normal smoothness regularizer: Promotes consistent normals.
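The two explicit loss terms above can be sketched in plain Python. The Chamfer variant (unsquared distances, mean per direction, summed over both directions), the per-pixel averaging, and the default `alpha` are assumptions; the source gives only the general form of each term.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""
    def one_way(src: List[Point], dst: List[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def stitch_loss(edge_points: Dict[str, List[Point]],
                stitches: List[Tuple[str, str]]) -> float:
    """L_stitch: Chamfer distance between predicted edge points, averaged over S."""
    return sum(
        chamfer(edge_points[ea], edge_points[eb]) for ea, eb in stitches
    ) / len(stitches)


def edge_aware_l1(pred: Sequence[Point], truth: Sequence[Point],
                  is_edge: Sequence[bool], alpha: float = 2.0) -> float:
    """L1 error per pixel, with edge-band pixels weighted by alpha."""
    total = 0.0
    for p, t, on_edge in zip(pred, truth, is_edge):
        l1 = sum(abs(x - y) for x, y in zip(p, t))
        total += alpha * l1 if on_edge else l1
    return total / len(pred)


# Two seam edges whose predicted points sit one unit apart.
edges = {"e_a": [(0.0, 0.0, 0.0)], "e_b": [(1.0, 0.0, 0.0)]}
gap = stitch_loss(edges, [("e_a", "e_b")])
```

A stitching term of this kind goes to zero only when the two sides of every seam land on the same 3D curve, which is what lets the later vertex-merging step close the mesh without large displacements.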

The GGI aligns all UV-mapped panels into a single square layout using bin-packing, ensuring consistent outward normals (Pham et al., 19 Mar 2026).
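As a rough illustration of packing panels into a single square UV layout, the sketch below uses a simple shelf heuristic on panel bounding boxes. The source does not say which bin-packing algorithm is used, so the heuristic, the function name `shelf_pack`, and the bounding-box simplification are all assumptions.

```python
from typing import List, Tuple


def shelf_pack(sizes: List[Tuple[float, float]], side: float) -> List[Tuple[float, float]]:
    """Place (width, height) boxes into a side x side square, shelf by shelf.

    Returns the lower-left corner of each box in input order.
    Raises ValueError if the boxes do not fit with this heuristic.
    """
    # Tallest boxes first, so each shelf wastes little vertical space.
    order = sorted(range(len(sizes)), key=lambda i: -sizes[i][1])
    positions: List[Tuple[float, float]] = [(0.0, 0.0)] * len(sizes)
    x = y = shelf_height = 0.0
    for i in order:
        w, h = sizes[i]
        if w > side:
            raise ValueError("box wider than the layout")
        if x + w > side:  # current shelf is full: open a new one above it
            y += shelf_height
            x, shelf_height = 0.0, 0.0
        if y + h > side:
            raise ValueError("boxes do not fit in the layout")
        positions[i] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
    return positions


# Three panel bounding boxes packed into the unit UV square.
layout = shelf_pack([(0.5, 0.5), (0.5, 0.5), (0.5, 0.25)], side=1.0)
```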

4. UV Parameterization, Postprocessing, and Dynamic Stitching

SwiftTailor reconstructs the 3D mesh through:

  • Inverse mapping: Each valid $(u,v)$ pixel in the predicted GGI is interpreted as a 3D point.
  • Remeshing algorithm:
    • For every $2\times2$ UV pixel cell: if three corners are occupied, create one triangle; if all four are occupied, split the cell along the diagonal with the shorter 3D length to form two triangles.
  • Dynamic stitching algorithm:
    • Seam edges are identified in the stitching map and grouped by seam ID.
    • Edge pairs are aligned via Dynamic Time Warping of UV pixel sequences.
    • Corresponding 3D vertices are merged using Disjoint Set Union, with final positions averaged across clusters.
    • Face indices are updated; degenerate faces are removed.
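The reconstruction steps above can be sketched end to end on a toy input. This is a minimal sketch under stated assumptions: the geometry image is a nested list with `None` for empty pixels, seam edges are given directly as ordered vertex-index lists (rather than extracted from a stitching map), and the DTW cost is the 3D distance between predicted points, which the source does not specify.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]


def remesh(grid: List[List[Optional[Point]]]):
    """Inverse mapping plus per-cell triangulation; grid[v][u] is a point or None."""
    index: Dict[Tuple[int, int], int] = {}
    verts: List[Point] = []
    for v, row in enumerate(grid):
        for u, p in enumerate(row):
            if p is not None:  # every valid pixel becomes a vertex
                index[(v, u)] = len(verts)
                verts.append(p)
    faces: List[Tuple[int, int, int]] = []
    for v in range(len(grid) - 1):
        for u in range(len(grid[0]) - 1):
            # Corners in cyclic order around the 2x2 cell.
            cell = [(v, u), (v, u + 1), (v + 1, u + 1), (v + 1, u)]
            ids = [index.get(c) for c in cell]
            present = [i for i in ids if i is not None]
            if len(present) == 3:
                faces.append((present[0], present[1], present[2]))
            elif len(present) == 4:
                a, b, c, d = present
                # Split along the shorter 3D diagonal.
                if math.dist(verts[a], verts[c]) <= math.dist(verts[b], verts[d]):
                    faces += [(a, b, c), (a, c, d)]
                else:
                    faces += [(a, b, d), (b, c, d)]
    return verts, faces


def dtw_pairs(a: List[Point], b: List[Point]) -> List[Tuple[int, int]]:
    """Dynamic Time Warping; returns the matched (i, j) index pairs in order."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    i, j, path = n, m, []
    while i > 0 and j > 0:  # backtrack along the cheapest predecessor
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return path[::-1]


def stitch(verts, faces, seam_a: List[int], seam_b: List[int]):
    """Merge DTW-matched seam vertices with a disjoint-set union."""
    parent = list(range(len(verts)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in dtw_pairs([verts[k] for k in seam_a], [verts[k] for k in seam_b]):
        parent[find(seam_a[i])] = find(seam_b[j])
    clusters: Dict[int, List[int]] = {}
    for k in range(len(verts)):
        clusters.setdefault(find(k), []).append(k)
    new_id: Dict[int, int] = {}
    new_verts: List[Point] = []
    for root, members in clusters.items():
        new_id[root] = len(new_verts)
        # Final position is the average over the merged cluster.
        new_verts.append(tuple(
            sum(verts[k][c] for k in members) / len(members) for c in range(3)
        ))
    new_faces = []
    for f in faces:
        g = tuple(new_id[find(k)] for k in f)
        if len(set(g)) == 3:  # drop faces that collapsed to an edge or a point
            new_faces.append(g)
    return new_verts, new_faces
```

For example, calling `remesh` on a fully occupied 2×2 grid yields four vertices and two triangles, and calling `stitch` on two triangles whose seam edges lie 0.2 units apart merges the four seam vertices into two, each placed midway between its original pair.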

This strategy replaces physics-based simulation (typically 30–60 s) with direct mesh assembly, giving a total pipeline runtime of ≈14.8 s (PatternMaker ≈10 s, GarmentSewer 0.02 s, remeshing 4.8 s, stitching <0.1 s), compared to ≈64 s for AIpparel+GarmentCode (Pham et al., 19 Mar 2026).
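The runtime figures can be cross-checked with a few lines of arithmetic. The per-stage values are the rounded numbers quoted above (with the stitching bound of 0.1 s used as an upper limit), and the totals of 14.78 s and 63.74 s are the ones reported in the evaluation; the variable names are illustrative only.

```python
# Rounded per-stage timings in seconds; stitching uses its upper bound.
stage_seconds = {
    "PatternMaker": 10.0,
    "GarmentSewer": 0.02,
    "remeshing": 4.8,
    "stitching": 0.1,
}

# Upper estimate from the rounded stages; close to the reported 14.78 s total.
total_upper = sum(stage_seconds.values())

# Speedup of the full pipeline over the AIpparel+GarmentCode baseline.
speedup = 63.74 / 14.78

# Share of the runtime spent in the pattern-generation stage.
patternmaker_share = stage_seconds["PatternMaker"] / total_upper
```

The rounded stages sum to roughly 14.9 s, consistent with the reported total, and the speedup works out to a little over 4×. About two thirds of the remaining time is spent in PatternMaker, so the autoregressive first stage, not mesh synthesis, dominates the cost.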

5. Empirical Evaluation

Experiments use the Multimodal GarmentCodeData benchmark:

  • PatternMaker performance (image/text inputs):
    • Vertex L2: 3.5 px / 1.5 px (vs AIpparel 4.8 / 2.5, ChatGarment 14.9 / 13.5).
    • #Panel Accuracy: 94.8% / 85.0% (vs AIpparel 93.7% / 82.9%).
    • #Edge Accuracy: 92.3% / 98.0% (vs AIpparel 79.0% / 88.2%).
    • Rotation L2: 0.006 / 0.002 (vs 0.007 / 0.002).
    • Translation L2: 1.9 / 1.3 cm (vs 2.5 / 1.7).
    • Stitch Accuracy: 85.1% / 97.8% (vs 73.0% / 86.3%).
  • Garment mesh generation:
    • Minimum Matching Distance (MMD): 5.31 (vs AIpparel+GarmentCode 6.94).
    • Coverage (COV): 0.68 (vs 0.52).
    • Avg. sampling attempts: 2.98 (vs 4.27).
  • Total inference time for full pipeline: ≈14.8 s (vs ≈63.7 s for baseline) (Pham et al., 19 Mar 2026).

| Metric | SwiftTailor | AIpparel+GarmentCode | ChatGarment |
|---|---|---|---|
| Vertex L2 (img/txt, px) | 3.5 / 1.5 | 4.8 / 2.5 | 14.9 / 13.5 |
| #Panel Acc (img/txt, %) | 94.8 / 85.0 | 93.7 / 82.9 | - |
| MMD | 5.31 | 6.94 | - |
| COV | 0.68 | 0.52 | - |
| Runtime (total, s) | 14.78 | 63.74 | - |
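The source does not define MMD and COV. The sketch below assumes the definitions commonly used for evaluating generative 3D shape models: MMD averages, over reference shapes, the distance to the closest generated shape, and COV is the fraction of reference shapes that are the nearest neighbour of at least one generated shape. The pairwise distance (for instance a Chamfer distance between meshes) is assumed to be precomputed.

```python
from typing import List, Tuple


def mmd_and_cov(dist: List[List[float]]) -> Tuple[float, float]:
    """Compute (MMD, COV) from dist[g][r], the distance between
    generated sample g and reference sample r."""
    n_gen, n_ref = len(dist), len(dist[0])
    # MMD: how closely each reference shape is approximated (lower is better).
    mmd = sum(
        min(dist[g][r] for g in range(n_gen)) for r in range(n_ref)
    ) / n_ref
    # COV: how many distinct reference shapes are "hit" (higher is better).
    matched = {
        min(range(n_ref), key=lambda r: dist[g][r]) for g in range(n_gen)
    }
    return mmd, len(matched) / n_ref


# Two generated and two reference shapes; both generated shapes are
# closest to reference 0, so only half of the references are covered.
mmd, cov = mmd_and_cov([[1.0, 4.0], [2.0, 3.0]])
```

Under these definitions the two metrics are complementary: a model that produces many near-copies of one garment can score a low MMD on that garment yet a poor COV.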

6. Design Implications, Modularity, and Interpretability

Each SwiftTailor module is lightweight: PatternMaker contains 30% of the parameters of prior VLMs, and GarmentSewer is a single feedforward transformer. Modularity permits GarmentSewer to be paired with alternative pattern generators. Similarly, intermediate representations such as sewing patterns and GGIs allow human inspection and manual editing at various levels.

A plausible implication is that this architecture increases the accessibility of interpretable 3D garment workflows for digital fashion, simulation, and virtual try-on applications. However, SwiftTailor presently produces smoothed surfaces lacking high-frequency wrinkles, suggesting an avenue for future lightweight physical or neural refinements.

7. Limitations and Future Directions

Noted limitations include:

  • Absence of fine surface wrinkles in generated meshes, which may be addressed via post-hoc neural or lightweight physical refinement atop the GGI initialization.
  • Reduced robustness to in-the-wild photographs exhibiting extreme occlusion or atypical garment silhouettes.
  • Lack of direct support for garment texture generation, material property assignment, or interactive user-facing editing.

Future research directions include enhancing surface detail, increasing robustness to diverse imagery, and extending the framework to handle additional garment attributes or interactive systems (Pham et al., 19 Mar 2026).
