DriVerse: Diversity in Garment Folding

Updated 16 May 2026
  • DriVerse here denotes the challenge of diverse, multi-category garment folding, which MetaFold addresses by disentangling high-level planning from low-level control.
  • MetaFold employs language-conditioned point cloud trajectory planning and a foundation model for contact synthesis to adapt to various garment types and user instructions.
  • MetaFold's modular design demonstrates improved folding accuracy and generalization, while highlighting areas for future work in sim-to-real transfer and multi-step manipulation.

DriVerse is not itself a framework or system, and the term does not occur in MetaFold's methods, metrics, or datasets. It plausibly refers to the underlying challenge of diverse multi-category garment manipulation, in particular the diversity of garment forms, user instructions, and semantic heterogeneity in point-cloud-based folding environments. The following exposition synthesizes MetaFold's approach to diversity handling, generalization, and robust garment folding, as evidenced in the literature (Chen et al., 11 Mar 2025).

1. Overview of Diversity in Robotic Garment Folding

Robotic garment folding requires handling broad variability in garment morphology (e.g., no-sleeve, short-sleeve, and long-sleeve shirts, and pants), fabric dynamics, and user intent as expressed in natural language. Previous approaches often tied folding actions to manually defined keypoints or pre-enumerated demonstration trajectories, leading to poor cross-category generalization and brittle instruction handling. Within this context, the DriVerse challenge is addressed by a blueprint for generalizable, language-conditioned folding that disentangles planning from control, permitting robust adaptation to diverse task parameters (Chen et al., 11 Mar 2025).

2. Modular System Design for Multi-Category Generalization

The MetaFold framework exemplifies solutions to the diversity problem in deformable object manipulation by separating high-level task planning from low-level action prediction:

  • Task Planning: Employs a CVAE-Transformer module trained to generate full point-cloud-based folding trajectories conditioned on RGB-D geometry and free-form natural language. This planning is invariant to garment category, enabling learned policies to scale across heterogeneous garment sets.
  • Action Prediction: Utilizes the ManiFoundation model for local contact synthesis, mapping adjacent point cloud states to concrete robot grasp points and approach vectors. Because this is decoupled from the planning problem, it generalizes across garment geometries absent task-specific heuristics.

This pipeline allows the system to accept arbitrary user instructions (e.g., "fold left sleeve," "fold pants in half") and convert them into category-robust manipulation sequences (Chen et al., 11 Mar 2025).
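The two-stage split described above can be sketched as a pair of interchangeable callables. This is a minimal illustration under stated assumptions, not MetaFold's code: the names `Grasp`, `fold`, `toy_planner`, and `toy_contact_model` are hypothetical, and the toy stand-ins replace the learned CVAE-Transformer and ManiFoundation models with trivial geometry.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
PointCloud = List[Point]


@dataclass
class Grasp:
    position: Point  # where the gripper contacts the garment
    approach: Point  # direction from which the gripper approaches


# Hypothetical stage signatures (assumptions for illustration only).
Planner = Callable[[PointCloud, str], List[PointCloud]]
ContactModel = Callable[[PointCloud, PointCloud], List[Grasp]]


def fold(cloud: PointCloud, instruction: str,
         planner: Planner, contact_model: ContactModel) -> List[List[Grasp]]:
    """Plan a point-cloud trajectory, then derive grasps per adjacent frame pair."""
    frames = [cloud] + planner(cloud, instruction)
    return [contact_model(a, b) for a, b in zip(frames, frames[1:])]


def toy_planner(cloud: PointCloud, instruction: str, steps: int = 3) -> List[PointCloud]:
    # Stand-in planner: shrink x-coordinates linearly to zero over `steps` frames.
    # The instruction is ignored here; a learned planner would condition on it.
    return [[(x * (1 - (k + 1) / steps), y, z) for x, y, z in cloud]
            for k in range(steps)]


def toy_contact_model(current: PointCloud, target: PointCloud) -> List[Grasp]:
    # Stand-in contact model: grasp the point that must move farthest, from above.
    def dist2(p: Point, q: Point) -> float:
        return sum((a - b) ** 2 for a, b in zip(p, q))

    idx = max(range(len(current)), key=lambda i: dist2(current[i], target[i]))
    return [Grasp(position=current[idx], approach=(0.0, 0.0, -1.0))]


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
actions = fold(cloud, "fold left sleeve", toy_planner, toy_contact_model)
```

Because `fold` only sees the two callables, either stage can be swapped without touching the other, which is the property the modular design relies on.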

3. Trajectory Generation via Language-Conditioned Point Clouds

MetaFold's approach to trajectory planning encapsulates diversity through its use of:

  • Rich spatial representation: PointNet++ networks extract per-vertex features $F_P \in \mathbb{R}^{N \times 128}$ for the current garment state.
  • Semantic conditioning: Language features $F_L$ are obtained by projecting mean-pooled LLaMA embeddings through an MLP.
  • CVAE-Transformer architecture: Both encoder and decoder are Transformer networks. The training objective $\mathcal{L}_{\text{traj}}$ combines reconstruction MSE and KL divergence to the Gaussian prior. At inference, the CVAE decodes sampled latent variables, geometry, and instruction into $M$-step folding trajectories in point cloud space.

This architecture enables context-dependent planning across diverse garment types and instruction phrasings (Chen et al., 11 Mar 2025).
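The shape of the training objective, a reconstruction MSE plus a KL term to a Gaussian prior, can be illustrated with a few lines of plain Python. This is a sketch under assumptions: the source does not state the relative weighting of the two terms, so `kl_weight` is a hypothetical parameter, the prior is assumed to be a standard normal, and the posterior is assumed to be a diagonal Gaussian parameterized by `mu` and `logvar`.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over flattened trajectory coordinates."""
    diffs = [(p - t) ** 2 for p, t in zip(pred, target)]
    return sum(diffs) / len(diffs)


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def traj_loss(pred: Sequence[float], target: Sequence[float],
              mu: Sequence[float], logvar: Sequence[float],
              kl_weight: float = 1.0) -> float:
    # kl_weight is an assumed hyperparameter, not a value from the paper.
    return mse(pred, target) + kl_weight * kl_to_standard_normal(mu, logvar)


# Toy values: one coordinate is off by 2, and the posterior mean is shifted by 1.
recon_only = traj_loss([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], mu=[0.0], logvar=[0.0])
with_kl = traj_loss([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], mu=[1.0], logvar=[0.0])
```

When the posterior already matches the prior (`mu = 0`, `logvar = 0`) the KL term vanishes and only the reconstruction error remains.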

4. Foundation Models for Contact Synthesis Across Object Types

Action prediction leverages a foundation model (ManiFoundation, labeled "MF"), trained and slightly fine-tuned for garment manipulation:

  • Contact Synthesis: Input is a pair of consecutive trajectory frames $(P_i, P_{i+1})$; output is a set of grasp positions and approach vectors.
  • Point-Flow Attention: MF uses a local movement field estimator to anticipate per-point displacements, supporting flexible adaptation to complex geometries.
  • Ensemble Inference: To increase robustness, 160 stochastic MF runs are performed and outputs are clustered; centroids define final grasp sets.

By decoupling control from planning and adopting foundation model abstractions, the system supports action policies applicable to novel garment configurations (Chen et al., 11 Mar 2025).
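The ensemble step, repeated stochastic predictions followed by clustering with centroids taken as the final grasps, can be sketched as follows. The clustering algorithm is not specified in the text above, so plain k-means with farthest-point initialization is an assumption here, as are the cluster count `k=2`, the noise level, and the hypothetical `noisy_contact_model` that stands in for one stochastic MF run.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def dist2(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> Point:
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points: List[Point], k: int, iters: int = 20) -> List[Point]:
    # Farthest-point initialization keeps the sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centers[j]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers


def noisy_contact_model(rng: random.Random) -> Point:
    # Hypothetical stand-in for one stochastic run: a grasp position scattered
    # around one of two plausible contact locations.
    base = rng.choice([(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)])
    return tuple(b + rng.gauss(0.0, 0.02) for b in base)


rng = random.Random(0)
samples = [noisy_contact_model(rng) for _ in range(160)]  # 160 runs, as in the text
grasps = sorted(kmeans(samples, k=2))
```

Averaging within each cluster suppresses per-run noise while keeping distinct grasp candidates separate, which is the robustness benefit the ensemble is meant to provide.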

5. Experimental Evaluation on Diverse Benchmarks

MetaFold was validated on datasets capturing significant diversity in garment structure:

| Dataset | Garment Categories | Trajectories | Zero-Shot? |
| --- | --- | --- | --- |
| MetaFold Dataset | 4 (no-sleeve, short, long, pants) | 3,376 | No |
| Cloth3D Benchmark | 4+ (novel clothing) | 500 | Yes |

Metrics substantiating robustness include rectangularity, area ratio, and overall success rate. On the MetaFold Dataset, MetaFold attains the highest rectangularity ($0.87$) and the lowest area ratio ($0.45$), outperforming baselines such as UniGarmentManip, GPT-Fabric, and DP3. On zero-shot Cloth3D, MetaFold achieves a $0.97$ success rate, indicating strong generalization beyond the training garment set (Chen et al., 11 Mar 2025).
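The exact metric definitions are not given here, so the following sketch uses common-sense assumptions: area ratio is taken as the folded footprint area divided by the initial footprint area (lower means a more compact fold), and rectangularity as the footprint area divided by the area of its bounding rectangle (higher means a neater fold). For simplicity the bounding rectangle is axis-aligned, which is a further simplification, and garment outlines are given as hypothetical 2D polygons.

```python
from typing import List, Tuple

Polygon = List[Tuple[float, float]]


def polygon_area(vertices: Polygon) -> float:
    """Shoelace formula for a simple polygon given in vertex order."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bbox_area(vertices: Polygon) -> float:
    """Area of the axis-aligned bounding box (assumed simplification)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def rectangularity(folded: Polygon) -> float:
    return polygon_area(folded) / bbox_area(folded)


def area_ratio(initial: Polygon, folded: Polygon) -> float:
    return polygon_area(folded) / polygon_area(initial)


initial = [(0, 0), (4, 0), (4, 2), (0, 2)]               # flat garment outline
neat_fold = [(0, 0), (2, 0), (2, 2), (0, 2)]             # folded exactly in half
messy_fold = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]  # L-shaped result

neat_scores = (rectangularity(neat_fold), area_ratio(initial, neat_fold))
messy_scores = (rectangularity(messy_fold), area_ratio(initial, messy_fold))
```

Under these assumed definitions the neat half-fold scores a rectangularity of 1.0 and an area ratio of 0.5, while the L-shaped result scores a rectangularity of 0.75.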

6. Ablations and Generalization Across Categories and Instructions

Ablation studies isolate the impact of architecture and diversity-handling:

  • Without ManiFoundation: Replacement by naïve point selection drops success from $0.86$ to $0.27$.
  • Without Closed-Loop Re-planning: Success also drops relative to the full system.
  • Frame Frequency: Best results are obtained with a 10-frame update interval; sparser or denser updates degrade performance.
  • Trajectory Horizon: Full $M$-step prediction significantly outperforms single-step forecasts.
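Why closed-loop re-planning matters can be shown with a toy one-dimensional simulation. Everything here is an assumption for illustration: the state is a single scalar, the "planner" is linear interpolation, and execution error is modeled as a constant `slip` factor that realizes only part of each commanded displacement. The toy only shows that feedback reduces accumulated error; it does not reproduce the reported degradation at overly dense update rates.

```python
from typing import List, Optional


def plan(state: float, goal: float, steps: int) -> List[float]:
    """Stand-in planner: evenly spaced waypoints from `state` to `goal`."""
    return [state + (goal - state) * (k + 1) / steps for k in range(steps)]


def run(start: float, goal: float, horizon: int,
        replan_every: Optional[int] = None, slip: float = 0.8) -> float:
    """Execute planned displacements; optionally re-plan from the observed state."""
    state = start
    planned_prev = start
    waypoints = plan(start, goal, horizon)
    for step in range(horizon):
        if replan_every and step > 0 and step % replan_every == 0:
            # Closed loop: rebuild the remaining waypoints from where we really are.
            waypoints[step:] = plan(state, goal, horizon - step)
            planned_prev = state
        delta = waypoints[step] - planned_prev  # commanded displacement
        planned_prev = waypoints[step]
        state += slip * delta                   # only part of the motion is realized
    return state


goal = 1.0
open_loop = run(0.0, goal, horizon=20)
closed_loop = run(0.0, goal, horizon=20, replan_every=10)
```

In the open-loop run the shortfall from every step accumulates unchecked, whereas each re-plan in the closed-loop run folds the observed shortfall back into the remaining commands.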

Language generalization tests demonstrate high success on both seen and unseen instructions, whereas prior models such as L.D. degrade on unseen commands. This suggests that explicit semantic conditioning is critical for cross-instruction transfer (Chen et al., 11 Mar 2025).

7. Limitations and Future Prospects

Current limitations include:

  • Simulation-to-Real Gap: Physical hardware performance is occasionally limited by unmodeled friction and dynamic effects.
  • Instructional Scope: Only simple one-stage folds are evaluated. Nested or multi-step manipulations (collars, pockets) are unaddressed.
  • Computational Overhead: CVAE-Transformer and MF ensemble introduce latency.

Future directions outlined by Chen et al. (11 Mar 2025) include deploying tactile sensing, supporting hierarchical language parsing, scaling trajectory generation via diffusion models or normalizing flows, and introducing real-robot reinforcement learning to further improve sim-to-real transfer.

In summary, the methodology pioneered in MetaFold addresses the DriVerse challenge of deformable object manipulation by modularizing trajectory planning and action prediction. This enables state-of-the-art generalization across garment categories and natural language directives, setting a benchmark for point-cloud-based manipulation of diverse deformable objects (Chen et al., 11 Mar 2025).
