MetaFold: Robotic Garment Folding Framework
- MetaFold is a modular robotic manipulation framework that decouples high-level language-guided trajectory planning from low-level contact synthesis for garment folding.
- It leverages a Conditional Variational Autoencoder and transformer-based encoder-decoder to generate diverse point cloud trajectories informed by user instructions.
- Empirical evaluations in simulation and on a real robot show higher success rates than prior methods in every garment category, competitive rectangularity, and generalization to unseen garments and instructions.
MetaFold is a modular robotic manipulation framework for garment folding that disentangles task planning from action prediction, enabling robust multi-category folding and language-guided operation. The system leverages point cloud trajectory generation informed by language instructions for high-level planning and employs a low-level foundation model (ManiFoundation) for action synthesis. This architecture facilitates generalization across garment categories, user instructions, and unseen garment instances, achieving state-of-the-art performance in both simulated and real-robot scenarios (Chen et al., 11 Mar 2025).
1. Architectural Decomposition and System Data Flow
MetaFold divides the garment folding process into two independently optimized modules:
- High-level planning: Language-guided point cloud trajectory generation determines the sequence of target garment configurations.
- Low-level control: Robot actions are generated by ManiFoundation, a foundation model for contact synthesis operating on pairs of consecutive point clouds.
The data flow is as follows:
| Stage | Input | Output |
|---|---|---|
| Perception (real world) | RGB-D image | Segmented, downsampled point cloud |
| Perception (simulation) | Mesh vertices | Downsampled point cloud |
| High-level planner | Current point cloud, language instruction | Point cloud trajectory |
| Low-level controller | Pair of consecutive point clouds from the trajectory | Contact actions |
| Execution and feedback | - | New garment state, closed-loop replanning |
Closed-loop feedback enables replanning after each low-level execution until the garment state approximately matches the goal configuration.
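The replanning loop described above can be sketched as follows. This is a minimal illustration, not MetaFold's implementation: `observe`, `plan`, and `act` are hypothetical callables standing in for perception, the high-level planner, and the low-level controller, and the stopping test (mean point distance below `tol`) and the convention that `plan` returns only future frames are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
PointCloud = List[Point]


def mean_point_distance(a: PointCloud, b: PointCloud) -> float:
    """Average Euclidean distance between index-matched points (equal lengths assumed)."""
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(a, b):
        total += ((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2) ** 0.5
    return total / len(a)


def closed_loop_fold(
    observe: Callable[[], PointCloud],
    plan: Callable[[PointCloud, str], Sequence[PointCloud]],
    act: Callable[[PointCloud, PointCloud], None],
    instruction: str,
    tol: float = 0.01,      # hypothetical convergence tolerance
    max_steps: int = 20,    # hypothetical safety cap
) -> int:
    """Replan after every low-level execution; return the number of actions taken."""
    for step in range(max_steps):
        state = observe()
        # Assumed convention: the planner returns future frames, goal last.
        trajectory = plan(state, instruction)
        goal = trajectory[-1]
        if mean_point_distance(state, goal) <= tol:
            return step
        # The controller receives the current cloud and the next planned frame.
        act(state, trajectory[0])
    return max_steps
```

Because the planner is called again from each newly observed state, an imperfect low-level execution changes the input to the next plan rather than accumulating unnoticed.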
2. Language-Guided Trajectory Generation
2.1 Input Encoding
Spatial and semantic instruction representations are constructed as follows:
- Point clouds: the downsampled garment point cloud (from RGB-D with SAM2 segmentation, or from simulation).
- Language: the instruction (e.g., "fold left sleeve over back") is embedded with a LLaMA Instruct model, mean-pooled, and projected through an MLP to a language feature vector.
Spatial features are extracted from the point cloud with PointNet++.
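The language branch (mean-pooling followed by an MLP projection) can be illustrated with a small pure-Python sketch. The token embeddings would come from the LLaMA Instruct model; here they are plain lists of floats. The two-layer MLP with a ReLU, and all function and parameter names, are assumptions made for illustration, since the source does not specify the MLP's depth or sizes.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Average per-token embeddings into one instruction-level vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Compute W x + b, with weight stored as one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def project_instruction(
    token_embeddings: List[Vector],
    w1: List[Vector], b1: Vector,
    w2: List[Vector], b2: Vector,
) -> Vector:
    """Mean-pool, then apply a two-layer MLP (layer count is an assumption)."""
    pooled = mean_pool(token_embeddings)
    return linear(relu(linear(pooled, w1, b1)), w2, b2)
```

Mean-pooling makes the language feature independent of instruction length, so the planner always receives a fixed-size vector.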
2.2 Conditional Variational Trajectory Model
The core of the planner is a Conditional Variational Autoencoder (CVAE) with transformer-based encoder and decoder:
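- Encoder: approximates the posterior over the latent variable, conditioned on the point cloud and language features and, during training, on the ground-truth trajectory.
- Latent variable: a latent code $z$ that captures the diversity of plausible fold trajectories.
- Decoder: maps the latent code and the conditioning features to a predicted trajectory.
During inference, the latent code is sampled from the prior.
The decoder's trajectory features are then projected to the predicted point cloud trajectory.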
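Sampling at inference time can be sketched as below. The sketch assumes a standard normal prior, which is the usual choice for a CVAE but is not confirmed by the text here; `decode` is a hypothetical stand-in for the trained transformer decoder, and `latent_dim` and `num_samples` are illustrative parameters.

```python
import random
from typing import Callable, List, Sequence

Latent = List[float]


def sample_trajectories(
    decode: Callable[[Latent, Sequence[float]], object],
    condition: Sequence[float],
    latent_dim: int,
    num_samples: int,
    seed: int = 0,
) -> list:
    """Draw latents from N(0, I) and decode each into a candidate trajectory."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(num_samples):
        # Assumed prior: independent standard normal per latent dimension.
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        trajectories.append(decode(z, condition))
    return trajectories
```

Different draws of the latent code yield different candidate trajectories for the same point cloud and instruction, which is how a CVAE expresses multiple plausible folds.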
2.3 Optimization
Training maximizes the evidence lower bound (ELBO), which combines a per-frame L2 trajectory reconstruction term with KL regularization of the latent space.
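An objective of this kind is commonly written as a loss to be minimized, which is equivalent to maximizing the ELBO up to constants and the relative scaling of the two terms. The form below is a sketch in standard CVAE notation, not the paper's exact expression. All symbols are assumptions introduced here: $P_t$ and $\hat{P}_t$ are the ground-truth and predicted point clouds at frame $t$, $\tau = (P_1, \dots, P_T)$ is the ground-truth trajectory, $c$ is the conditioning (point cloud and language features), $q_\phi$ is the encoder, $p(z)$ is the prior, and $\beta$ is a hypothetical weighting coefficient.

```latex
\mathcal{L}
  = \sum_{t=1}^{T} \bigl\lVert \hat{P}_t - P_t \bigr\rVert_2^2
  + \beta \, D_{\mathrm{KL}}\!\bigl( q_\phi(z \mid \tau, c) \,\Vert\, p(z) \bigr)
```

The first term is the per-frame L2 reconstruction error; the second keeps the encoder's posterior close to the prior so that latents sampled from the prior at inference time decode to plausible trajectories.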
3. Low-Level Foundation Model for Robotic Action
3.1 ManiFoundation Model
Action prediction is performed by ManiFoundation, a foundation model for contact synthesis:
- Input: pairs of successive point clouds and the point flow computed between them.
- Output: contact locations (grasp points) and motion vectors (direction and distance).
3.2 Fine-Tuning and Ensemble Prediction
- Dataset: Simulated folding episodes with known contact labels.
- Loss: a supervised loss is computed for each contact against its known label.
- Ensembling: ManiFoundation is sampled with 160 seeds per step. The outputs are clustered using a small distance threshold, and the final action is the contact nearest the cluster mean.
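The ensembling step can be sketched as a threshold clustering followed by a nearest-to-mean selection. The source does not say which clustering algorithm is used or how a cluster is chosen when several form, so the greedy assignment and the choice of the largest cluster below are assumptions, as are the names `select_action` and `eps`. Each candidate is a flat tuple, for example a contact position concatenated with its motion vector.

```python
import math
from typing import List, Sequence, Tuple

Action = Tuple[float, ...]  # e.g. contact position concatenated with motion vector


def _mean(actions: Sequence[Action]) -> Action:
    n = len(actions)
    return tuple(sum(a[d] for a in actions) / n for d in range(len(actions[0])))


def select_action(candidates: Sequence[Action], eps: float) -> Action:
    """Greedy threshold clustering, then the member nearest the largest cluster's mean."""
    clusters: List[List[Action]] = []
    for cand in candidates:
        for members in clusters:
            # Join the first cluster whose current mean lies within eps.
            if math.dist(cand, _mean(members)) <= eps:
                members.append(cand)
                break
        else:
            clusters.append([cand])
    largest = max(clusters, key=len)  # assumption: majority cluster wins
    center = _mean(largest)
    return min(largest, key=lambda a: math.dist(a, center))
```

In the setting described above, `candidates` would hold the 160 outputs sampled for one step. Returning an actual sampled contact, rather than the mean itself, guarantees that the chosen action is one the model really proposed.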
3.3 Interface
The predicted trajectory is decomposed into consecutive point cloud pairs, each of which is passed to ManiFoundation to yield the next robot action, closing the loop between the high-level and low-level modules.
4. Empirical Evaluation and Comparative Performance
4.1 Datasets
- MetaFold dataset: 1,210 meshes; 3,376 folding trials (2,664 train, 712 test); categories: no-sleeve, short-sleeve, long-sleeve, pants.
- Zero-shot: 500 previously unseen Cloth3D garments.
4.2 Metrics
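- Rectangularity: ratio of the folded garment's area to the area of its bounding rectangle (higher is better).
- Area Ratio: ratio of the folded garment's area to its initial area (lower means a more compact fold).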
- Success Rate: Fraction achieving thresholds on above metrics.
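A minimal sketch of how such metrics could be computed from top-down binary masks follows. It assumes the common definitions (rectangularity as garment area over bounding-rectangle area, area ratio as folded area over initial area) and uses an axis-aligned bounding box for simplicity; the actual evaluation may use a different bounding rectangle. The success thresholds are not given in the text, so they are left as caller-supplied, hypothetical parameters.

```python
from typing import List

Mask = List[List[int]]  # 1 where the top-down view shows garment, else 0


def area(mask: Mask) -> int:
    return sum(sum(row) for row in mask)


def rectangularity(mask: Mask) -> float:
    """Garment area divided by the area of its axis-aligned bounding box."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    box = (rows[-1] - rows[0] + 1) * (cols[-1] - cols[0] + 1)
    return area(mask) / box


def area_ratio(initial: Mask, folded: Mask) -> float:
    """Folded garment area divided by its initial area."""
    return area(folded) / area(initial)


def is_success(initial: Mask, folded: Mask,
               min_rect: float, max_area_ratio: float) -> bool:
    """Both thresholds are hypothetical; the source does not state their values."""
    return (rectangularity(folded) >= min_rect
            and area_ratio(initial, folded) <= max_area_ratio)
```

The two metrics are complementary: a garment left flat and unfolded can score well on rectangularity but not on area ratio, while a crumpled garment can be compact without being rectangular.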
4.3 Baselines
- UniGarmentManip: Dense visual correspondence.
- GPT-Fabric: LLM + keypoint detection policy.
- 3D Diffusion Policy: End-to-end action model.
- Deng et al.: Language-guided deformable manipulation.
4.4 Results
| Category | Rectangularity (MetaFold) | Rectangularity (UniGarmentManip) | Rectangularity (3D Diffusion Policy) | Success rate (MetaFold) | Success rate (UniGarmentManip) | Zero-shot success rate (MetaFold) |
|---|---|---|---|---|---|---|
| No-sleeve | 0.87 | 0.85 | 0.85 | 0.97 | 0.90 | 0.97 |
| Short-sleeve | 0.83 | 0.78 | 0.82 | 0.88 | 0.71 | 0.88 |
| Long-sleeve | 0.85 | 0.88 | 0.86 | 0.90 | 0.86 | 0.93 |
| Pants | 0.86 | 0.81 | 0.88 | 0.96 | 0.84 | 0.79 |
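MetaFold achieves a higher success rate than UniGarmentManip in every garment category. Its rectangularity is the highest on no-sleeve and short-sleeve garments, while UniGarmentManip (0.88) and 3D Diffusion Policy (DP3, 0.86) score slightly higher on long-sleeve garments (MetaFold: 0.85), and DP3 scores higher on pants (0.88 versus 0.86). On previously unseen Cloth3D garments, MetaFold’s zero-shot success rates range from 0.79 to 0.97. Under language generalization (unseen instructions), its success rates remain higher than those of prior work.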
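The size of the success-rate gap can be read directly off the table. The short script below copies the MetaFold and UniGarmentManip success rates from the table and computes the per-category margin and the unweighted mean. The mean treats each category equally rather than weighting by the number of test trials, which the table does not report per category.

```python
# Success rates copied from the results table.
metafold = {"No-sleeve": 0.97, "Short-sleeve": 0.88, "Long-sleeve": 0.90, "Pants": 0.96}
unigarment = {"No-sleeve": 0.90, "Short-sleeve": 0.71, "Long-sleeve": 0.86, "Pants": 0.84}

# Per-category margin (MetaFold minus UniGarmentManip), rounded to table precision.
margins = {cat: round(metafold[cat] - unigarment[cat], 2) for cat in metafold}

# Unweighted means across the four categories.
mean_metafold = round(sum(metafold.values()) / len(metafold), 4)
mean_unigarment = round(sum(unigarment.values()) / len(unigarment), 4)

print(margins)
print(mean_metafold, mean_unigarment)
```

The margins run from 0.04 (long-sleeve) to 0.17 (short-sleeve), and the unweighted means are 0.9275 for MetaFold versus 0.8275 for UniGarmentManip.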
5. Ablation Studies and Qualitative Observations
Ablation experiments confirm the necessity of architectural components:
- Ours w/o ManiFoundation: Naïve contact selection reduces success from 0.86 to 0.27.
- Ours w/o closed-loop: Open-loop execution lowers success to around 0.07.
- Reduced frequency planning (5 or 15 frames): Moderate decline in rectangularity/success.
- Next-step only (no full trajectory): Success drops to 0.41.
Qualitative analyses (visualizations of ground-truth and generated folding trajectories) demonstrate coherent behavioral adaptation to different language instructions. Real-robot experiments with an xArm6 and RealSense D435 validate trajectory tracking and generalization across garment categories.
6. Generalization, Constraints, and Prospects
MetaFold’s use of raw point clouds (eschewing keypoints or garment templates) allows a single model to fold T-shirts, tank-tops, and pants. The CVAE and LLaMA Instruct encoder provide open-vocabulary instruction grounding, mapping user utterances to canonical fold actions.
Identified constraints include sensitivity to RGB-D segmentation errors, limitation to per-subtask planning (requiring pre-partitioned garment parts), and the stochasticity of ManiFoundation’s sampling (necessitating ensembling). Future research directions include:
- Integrating part-aware segmentation for autonomous decomposition of multi-stage folding.
- End-to-end fine-tuning to couple the high- and low-level modules.
- Exploiting stronger vision-language models for free-form instruction following.
- Domain adaptation for sim-to-real transfer without heavy real-robot data requirements.
MetaFold demonstrates that modular disentanglement, which separates language-conditioned trajectory planning from continuous robot contact optimization, enables strong generalization and robust real-world performance on complex nonrigid manipulation tasks (Chen et al., 11 Mar 2025).