MetaFold: Robotic Garment Folding Framework

Updated 16 May 2026

MetaFold is a modular robotic manipulation framework that decouples high-level language-guided trajectory planning from low-level contact synthesis for garment folding.
It leverages a Conditional Variational Autoencoder and transformer-based encoder-decoder to generate diverse point cloud trajectories informed by user instructions.
Empirical evaluations in simulation and real-robot scenarios demonstrate superior rectangularity, success rates, and generalization across various garment types.

MetaFold is a modular robotic manipulation framework for garment folding that disentangles task planning from action prediction, enabling robust multi-category folding and language-guided operation. The system leverages point cloud trajectory generation informed by language instructions for high-level planning and employs a low-level foundation model (ManiFoundation) for action synthesis. This architecture facilitates generalization across garment categories, user instructions, and unseen garment instances, achieving state-of-the-art performance in both simulated and real-robot scenarios (Chen et al., 11 Mar 2025).

1. Architectural Decomposition and System Data Flow

MetaFold divides the garment folding process into two independently optimized modules:

High-level planning: Language-guided point cloud trajectory generation determines the sequence of target garment configurations.
Low-level control: Robot actions are generated by ManiFoundation, a foundation model for contact synthesis operating on pairs of consecutive point clouds.

The data flow is as follows:

Stage	Input	Output
Perception (real world)	RGB-D image	Segmented, downsampled point cloud $\mathcal{P}\in\mathbb{R}^{N\times 3}$
Perception (simulation)	Mesh vertices	Downsampled point cloud
High-level planner	$\mathcal{P}$, $\mathcal{L}$ (language)	Trajectory $\mathcal{T} = \{\mathcal{P}_1,\dots,\mathcal{P}_M\}$
Low-level controller	$(\mathcal{P}_t, \mathcal{P}_{t+1})$	Contact actions $\boldsymbol{a} = \{(\boldsymbol{p}_i, \boldsymbol{s}_i)\}$
Execution and feedback	-	New state $\mathcal{P}'$, closed-loop replanning

Closed-loop feedback enables replanning after each low-level execution until the garment state $\mathcal{P}$ is approximately equal to the goal configuration $\mathcal{P}_{\text{goal}}$.
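The replanning loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `plan`, `act`, and `execute`, the Chamfer-style stopping test, the tolerance `tol`, and the choice to treat `trajectory[0]` as the next target are all assumptions made for the example.

```python
import math

def chamfer_distance(a, b):
    """Symmetric mean nearest-neighbour distance between two point lists."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))

def closed_loop_fold(state, goal, plan, act, execute, tol=1e-2, max_steps=20):
    """Replan after every executed action until the state is close to the goal.

    plan(state)         -> list of target point clouds (hypothetical planner)
    act(state, target)  -> action for the pair (current, next target)
    execute(state, a)   -> newly observed state after running the action
    Returns the final state and the number of actions executed.
    """
    for step in range(max_steps):
        if chamfer_distance(state, goal) <= tol:
            return state, step
        trajectory = plan(state)            # high-level planning from the current observation
        action = act(state, trajectory[0])  # low-level action toward the first target
        state = execute(state, action)      # observe the result, then loop and replan
    return state, max_steps
```

Because the planner is called again on every observed state, an imperfect action only costs extra iterations instead of invalidating the rest of a precomputed plan.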

2. Language-Guided Trajectory Generation

2.1 Input Encoding

Spatial and semantic instruction representations are constructed as follows:

Point clouds: Downsampled $\mathcal{P}\in\mathbb{R}^{N\times 3}$ (from RGB-D/SAM2 or simulation).
Language: Instruction $\mathcal{L}$ (e.g., "fold left sleeve over back") is embedded with a LLaMA Instruct model, mean-pooled, and projected to a language feature vector through an MLP.

Spatial features are extracted from $\mathcal{P}$ with PointNet++ to obtain a point-cloud feature representation.
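The mean-pooling and projection step for the language branch can be illustrated with a few lines of plain Python. This is only a sketch: the token embeddings are passed in as lists of floats, and a single dense layer with ReLU stands in for the projection MLP, whose actual depth and width are not given in the source.

```python
def mean_pool(token_embeddings):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

def dense_relu(x, weights, bias):
    """One dense layer followed by ReLU (hypothetical stand-in for the MLP)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

# Two toy 2-dimensional "token embeddings" for a short instruction.
tokens = [[1.0, 2.0], [3.0, 4.0]]
pooled = mean_pool(tokens)
language_feature = dense_relu(pooled, weights=[[1.0, 0.0], [0.0, -1.0]], bias=[0.0, 0.0])
```

Mean pooling makes the language feature independent of instruction length, which is what lets a fixed-size MLP consume it.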

2.2 Conditional Variational Trajectory Model

The core of the planner is a Conditional Variational Autoencoder (CVAE) with transformer-based encoder and decoder:

Encoder: approximates the posterior over the latent variable, conditioned on the point-cloud and language features and, at training time, on the ground-truth trajectory.
Latent variable: encodes diverse plausible fold trajectories.
Decoder: maps the sampled latent and the conditioning features to a predicted trajectory.

During inference, the ground-truth trajectory is unavailable, so the latent is sampled from the prior.

The decoded trajectory features are then projected to the predicted point cloud trajectory.
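The difference between training-time and inference-time sampling in a CVAE can be sketched as below. The standard-normal prior, the diagonal-Gaussian posterior parameterized by `mu` and `log_var`, and both function names are assumptions taken from the usual CVAE formulation, not details confirmed by the source.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Training-time sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def sample_prior(latent_dim, rng):
    """Inference-time sample from an assumed standard-normal prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

rng = random.Random(0)
z_train = reparameterize([1.0, -2.0], [-100.0, -100.0], rng)  # near-zero variance
z_a = sample_prior(8, rng)
z_b = sample_prior(8, rng)  # a second draw would decode to a different trajectory
```

Drawing several latents for the same point cloud and instruction is what yields multiple plausible fold trajectories from a single trained decoder.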

2.3 Optimization

Training maximizes the evidence lower bound (ELBO), which combines a per-frame L2 trajectory reconstruction term with KL regularization of the latent space.
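In generic CVAE notation, the objective described above can be written out as follows. The symbols here are assumptions introduced for illustration and are not necessarily the paper's: $z$ is the latent variable, $c$ collects the point-cloud and language features, $\hat{\mathcal{P}}_t$ is the predicted frame, and $\beta$ is a relative weight between the two terms.

```latex
% Evidence lower bound for a conditional VAE over trajectories
\log p_\theta(\mathcal{T} \mid c)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid \mathcal{T}, c)}\big[\log p_\theta(\mathcal{T} \mid z, c)\big]
  \;-\;
  D_{\mathrm{KL}}\big(q_\phi(z \mid \mathcal{T}, c)\,\|\,p(z)\big)

% With a fixed-variance Gaussian decoder, maximizing the bound is equivalent
% (up to constants and the weight \beta) to minimizing
\mathcal{L}_{\text{train}}
  \;=\;
  \sum_{t=1}^{M} \big\lVert \hat{\mathcal{P}}_t - \mathcal{P}_t \big\rVert_2^2
  \;+\;
  \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid \mathcal{T}, c)\,\|\,p(z)\big)
```

The first term is the per-frame L2 reconstruction error over the $M$ frames of the trajectory, and the second keeps the learned posterior close to the prior so that prior samples decode sensibly at inference.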

3. Low-Level Foundation Model for Robotic Action

3.1 ManiFoundation Model

Action prediction is performed by ManiFoundation, a foundation model for contact synthesis:

Input: Pairs of successive point clouds $(\mathcal{P}_t, \mathcal{P}_{t+1})$ and their computed flow.
Output: Contact locations and motion vectors $\boldsymbol{a} = \{(\boldsymbol{p}_i, \boldsymbol{s}_i)\}$, with $\boldsymbol{p}_i$ the grasp point and $\boldsymbol{s}_i$ the motion direction and distance.

3.2 Fine-Tuning and Ensemble Prediction

Dataset: Simulated folding episodes with known contact labels.
Loss: computed for each contact against the simulated contact labels.
Ensembling: ManiFoundation is sampled with 160 seeds per step. Outputs lying within a small distance threshold of one another are clustered, and the final action is the contact nearest the cluster mean.
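One way to realize the ensembling step is sketched below. The source only states that sampled outputs are clustered within a small threshold and that the contact nearest the cluster mean is chosen; the greedy seed-based clustering, the choice of the largest cluster, and the name `select_contact` are assumptions made for this example.

```python
import math

def select_contact(candidates, eps):
    """Greedy threshold clustering, then the member nearest the cluster mean.

    candidates: list of 3D contact points, one per sampled seed
    eps:        distance threshold for joining an existing cluster (assumed)
    """
    clusters = []
    for p in candidates:
        for cluster in clusters:
            if math.dist(p, cluster[0]) <= eps:  # compare with the cluster's seed point
                cluster.append(p)
                break
        else:
            clusters.append([p])                 # no cluster close enough: start a new one
    largest = max(clusters, key=len)             # assumed: the most populated cluster wins
    mean = tuple(sum(axis) / len(largest) for axis in zip(*largest))
    return min(largest, key=lambda p: math.dist(p, mean))

samples = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.0, 0.0), (1.0, 1.0, 1.0)]
chosen = select_contact(samples, eps=0.05)
```

Returning an actual sampled contact, not the mean itself, guarantees that the selected point is one the model proposed, which matters if the mean would fall off the garment surface.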

3.3 Interface

The predicted trajectory $\mathcal{T}$ is decomposed into consecutive pairs $(\mathcal{P}_t, \mathcal{P}_{t+1})$, each passed to ManiFoundation to yield the next robot action, closing the loop between the high-level and low-level modules.
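The interface between the two modules can be illustrated with a short sketch that splits a trajectory into consecutive pairs and computes a per-point flow for each pair. It assumes that points keep the same index from frame to frame, so flow is a simple per-index difference; the function names are hypothetical.

```python
def consecutive_pairs(trajectory):
    """Turn [P_1, ..., P_M] into [(P_1, P_2), ..., (P_{M-1}, P_M)]."""
    return list(zip(trajectory[:-1], trajectory[1:]))

def point_flow(p_t, p_next):
    """Per-point displacement, assuming index-wise point correspondence."""
    return [tuple(b - a for a, b in zip(p, q)) for p, q in zip(p_t, p_next)]

# A toy trajectory with three frames of two points each.
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)],
    [(0.0, 0.0, 0.0), (0.5, 0.0, 0.5)],
]
pairs = consecutive_pairs(trajectory)
flows = [point_flow(a, b) for a, b in pairs]
```

Each `(pair, flow)` combination is what a low-level contact model would consume to propose one action.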

4. Empirical Evaluation and Comparative Performance

4.1 Datasets

MetaFold dataset: 1,210 meshes; 3,376 folding trials (2,664 train, 712 test); categories: no-sleeve, short-sleeve, long-sleeve, pants.
Zero-shot: 500 previously unseen Cloth3D garments.

4.2 Metrics

Rectangularity: ratio of the folded garment's area to the area of its bounding rectangle (higher is better).
Area Ratio: ratio of the folded garment's area to its initial area (lower is more compact).
Success Rate: Fraction of trials that meet the thresholds on both metrics above.
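The metrics can be made concrete on binary occupancy masks. This sketch reads rectangularity as garment area over bounding-rectangle area and area ratio as folded area over initial area; the axis-aligned bounding box and the threshold values passed to `is_success` are assumptions, since the source does not specify them.

```python
def mask_area(mask):
    """Number of occupied cells in a binary 2D mask."""
    return sum(sum(1 for cell in row if cell) for row in mask)

def rectangularity(mask):
    """Occupied area divided by the axis-aligned bounding-box area.

    Assumes the mask contains at least one occupied cell.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    box = (rows[-1] - rows[0] + 1) * (cols[-1] - cols[0] + 1)
    return mask_area(mask) / box

def area_ratio(folded_mask, initial_mask):
    """Folded area relative to the initial, unfolded area."""
    return mask_area(folded_mask) / mask_area(initial_mask)

def is_success(rect, ratio, rect_min, ratio_max):
    """A trial succeeds if both metrics clear their (assumed) thresholds."""
    return rect >= rect_min and ratio <= ratio_max

initial = [[1] * 4 for _ in range(4)]
folded = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 0]]
rect = rectangularity(folded)
ratio = area_ratio(folded, initial)
```

In this toy case the folded shape fills three of the four cells in its bounding box and covers 3 of the original 16 cells.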

4.3 Baselines

UniGarmentManip: Dense visual correspondence.
GPT-Fabric: LLM + keypoint detection policy.
3D Diffusion Policy: End-to-end action model.
Deng et al.: Language-guided deformable manipulation.

4.4 Results

Category	Rectangularity (MetaFold)	Rectangularity (UniGarmentManip)	Rectangularity (DP3)	Success Rate (MetaFold)	Success Rate (UniGarmentManip)	Zero-Shot Success (MetaFold)
No-sleeve	0.87	0.85	0.85	0.97	0.90	0.97
Short-sleeve	0.83	0.78	0.82	0.88	0.71	0.88
Long-sleeve	0.85	0.88	0.86	0.90	0.86	0.93
Pants	0.86	0.81	0.88	0.96	0.84	0.79

MetaFold achieves a higher success rate than UniGarmentManip in every category and the highest rectangularity on no-sleeve and short-sleeve garments. On long-sleeve garments, UniGarmentManip (0.88) and DP3 (0.86) slightly exceed its rectangularity of 0.85, and on pants DP3 reaches 0.88 against 0.86. For previously unseen garments (Cloth3D), per-category success rates range from 0.79 to 0.97. In language generalization (unseen instructions), success rates remain higher than those of prior work.
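Reading the results table directly, a short check shows where each method leads. The numbers below are copied from the table; the dictionary layout and helper name are just one convenient way to organize them.

```python
results = {
    "No-sleeve":    {"rect": {"MetaFold": 0.87, "UniG": 0.85, "DP3": 0.85},
                     "succ": {"MetaFold": 0.97, "UniG": 0.90}},
    "Short-sleeve": {"rect": {"MetaFold": 0.83, "UniG": 0.78, "DP3": 0.82},
                     "succ": {"MetaFold": 0.88, "UniG": 0.71}},
    "Long-sleeve":  {"rect": {"MetaFold": 0.85, "UniG": 0.88, "DP3": 0.86},
                     "succ": {"MetaFold": 0.90, "UniG": 0.86}},
    "Pants":        {"rect": {"MetaFold": 0.86, "UniG": 0.81, "DP3": 0.88},
                     "succ": {"MetaFold": 0.96, "UniG": 0.84}},
}

def best_methods(scores):
    """Names of the methods that share the top score."""
    top = max(scores.values())
    return sorted(name for name, value in scores.items() if value == top)

# Categories where MetaFold does not have the top rectangularity.
rect_not_best = [cat for cat, r in results.items()
                 if "MetaFold" not in best_methods(r["rect"])]

# Success-rate margin of MetaFold over UniGarmentManip, per category.
succ_margin = {cat: round(r["succ"]["MetaFold"] - r["succ"]["UniG"], 2)
               for cat, r in results.items()}
```

On these numbers the success-rate margin is positive in all four categories, while rectangularity leadership is split between methods.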

5. Ablation Studies and Qualitative Observations

Ablation experiments confirm the necessity of architectural components:

Ours w/o ManiFoundation: Naïve contact selection reduces success from 0.86 to 0.27.
Ours w/o closed-loop: Open-loop execution lowers success to approximately 0.07.
Reduced frequency planning (5 or 15 frames): Moderate decline in rectangularity/success.
Next-step only (no full trajectory): Success drops to 0.41.

Qualitative analyses (visualizations of ground-truth and generated folding trajectories) demonstrate coherent behavioral adaptation to different language instructions. Real-robot experiments with an xArm6 and RealSense D435 validate trajectory tracking and generalization across garment categories.

6. Generalization, Constraints, and Prospects

MetaFold’s use of raw point clouds (eschewing keypoints or garment templates) allows a single model to fold T-shirts, tank-tops, and pants. The CVAE and LLaMA Instruct encoder provide open-vocabulary instruction grounding, mapping user utterances to canonical fold actions.

Identified constraints include sensitivity to RGB-D segmentation errors, limitation to per-subtask planning (requiring pre-partitioned garment parts), and the stochasticity of ManiFoundation’s sampling (necessitating ensembling). Future research directions include:

Integrating part-aware segmentation for autonomous decomposition of multi-stage folding,
End-to-end fine-tuning to couple high- and low-level modules,
Exploiting stronger vision-LLMs for free-form instruction following,
Domain adaptation for sim-to-real transfer without heavy real-robot data requirements.

MetaFold demonstrates that modular disentanglement, pairing language-conditioned trajectory planning with continuous robot contact optimization, enables strong generalization and robust real-world performance on complex nonrigid manipulation tasks (Chen et al., 11 Mar 2025).


References (1)

MetaFold: Language-Guided Multi-Category Garment Folding Framework via Trajectory Generation and Foundation Model (2025)
