BiFold: Bimanual Cloth Folding with Language Guidance

Published 27 Jan 2025 in cs.RO and cs.CV | (2501.16458v2)

Abstract: Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. To address the lack of annotated bimanual folding data, we introduce a novel dataset with automatically parsed actions and language-aligned instructions, enabling better learning of text-conditioned manipulation. BiFold attains the best performance on our dataset and demonstrates strong generalization to new instructions, garments, and environments.

Summary

  • The paper presents BiFold, a model that uses language guidance and pre-trained vision-language representations for bimanual cloth folding.
  • It employs a SigLIP-based architecture with LoRA fine-tuning, fusing RGB images, past observations, and text instructions with transformer encoders and decoding actions with convolutional decoders.
  • Experimental evaluations demonstrate improved task success over prior language-conditioned methods in both unimanual and bimanual settings, highlighting the model's robustness and practical applicability.

BiFold: Bimanual Cloth Folding with Language Guidance

The paper presents BiFold, a model that improves robotic cloth manipulation by performing bimanual folding actions guided by language instructions. The system leverages a pre-trained vision-language model to translate abstract commands into precise robotic actions, tackling the complexities of varied garment materials, undefined goal sequences, and intricate manipulation.

Methodology

Architecture Design

BiFold uses a frozen SigLIP model to generate high-level visual and text representations and adapts it to the cloth-folding task with LoRA fine-tuning. The architecture fuses information from RGB images, past observations, and text instructions using transformer encoders (Figure 1). The fused features are then decoded into action value maps by convolutional decoders, chosen over transformer-based decoders to avoid patch artifacts in low-data regimes.

Figure 1: BiFold model architecture: We use a frozen SigLIP model and adapt it using LoRA to obtain tokens from an RGB image and an input text. The same encoders are used to incorporate past observations that provide context to the model.
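A minimal sketch of how such a model could be assembled, assuming the Hugging Face transformers SigLIP checkpoint and peft for LoRA; the module names, sizes, and fusion details are illustrative assumptions rather than the authors' exact implementation:

```python
# Illustrative sketch (not the authors' code): frozen SigLIP encoders with LoRA
# adapters, a transformer fusion module, and a convolutional decoder producing
# per-pixel action value maps.
import torch
import torch.nn as nn
from transformers import SiglipVisionModel, SiglipTextModel
from peft import LoraConfig, get_peft_model

class BiFoldSketch(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        vision = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")
        text = SiglipTextModel.from_pretrained("google/siglip-base-patch16-224")
        lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
        # Base SigLIP weights stay frozen; only the low-rank adapters are trained.
        self.vision = get_peft_model(vision, lora)
        self.text = get_peft_model(text, lora)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Convolutional decoder: token grid -> action value maps (e.g. left/right
        # pick and place), preferred over a transformer decoder to avoid patch
        # artifacts in low-data regimes.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 4, kernel_size=1),
        )

    def forward(self, pixel_values, input_ids, past_pixel_values=None):
        img_tok = self.vision(pixel_values=pixel_values).last_hidden_state
        txt_tok = self.text(input_ids=input_ids).last_hidden_state
        tokens = [img_tok, txt_tok]
        if past_pixel_values is not None:  # past observations provide context
            tokens.append(self.vision(pixel_values=past_pixel_values).last_hidden_state)
        fused = self.fusion(torch.cat(tokens, dim=1))
        # Keep the current-image tokens and fold them back into a spatial grid.
        n = img_tok.shape[1]
        side = int(n ** 0.5)
        grid = fused[:, :n].transpose(1, 2).reshape(-1, fused.shape[-1], side, side)
        return self.decoder(grid)  # (B, 4, H, W) action value maps
```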

For pick-and-place actions, BiFold samples the action positions directly from the predicted distributions, constrained by segmentation masks so that picks fall within the cloth region.
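A rough illustration of this masked sampling step (the exact mechanics are an assumption; the text only states that picks are constrained to the cloth mask): scores outside the segmentation mask are suppressed before sampling a pixel.

```python
import torch

def sample_pick(value_map: torch.Tensor, cloth_mask: torch.Tensor) -> tuple[int, int]:
    """Sample a pick pixel from an (H, W) value map, restricted to the cloth mask."""
    # Forbid picks outside the cloth by sending their scores to -inf.
    logits = value_map.masked_fill(~cloth_mask, float("-inf"))
    probs = torch.softmax(logits.flatten(), dim=0)
    idx = torch.multinomial(probs, num_samples=1).item()
    return divmod(idx, value_map.shape[1])  # (row, col) pixel coordinates
```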

Dataset Annotation

BiFold introduces a new dataset of bimanual folding demonstrations, with actions automatically parsed from VR-collected demonstrations and paired with aligned language instructions. The pipeline also re-renders the dataset images with diverse camera angles and realistic textures from CLOTH3D assets, addressing the limited visual variety and generalization of prior datasets.
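As a sketch of the random-viewpoint portion of such a re-rendering pipeline (the exact camera sampling is not specified here, so the ranges and conventions below are assumptions), one can sample camera positions on a spherical shell around the garment and build a look-at pose:

```python
import numpy as np

def random_lookat_pose(center, radius_range=(1.0, 2.0), min_elevation=0.3):
    """Sample a camera-to-world pose looking at `center` from a random viewpoint.

    Uses the OpenGL convention (camera looks down its -z axis); world z is 'up'.
    """
    center = np.asarray(center, dtype=float)
    r = np.random.uniform(*radius_range)
    azimuth = np.random.uniform(0.0, 2.0 * np.pi)
    elevation = np.random.uniform(min_elevation, np.pi / 2)  # stay above the table plane
    eye = center + r * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    forward = center - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, [0.0, 0.0, 1.0])
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, up, -forward
    pose[:3, 3] = eye
    return pose  # feed to a renderer along with a randomly chosen CLOTH3D texture
```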

Experimental Evaluation

Performance Analysis

BiFold's performance was benchmarked against existing models in both unimanual and bimanual settings, demonstrating superior accuracy in action prediction across varied instructions, including unseen tasks (Tables 1 and 2). This gain can be attributed to an architecture that incorporates context from past actions and makes efficient use of pre-trained vision-language models.

Unimanual vs. Bimanual Folding Tasks

The results show a significant improvement in task success rates, especially on unseen instructions and tasks, thanks to BiFold's context-aware design (Table 1). Evaluations in simulation further confirm the model's robustness to visual changes and its adaptability to diverse language prompts (Table 3).

Figure 2: Re-rendering step: The only visual input in the original dataset is a colored point cloud with uniform colors and patterns (left). We take the simulation mesh and randomly choose a camera position (center). Finally, we apply a texture to the mesh and render RGB-D images (right).

Challenges and Future Directions

Simulation and Real-World Application

A key challenge lies in minimizing the sim-to-real gap and ensuring robust real-world application, given variability in garment textures and dynamics. Future improvements could involve refining the simulation accuracy, leveraging more sophisticated physics engines, and integrating dynamic policy planning using LLM-based solutions.

Potential Applications and Enhancements

The modularity of BiFold's design offers significant potential for expansion into varied manipulation tasks in industries such as healthcare, where adaptive and efficient manipulation is vital. Moreover, augmenting BiFold with advanced feedback mechanisms and dynamic context management could enhance its ability to handle real-world uncertainties and nuanced tasks.

Conclusion

BiFold marks a substantial step forward in language-guided robotic cloth manipulation, addressing many of the intrinsic challenges of garment handling with a focus on practical applications. By combining large-scale vision-language models with a novel dataset generation pipeline, BiFold sets a new state of the art in language-conditioned bimanual manipulation and offers a pathway toward more autonomous and intelligent robotic systems.
