- The paper presents BiFold, a model that couples pre-trained vision-language representations with language guidance for bimanual cloth folding.
- It employs a SigLIP-based architecture with LoRA fine-tuning and convolutional decoders, fusing the current RGB image, past observations, and text instructions for accurate action prediction.
- Experiments demonstrate higher task success than unimanual baselines and prior bimanual methods, highlighting the model's robustness and potential industrial applicability.
BiFold: Bimanual Cloth Folding with Language Guidance
The paper presents BiFold, a model aimed at improving robotic cloth manipulation through bimanual folding actions guided by language instructions. The system leverages a pre-trained vision-language model to translate abstract commands into precise robotic actions, tackling the challenges posed by varied garment materials, ill-defined goal configurations, and intricate manipulation.
Methodology
Architecture Design
BiFold uses a SigLIP model to produce high-level visual and text representations and adapts it to the cloth-folding task with LoRA fine-tuning. Transformer encoders fuse information from the current RGB image, past observations, and the text instruction (Figure 1). The fused features are then decoded into action value maps by convolutional decoders, an approach chosen over transformer-based decoders to avoid patch artifacts in low-data regimes.
Figure 1: BiFold model architecture: We use a frozen SigLIP model and adapt it using LoRA to obtain tokens from an RGB image and an input text. The same encoders are used to incorporate past observations that provide context to the model.
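To make the data flow in Figure 1 concrete, below is a minimal PyTorch-style sketch under stated assumptions: the module names, the LoRA configuration via `peft`, the token dimensions, and the number of decoder heads are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import SiglipModel


class BiFoldSketch(nn.Module):
    """Illustrative sketch: frozen SigLIP encoders adapted with LoRA, a transformer
    fusion module, and convolutional decoders that output action value maps."""

    def __init__(self, dim: int = 768, num_maps: int = 4):
        super().__init__()
        backbone = SiglipModel.from_pretrained("google/siglip-base-patch16-224")
        lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
        self.backbone = get_peft_model(backbone, lora_cfg)  # base weights stay frozen
        fusion_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=4)
        # One convolutional decoder per value map (e.g. pick/place for each arm),
        # upsampling the 14x14 token grid back to the 224x224 image resolution.
        self.decoders = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose2d(dim, 128, kernel_size=4, stride=4),
                nn.ReLU(),
                nn.ConvTranspose2d(128, 1, kernel_size=4, stride=4),
            )
            for _ in range(num_maps)
        ])

    def forward(self, pixel_values, past_pixel_values, input_ids):
        # Shared encoders tokenise the current image, the past observation, and the instruction.
        img = self.backbone.vision_model(pixel_values=pixel_values).last_hidden_state
        past = self.backbone.vision_model(pixel_values=past_pixel_values).last_hidden_state
        txt = self.backbone.text_model(input_ids=input_ids).last_hidden_state
        fused = self.fusion(torch.cat([img, past, txt], dim=1))
        # Fold the current-image tokens back onto a 14x14 grid and decode value maps.
        grid = fused[:, : img.shape[1]].transpose(1, 2).reshape(img.shape[0], -1, 14, 14)
        return [decoder(grid) for decoder in self.decoders]
```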
For pick and place actions, BiFold samples positions directly from the predicted distributions, constrained by segmentation masks to ensure picks occur within the cloth's area.
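A hedged sketch of this sampling step follows: the value map is masked by the cloth segmentation before normalisation, so samples can only land on the garment. Function and argument names here are assumptions for illustration.

```python
import torch

def sample_pick(value_map: torch.Tensor, cloth_mask: torch.Tensor):
    """Sample a pick pixel from a predicted value map, restricted to the cloth.

    value_map:  (H, W) raw scores from a convolutional decoder head.
    cloth_mask: (H, W) boolean mask, True where the garment is visible.
    """
    h, w = value_map.shape
    logits = value_map.reshape(-1).clone()
    logits[~cloth_mask.reshape(-1)] = float("-inf")  # forbid picks off the cloth
    probs = torch.softmax(logits, dim=0)             # normalise over valid pixels only
    idx = torch.multinomial(probs, num_samples=1).item()
    return divmod(idx, w)                            # (row, col) of the sampled pick
```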
Dataset Annotation
BiFold introduces a new dataset of bimanual folding demonstrations collected in a VR environment, where actions are parsed from the demonstrations and paired with language instructions. The pipeline also re-renders dataset images with diverse camera angles and realistic textures from CLOTH3D assets, addressing the limited visual variety and generalization of prior datasets.
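As a rough illustration of how parsed actions could be paired with instructions, the template-based sketch below is an assumption about the annotation format, not the paper's exact scheme; the templates and field names are hypothetical.

```python
import random

# Hypothetical templates for turning a parsed bimanual fold into an instruction.
TEMPLATES = [
    "fold the {part} of the {garment} towards the {target}",
    "grab the {part} of the {garment} and fold it to the {target}",
]

def annotate(action: dict) -> str:
    """Pair a parsed fold action (e.g. from a VR demonstration) with a language instruction."""
    template = random.choice(TEMPLATES)
    return template.format(part=action["part"],
                           garment=action["garment"],
                           target=action["target"])

# Example: a single parsed step from a folding demonstration.
print(annotate({"part": "left sleeve", "garment": "shirt", "target": "center"}))
```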
Experimental Evaluation
BiFold's performance was benchmarked against existing models in both unimanual and bimanual settings, showing superior action-prediction accuracy across varied instructions, including unseen tasks (Tables 1 and 2). This gain can be attributed to an architecture that incorporates context from past actions and makes efficient use of pre-trained vision-language models.
Unimanual vs. Bimanual Folding Tasks
The results showed a significant improvement in task success rates, especially on unseen instructions and tasks, thanks to BiFold's context-aware design (Table 1). Evaluations in simulation environments further confirmed the model's robustness to visual changes and its adaptability to diverse language prompts (Table 3).
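For a rough sense of how action-prediction accuracy can be scored, the sketch below counts a prediction as successful when every pick and place point of both arms falls within a pixel threshold of the annotated action; the threshold and the success criterion are assumptions, not the paper's exact metric.

```python
import numpy as np

def action_success(pred: dict, gt: dict, threshold: float = 20.0) -> bool:
    """Hypothetical success criterion: each predicted pick/place point of both arms
    must lie within `threshold` pixels of the corresponding ground-truth point."""
    keys = ["pick_left", "place_left", "pick_right", "place_right"]
    return all(
        np.linalg.norm(np.asarray(pred[k]) - np.asarray(gt[k])) <= threshold
        for k in keys
    )
```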
Figure 2: Re-rendering step: The only visual input in the original dataset is a colored point cloud with uniform colors and patterns (left). We take the simulation mesh and randomly choose a camera position (center). Finally, we apply a texture to the mesh and render RGB-D images (right).
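A minimal sketch of such a re-rendering step is shown below, using trimesh and pyrender as tooling assumptions (not necessarily what the authors used): sample a camera pose on a hemisphere above the cloth, apply a texture, and render an RGB-D pair.

```python
import numpy as np
import pyrender
import trimesh

def rerender(mesh_path: str, texture_image, radius: float = 1.5):
    """Render an RGB-D view of a simulation mesh from a random camera pose."""
    mesh = trimesh.load(mesh_path, force="mesh")
    # Apply a texture from an asset library (assumes the mesh has UV coordinates).
    mesh.visual = trimesh.visual.TextureVisuals(uv=mesh.visual.uv, image=texture_image)

    scene = pyrender.Scene()
    scene.add(pyrender.Mesh.from_trimesh(mesh))

    # Random camera position on a hemisphere above the cloth, looking at its centroid.
    azimuth = np.random.uniform(0.0, 2.0 * np.pi)
    elevation = np.random.uniform(np.pi / 6, np.pi / 3)
    offset = radius * np.array([np.cos(azimuth) * np.cos(elevation),
                                np.sin(azimuth) * np.cos(elevation),
                                np.sin(elevation)])
    eye = mesh.centroid + offset
    direction = mesh.centroid - eye
    direction /= np.linalg.norm(direction)
    camera_pose = trimesh.geometry.align_vectors([0.0, 0.0, -1.0], direction)
    camera_pose[:3, 3] = eye

    scene.add(pyrender.PerspectiveCamera(yfov=np.pi / 3), pose=camera_pose)
    scene.add(pyrender.DirectionalLight(intensity=3.0), pose=camera_pose)

    color, depth = pyrender.OffscreenRenderer(512, 512).render(scene)
    return color, depth
```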
Challenges and Future Directions
Simulation and Real-World Application
A key challenge lies in minimizing the sim-to-real gap and ensuring robust real-world application, given variability in garment textures and dynamics. Future improvements could involve refining the simulation accuracy, leveraging more sophisticated physics engines, and integrating dynamic policy planning using LLM-based solutions.
Potential Applications and Enhancements
The modularity of BiFold's design offers significant potential for expansion into varied manipulation tasks in industries such as healthcare, where adaptive and efficient manipulation is vital. Moreover, augmenting BiFold with advanced feedback mechanisms and dynamic context management could enhance its ability to handle real-world uncertainties and nuanced tasks.
Conclusion
BiFold marks a substantial step forward in language-guided robotic cloth manipulation, addressing many of the intrinsic challenges of garment handling with a focus on practical industrial applications. By combining large-scale pre-trained vision-language models with a novel dataset generation pipeline, BiFold sets a new benchmark for bimanual manipulation and offers a path towards more autonomous and intelligent robotic systems.