A Critical Analysis of CLAMP: Crowdsourcing Large-scale Haptic Data for Multimodal Robot Perception
The paper introduces the CLAMP framework, encompassing a novel device, dataset, and model designed to advance multimodal robot perception. This work specifically targets the integration of haptic and visual data to improve robotic manipulation capabilities in unstructured environments, where understanding non-geometrical object properties like material and compliance is crucial.
Contributions and Technical Features
The CLAMP device is a low-cost sensor platform tailored for haptic data acquisition. It incorporates multiple sensing modalities, including active and passive thermal sensors, force sensors, microphones, and inertial measurement units (IMUs), allowing it to capture nuanced haptic interactions. This breadth of sensory input addresses a limitation of existing haptic sensing technologies, which often focus on a single modality.
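To make the sensing stack concrete, the sketch below shows one plausible way to represent a single recorded interaction from such a device. The field names, shapes, and labels are illustrative assumptions for this review, not the CLAMP data format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class HapticSample:
    """One recorded grasp interaction (illustrative schema, not CLAMP's actual format)."""
    force: np.ndarray            # (T,) force-sensor trace during the grasp
    imu: np.ndarray              # (T, 6) accelerometer + gyroscope channels
    audio: np.ndarray            # (T_a,) contact-microphone waveform
    thermal_active: np.ndarray   # (T,) heated-sensor temperature trace (heat-flux cue)
    thermal_passive: np.ndarray  # (T,) passive surface-temperature trace
    material_label: str          # e.g. "metal", "plastic", "fabric"


def pad_or_truncate(x: np.ndarray, length: int) -> np.ndarray:
    """Fit a time series to a fixed length so heterogeneous recordings can be batched."""
    out = np.zeros((length,) + x.shape[1:], dtype=np.float32)
    n = min(length, x.shape[0])
    out[:n] = x[:n]
    return out
```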
The CLAMP dataset is claimed to be the largest open-source multimodal haptic dataset. It comprises over 12.3 million data points collected in the wild from more than 5000 household objects by 41 non-expert users operating 16 devices. This crowdsourced collection strategy yields broad material diversity, which helps trained models generalize across real-world scenarios.
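A dataset at this scale is typically consumed through a streaming loader. The following is a minimal sketch that assumes, hypothetically, one archive per interaction containing the haptic time series, an object image, and a material label index; the actual CLAMP release may be organized quite differently.

```python
import glob

import numpy as np
import torch
from torch.utils.data import Dataset


class CrowdsourcedHapticDataset(Dataset):
    """Minimal loader sketch for a crowdsourced visuo-haptic dataset.

    Assumes (hypothetically) one .npz archive per interaction with keys
    'force', 'thermal_active', 'thermal_passive', 'audio', 'image',
    and 'material_label'. This layout is an assumption for illustration.
    """

    def __init__(self, root: str, seq_len: int = 1024):
        self.files = sorted(glob.glob(f"{root}/*.npz"))
        self.seq_len = seq_len

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int):
        rec = np.load(self.files[idx])
        # Stack the haptic channels into a (channels, time) tensor for a 1-D CNN encoder.
        haptic = np.stack([
            self._fit(rec["force"]),
            self._fit(rec["thermal_active"]),
            self._fit(rec["thermal_passive"]),
            self._fit(rec["audio"]),
        ])
        image = rec["image"].astype(np.float32) / 255.0  # (H, W, 3) RGB crop of the object
        label = int(rec["material_label"])
        return torch.from_numpy(haptic).float(), torch.from_numpy(image), label

    def _fit(self, x: np.ndarray) -> np.ndarray:
        """Pad or truncate a 1-D channel to a fixed length for batching."""
        out = np.zeros(self.seq_len, dtype=np.float32)
        n = min(self.seq_len, len(x))
        out[:n] = x[:n]
        return out
```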
Central to the paper is the CLAMP model, a visuo-haptic perception model trained on this dataset. The model pairs a haptic encoder based on the InceptionTime architecture with a visual encoder from GPT-4o. This combination allows the model to integrate heterogeneous data streams effectively, improving material classification accuracy. The paper provides an extensive comparison with state-of-the-art vision-only models, showing that the multimodal input yields superior performance.
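For readers unfamiliar with this style of fusion, the sketch below shows a generic late-fusion visuo-haptic classifier: a simplified InceptionTime-style 1-D convolutional encoder for the haptic channels, concatenated with a precomputed visual embedding and passed to a small classification head. It is an illustrative stand-in under those assumptions, not the authors' exact architecture; the visual features are treated here as a fixed embedding, since GPT-4o is not an openly available encoder.

```python
import torch
import torch.nn as nn


class InceptionBlock1d(nn.Module):
    """Simplified InceptionTime-style block: parallel 1-D convolutions with
    different kernel sizes plus a max-pool branch, concatenated channel-wise."""

    def __init__(self, in_ch: int, out_ch: int = 32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2) for k in (9, 19, 39)
        ])
        self.pool_branch = nn.Sequential(
            nn.MaxPool1d(3, stride=1, padding=1), nn.Conv1d(in_ch, out_ch, 1)
        )
        self.bn = nn.BatchNorm1d(out_ch * 4)

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches] + [self.pool_branch(x)], dim=1)
        return torch.relu(self.bn(y))


class VisuoHapticClassifier(nn.Module):
    """Late-fusion sketch: haptic time-series encoder plus a precomputed visual
    embedding, concatenated into an MLP head. Illustrative only."""

    def __init__(self, haptic_ch: int = 4, vis_dim: int = 512, n_classes: int = 10):
        super().__init__()
        self.haptic_encoder = nn.Sequential(
            InceptionBlock1d(haptic_ch), InceptionBlock1d(128),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),   # -> (B, 128)
        )
        self.head = nn.Sequential(
            nn.Linear(128 + vis_dim, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )

    def forward(self, haptic, vis_embedding):
        h = self.haptic_encoder(haptic)              # (B, 128)
        return self.head(torch.cat([h, vis_embedding], dim=1))


if __name__ == "__main__":
    model = VisuoHapticClassifier()
    haptic = torch.randn(2, 4, 1024)   # batch of 2 grasps, 4 haptic channels, 1024 steps
    vis = torch.randn(2, 512)          # precomputed visual embeddings
    print(model(haptic, vis).shape)    # -> torch.Size([2, 10])
```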
Numerical Results and Discussion
The paper reports that the CLAMP model achieves an overall accuracy of 87% in material classification tasks, significantly outperforming vision-only models such as GPT-4o and CLIP. Furthermore, the model is finetuned across different robotic embodiments, showing that only minimal additional haptic data is required for effective cross-embodiment adaptation. This flexibility is particularly useful for applications involving different grippers and robotic setups.
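The adaptation procedure is not detailed in this review, so the sketch below illustrates one plausible low-data scheme: freeze the pretrained haptic encoder and update only the fusion head on a small set of grasps from the new gripper. It assumes the model structure from the previous sketch and an adaptation set yielding (haptic, visual_embedding, label) triples; both assumptions are this review's, not the paper's.

```python
import torch
from torch.utils.data import DataLoader


def finetune_for_new_embodiment(model, adaptation_data, epochs: int = 5, lr: float = 1e-4):
    """One plausible low-data adaptation recipe (an assumption, not the paper's
    exact procedure): keep the pretrained haptic encoder frozen and train only
    the fusion/classification head on grasps collected with the new gripper."""
    for p in model.haptic_encoder.parameters():
        p.requires_grad = False

    opt = torch.optim.AdamW(model.head.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    loader = DataLoader(adaptation_data, batch_size=16, shuffle=True)

    model.train()
    for _ in range(epochs):
        for haptic, vis_embedding, label in loader:
            opt.zero_grad()
            loss = loss_fn(model(haptic, vis_embedding), label)
            loss.backward()
            opt.step()
    return model
```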
In real-world trials, the CLAMP model enabled tasks such as waste sorting, cluttered object retrieval, and banana ripeness classification with high success rates. For example, when sorting recyclable items, the model classified objects correctly in 9 out of 10 trials, significantly better than baseline vision methods.
Implications and Future Directions
This research has implications for the scalability of data collection in robot manipulation. By harnessing crowdsourced haptic data, it paves the way for the development of more robust perception models. The combination of tactile and visual modalities in a unified framework may inspire future work in multimodal machine learning and encourage further exploration of sensory fusion techniques for robotics.
Potential avenues for further research include expanding the dataset to encompass more diverse objects or environments and improving sensor integration to capture richer data streams. The device design might also be optimized for higher bandwidth data collection, which would facilitate real-time analysis and decision-making in robotics applications.
Conclusion
The CLAMP project stands out as a comprehensive attempt to leverage multimodal data for enhancing robotic perception in unstructured environments. While the current work lays the groundwork, ongoing efforts in refining the device, expanding the dataset, and evolving model architectures are essential to realize the full potential of multimodal haptic sensing in robotics.