Tackling dataset curation challenges towards reliable machine learning: a case study on thermoelectric materials (2512.18653v1)

Published 21 Dec 2025 in cond-mat.mtrl-sci

Abstract: Machine learning (ML)-driven discovery of novel and efficient thermoelectric (TE) materials requires experimental TE datasets of high volume, diversity, and quality. While the largest publicly available dataset, Starrydata2, has a high data volume, it contains inaccurate data due to the inherent limitations of LLM-assisted data curation and to the ambiguous nomenclature and complex formulas of materials in the literature. Another unaddressed issue is the inclusion of multi-source experimental data with high standard deviations and without synthesis information. Using half-Heusler (hH) materials as an example, this work first highlights these errors and inconsistencies, which cannot be filtered with conventional dataset curation workflows. We then propose a statistical round-robin error-based data filtering method to address these issues; the method can also be applied to filter any other material property. Lastly, a hybrid dataset creation workflow, combining data from Starrydata2 with manual extraction, is proposed, and the resulting dataset is analyzed and compared against Starrydata2.
