Observability of information-theoretic compression limits in atomistic datasets

Determine whether lossless compression limits predicted by information theory can be observed in atomistic datasets composed of local atomic environment descriptors used in machine learning interatomic potentials.

Background

Information theory provides formal limits on lossless compression based on the entropy of data distributions. The paper raises the question of whether these theoretical compression limits manifest empirically in atomistic datasets, which consist of descriptors of local atomic environments.
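As a minimal illustration of the entropy bound mentioned above, the sketch below computes the Shannon entropy of a discrete empirical distribution, which lower-bounds the average number of bits per symbol that any lossless code can achieve for independent draws from that distribution. The toy labels and the function name `shannon_entropy_bits` are assumptions made for this example; real atomic-environment descriptors are continuous vectors, and this is not the estimator used in the paper.

```python
import math
from collections import Counter


def shannon_entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    # H = sum_i p_i * log2(1 / p_i), with p_i = count_i / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Hypothetical toy datasets: each letter stands in for one discretized
# local environment (an assumption for illustration only).
redundant = ["A"] * 8        # the same environment repeated eight times
diverse = list("ABCDEFGH")   # eight distinct, equally frequent environments

h_redundant = shannon_entropy_bits(redundant)
h_diverse = shannon_entropy_bits(diverse)

print(f"redundant: {h_redundant:.3f} bits/symbol")
print(f"diverse:   {h_diverse:.3f} bits/symbol")
```

In this toy case the fully redundant dataset has zero entropy, so its repeated entries carry no additional information, while the diverse dataset needs log2(8) = 3 bits per entry. The open question is whether an analogous effect can be measured for real descriptor distributions.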

Establishing this would underpin principled dataset compression and curation strategies for training machine learning interatomic potentials by linking entropy saturation to achievable reductions in dataset size without loss of essential information.

References

The theoretical results from information theory already guarantee the compression limits that can be applied to any generic dataset, but it is not clear whether the same effect can be observed in atomistic datasets.

Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory (2404.12367 - Schwalbe-Koda et al., 18 Apr 2024) in Results, Subsection "Information-theoretical dataset analysis for machine learning potentials"