Determine effects of 3D pre-training data accuracy versus diversity on downstream performance

Reach a definitive conclusion on how 3D pre-training data accuracy (e.g., DFT-calculated equilibrium conformations) and 3D data diversity (e.g., RDKit-generated conformers spanning a broader chemical space) each affect downstream task performance within UniCorn’s molecular representation learning framework.

Background

The authors analyze two sources of 3D pre-training data: DFT-calculated conformations, which are highly accurate but less diverse, and RDKit-generated conformations, which are more diverse. They report initial observations about when accuracy or diversity may be more beneficial.

However, they explicitly state that they do not yet have a definitive conclusion on the overall impact, identifying a need for a more conclusive and systematic determination.

References

While we do not have a definitive conclusion, we have observed some phenomena that offer valuable insights for the community.

UniCorn: A Unified Contrastive Learning Approach for Multi-view Molecular Representation Learning (2405.10343 - Feng et al., 15 May 2024) in Appendix, The Impact of Data Accuracy and Diversity (Section S.5)