Scaling and value quantification of multimodal molecular fusion at foundation-model scale
Determine the feasibility and performance characteristics of scaling multimodal molecular representation fusion architectures—such as combinations of graph, text (SMILES), and 2D/3D representations—to pre-training on datasets exceeding 100 million molecules, and quantify the contribution of such multimodal combinations to downstream drug discovery tasks across diverse benchmarks.
While the molecular use case has been explored in several publications, primarily by combining graph-based models with text-based (SMILES) models, or 2D with 3D molecular representations, scaling such approaches to large-scale pre-training ($>100\text{M}$ molecules) and quantifying the value of these combinations on downstream drug discovery tasks remain open questions.
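To make the fusion setting concrete, the following is a minimal sketch of late fusion over per-modality molecular embeddings. All names (`graph_emb`, `text_emb`, `geom_emb`, `late_fuse`, the embedding and output dimensions) are illustrative assumptions, not a reference implementation from any of the publications above; in practice each embedding would come from a trained graph, SMILES, or 3D encoder rather than random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed-size embeddings for one molecule, one per modality.
# In a real system these would be produced by a graph encoder, a SMILES
# language model, and a 3D-geometry encoder, respectively.
graph_emb = rng.standard_normal(64)
text_emb = rng.standard_normal(64)
geom_emb = rng.standard_normal(64)


def late_fuse(embeddings, weight):
    """Concatenate per-modality embeddings and project to a shared space."""
    z = np.concatenate(embeddings)  # shape: (sum of modality dims,)
    return weight @ z               # shape: (output dim,)


# Random projection standing in for a learned fusion layer.
W = rng.standard_normal((128, 192)) / np.sqrt(192)
fused = late_fuse([graph_emb, text_emb, geom_emb], W)
print(fused.shape)  # fused representation fed to downstream task heads
```

At foundation-model scale, the open questions above concern whether this kind of fused representation, pre-trained on $>100\text{M}$ molecules, measurably outperforms its single-modality components on downstream drug discovery benchmarks.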