Scaling and value quantification of multimodal molecular fusion at foundation-model scale

Determine the feasibility and performance characteristics of scaling multimodal molecular representation fusion architectures—such as combinations of graph, text (SMILES), and 2D/3D representations—to pre-training on datasets exceeding 100 million molecules, and quantify the contribution of such multimodal combinations to downstream drug discovery tasks across diverse benchmarks.

Background

Multimodal fusion has been explored for molecular representation learning, typically by combining graph with text-based models or by integrating 2D and 3D molecular information. However, prior work has not established whether such fusion approaches can be effectively scaled to very large pre-training datasets comparable to modern foundation models.
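The fusion pattern described above can be illustrated with a minimal late-fusion sketch. This is not the paper's architecture; the modality names, embedding dimensions, and the simple concatenation strategy are illustrative assumptions standing in for learned graph, text (SMILES), and 3D-geometry encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings for a batch of 4 molecules, as would
# be produced by separate graph, SMILES(text), and 3D-geometry encoders.
# Dimensions are illustrative, not taken from the paper.
graph_emb = rng.standard_normal((4, 128))
text_emb = rng.standard_normal((4, 256))
geom_emb = rng.standard_normal((4, 64))

def late_fuse(*embeddings):
    """Concatenate per-modality embeddings into one joint representation.

    Real systems often use learned aggregation (attention, gating) instead;
    concatenation is the simplest baseline.
    """
    return np.concatenate(embeddings, axis=-1)

fused = late_fuse(graph_emb, text_emb, geom_emb)
print(fused.shape)  # (4, 448)
```

A learned fusion module (e.g. per-modality attention weights) would replace `late_fuse`, but the input/output contract is the same: one vector per molecule that downstream property or target predictors consume.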

Beyond scalability, it remains unresolved how much benefit each modality adds when combined, particularly in the context of downstream drug discovery tasks where modalities may contribute complementary information. The authors introduce MMELON to investigate these questions but identify the broader issue as open.
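One common way to quantify per-modality value, hinted at by the question above, is a leave-one-modality-out ablation: compare a downstream score with all modalities against the score with each modality removed. The sketch below uses invented benchmark numbers purely to show the bookkeeping; it is not data from the paper.

```python
# Hypothetical downstream scores (e.g. ROC-AUC on one benchmark) with all
# modalities vs. with one modality held out. Numbers are illustrative only.
scores = {
    "all": 0.82,
    "no_graph": 0.78,
    "no_text": 0.80,
    "no_3d": 0.81,
}

def modality_contribution(scores):
    """Attribute to each modality the score drop caused by removing it."""
    full = scores["all"]
    return {
        name.removeprefix("no_"): round(full - score, 3)
        for name, score in scores.items()
        if name != "all"
    }

print(modality_contribution(scores))
# {'graph': 0.04, 'text': 0.02, '3d': 0.01}
```

Because ablation drops need not sum to the total multimodal gain (modalities can carry overlapping information), such numbers are usually reported per benchmark rather than aggregated, which is consistent with the question's framing of "diverse benchmarks".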

References

While the molecular use-case has been explored in several publications, primarily in the context of combining graph with a text-based model or 2-dimensional and 3-dimensional molecular representations, the scaling of such an approach to large-scale pre-training ($>100 \text{M}$ molecules) and the quantification of the value of such combinations on downstream drug discovery tasks remain an open question.

Multi-view biomedical foundation models for molecule-target and property prediction (2410.19704 - Suryanarayanan et al., 25 Oct 2024) in Introduction (Section 1)