FraGNNet: A Deep Probabilistic Model for Mass Spectrum Prediction (2404.02360v1)
Abstract: The process of identifying a compound from its mass spectrum is a critical step in the analysis of complex mixtures. Typical solutions for the mass spectrum to compound (MS2C) problem involve matching the unknown spectrum against a library of known spectrum-molecule pairs, an approach that is limited by incomplete library coverage. Compound to mass spectrum (C2MS) models can improve retrieval rates by augmenting real libraries with predicted spectra. Unfortunately, many existing C2MS models suffer from problems with prediction resolution, scalability, or interpretability. We develop a new probabilistic method for C2MS prediction, FraGNNet, that can efficiently and accurately predict high-resolution spectra. FraGNNet uses a structured latent space to provide insight into the underlying processes that define the spectrum. Our model achieves state-of-the-art performance in terms of prediction error, and surpasses existing C2MS models as a tool for retrieval-based MS2C.
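The library-matching workflow described in the abstract can be illustrated with a minimal sketch. It is not FraGNNet's retrieval pipeline: the binning width, the cosine similarity score, the function names (`bin_spectrum`, `cosine_similarity`, `rank_candidates`), the molecule labels, and the toy peak lists are all assumptions chosen for illustration. The "predicted" entry stands in for the output of some C2MS model and is hard-coded here.

```python
import math

def bin_spectrum(peaks, bin_width=0.01, max_mz=1000.0):
    """Sum peak intensities into fixed-width m/z bins (sparse dict: bin index -> intensity)."""
    binned = {}
    for mz, intensity in peaks:
        if 0.0 <= mz < max_mz:
            idx = int(mz / bin_width)
            binned[idx] = binned.get(idx, 0.0) + intensity
    return binned

def cosine_similarity(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_candidates(query_peaks, library):
    """Score every library spectrum against the query; best match first."""
    query = bin_spectrum(query_peaks)
    scored = [
        (name, cosine_similarity(query, bin_spectrum(peaks)))
        for name, peaks in library.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy data (hypothetical molecules and peaks, not real measurements).
real_library = {"mol_A": [(77.04, 1.0), (105.03, 0.3)]}
# Assumed output of a C2MS model for a compound missing from the real library.
predicted_library = {"mol_B": [(50.03, 0.9), (120.07, 0.6)]}

# Augment the measured library with predicted spectra, then search it.
augmented_library = {**real_library, **predicted_library}
query_spectrum = [(50.03, 1.0), (120.07, 0.5)]
ranked = rank_candidates(query_spectrum, augmented_library)
print(ranked[0][0])  # name of the top-ranked candidate
```

In this toy case the query shares no peaks with the only measured entry, so a search over `real_library` alone could not retrieve the correct compound. Adding the predicted spectrum makes it retrievable, which is the coverage argument the abstract makes for C2MS models.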