Fragmentation trees reloaded (1412.1929v3)

Published 5 Dec 2014 in q-bio.QM and cs.CE

Abstract: Metabolites, small molecules that are involved in cellular reactions, provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem mass spectrometry to identify the thousands of compounds in a biological sample. Today, the vast majority of metabolites remain unknown. Fragmentation trees have become a powerful tool for the interpretation of tandem mass spectrometry data of small molecules. These trees are found by combinatorial optimization, and aim at explaining the experimental data via fragmentation cascades. Obtaining biochemically meaningful results requires an elaborate optimization function. We present a new scoring for computing fragmentation trees, transforming the combinatorial optimization into a maximum a posteriori estimator. We demonstrate the superiority of the new scoring for two tasks: both for the de novo identification of molecular formulas of unknown compounds, and for searching a database for structurally similar compounds, our method performs significantly better than the previous scoring, as well as other methods for this task. Our method can expedite the workflow for untargeted metabolomics, allowing researchers to investigate unknowns using automated computational methods.

Citations (173)

Summary

The paper introduces a novel Bayesian scoring approach for fragmentation trees, improving automated workflows in untargeted metabolomics.
Evaluation on large datasets demonstrated significant improvement in molecular formula identification accuracy and chemical similarity searches compared to prior methods.
The method enhances workflow efficiency for untargeted studies, supporting drug discovery and biomarker identification, and offers a robust framework for future improvements.

Fragmentation Trees Reloaded: A Novel Approach in Untargeted Metabolomics

The paper "Fragmentation Trees Reloaded," authored by Kai Dührkop and Sebastian Böcker, presents a new scoring methodology for untargeted metabolomics. Metabolomics is the comprehensive analysis and characterization of metabolites, the small molecules within cells that directly reflect cellular state. A primary tool for metabolomics is tandem mass spectrometry (MS/MS), which is used to identify compounds in complex biological samples. Despite advances in instrumentation, the majority of metabolites remain unidentified, primarily because reference spectra are missing, especially for exotic or novel compounds.

The paper introduces a new scoring for computing fragmentation trees (FTs) that interpret MS/MS data. The method recasts the combinatorial optimization problem of building an FT as maximum a posteriori (MAP) estimation. Framing the problem statistically gives a more systematic way to construct FTs and supports automated workflows in untargeted metabolomics.
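The MAP reformulation can be written compactly. The following is a sketch in generic notation, where T (a candidate fragmentation tree) and D (the observed MS/MS data) are symbols assumed here for illustration and are not taken from the paper:

```latex
\hat{T} \;=\; \arg\max_{T} \, P(T \mid D)
        \;=\; \arg\max_{T} \, \frac{P(D \mid T)\, P(T)}{P(D)}
        \;=\; \arg\max_{T} \, \bigl[ \log P(D \mid T) + \log P(T) \bigr]
```

The denominator P(D) does not depend on T, so it can be dropped from the maximization. Taking logarithms turns the product of prior and likelihood into a sum, which is what allows the score to be split into additive terms.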

Methodology and Scoring

The researchers propose a Bayesian framework that models both the prior probability of a tree and the likelihood of the observed data given that tree. The resulting scoring function decomposes into several components, including:

Root and Edge Priors: These account for the likelihood of observing certain molecular fragments and sub-formulas. Common and plausible sub-fragments or losses are encoded, leveraging empirical knowledge gained from known metabolites.
Tree Size Prior: The methodology incorporates a bias towards larger trees, which typically indicate more comprehensive explanations of the spectra.
Intensity and Error Modeling: Noise in spectra is modeled using a long-tailed distribution (e.g., Pareto), and mass accuracy is modeled via a normal distribution. These models better represent the experimental conditions and nuances of MS/MS data.
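The components listed above can be combined into a single log-score. The sketch below is a simplified illustration, not the authors' exact scoring: the function names, the way the terms are summed, and all default parameter values (`sd_ppm`, `x_min`, `alpha`, `size_log_prior_per_fragment`) are assumptions chosen only to show how priors, a normal mass-error model, and a Pareto noise model might add up in log space.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_pareto_pdf(x, x_min, alpha):
    """Log density of a Pareto distribution with support x >= x_min."""
    if x < x_min:
        return float("-inf")
    return math.log(alpha) + alpha * math.log(x_min) - (alpha + 1) * math.log(x)


def tree_log_score(edge_log_priors, explained_mass_errors_ppm, noise_intensities,
                   size_log_prior_per_fragment=0.0,
                   sd_ppm=5.0, x_min=0.002, alpha=0.5):
    """Hypothetical log posterior (up to a constant) of one candidate tree.

    edge_log_priors: log prior of each edge (loss) in the tree.
    explained_mass_errors_ppm: mass error of each peak the tree explains.
    noise_intensities: relative intensity of each peak left unexplained.
    All default parameter values are illustrative assumptions.
    """
    # Prior: edge priors plus a per-fragment term standing in for the size prior.
    log_prior = sum(edge_log_priors)
    log_prior += size_log_prior_per_fragment * len(explained_mass_errors_ppm)
    # Likelihood of explained peaks: mass errors assumed normal around zero.
    log_signal = sum(log_normal_pdf(e, 0.0, sd_ppm) for e in explained_mass_errors_ppm)
    # Likelihood of unexplained peaks: intensities assumed Pareto distributed.
    log_noise = sum(log_pareto_pdf(i, x_min, alpha) for i in noise_intensities)
    return log_prior + log_signal + log_noise
```

Under these assumptions, two candidate trees that differ only in how well their fragments match the measured masses are separated by the normal term alone: the tree with smaller mass errors receives the higher score.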

The authors emphasize a strategy of hypothesis-driven recalibration to improve the quality of the resultant FTs. They also provide an entire workflow (illustrated in the paper) that systematically enhances FT computation via iterative parameter optimization.
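One way to picture hypothesis-driven recalibration is as a two-pass procedure: a first-pass tree assigns a molecular formula to each explained peak, and the theoretical masses of those formulas then serve as reference points for correcting the measured masses. The sketch below assumes a simple linear correction fitted by least squares; the paper's actual recalibration function is not described in this summary, and the names `fit_linear_recalibration` and `recalibrate` are hypothetical.

```python
def fit_linear_recalibration(measured, theoretical):
    """Least-squares fit of theoretical ~ slope * measured + intercept.

    measured: m/z values of peaks explained by a first-pass tree.
    theoretical: exact m/z of the fragment formula assigned to each peak.
    """
    n = len(measured)
    if n != len(theoretical):
        raise ValueError("measured and theoretical must have the same length")
    if n < 2:
        raise ValueError("need at least two reference peaks")
    mean_x = sum(measured) / n
    mean_y = sum(theoretical) / n
    sxx = sum((x - mean_x) ** 2 for x in measured)
    if sxx == 0:
        raise ValueError("measured masses must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(measured, theoretical))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def recalibrate(mz_values, slope, intercept):
    """Apply the fitted correction to every peak in the spectrum."""
    return [slope * mz + intercept for mz in mz_values]
```

After recalibration, the tree would be recomputed on the corrected masses, so that systematic mass shifts no longer count against an otherwise plausible explanation.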

Evaluation and Results

The experimental validation used two major datasets: the GNPS and Agilent libraries. The datasets include thousands of metabolites measured under various conditions, reflecting realistic experimental setups. The authors employ a leave-one-out strategy for molecular formula identification and report a significant improvement over prior state-of-the-art methods on both tasks below.

Molecular Formula Identification: The optimized method achieves better rankings for the molecular formula of unknowns, with increased accuracy in top-ranked predictions compared to prior methodologies, such as SIRIUS² and earlier FT approaches.
Chemical Similarity Search: Their approach, leveraging computed FTs, generates similarity metrics for spectral library searches. The results indicate improved performance in retrieving chemically similar compounds, even when the query compound is not present in the reference library.
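Ranking results for molecular formula identification are commonly summarized as the fraction of compounds whose correct formula appears among the top k candidates. The helper below is a minimal sketch of that metric; the function name `top_k_rate` and the convention of using `None` for a compound whose correct formula is missing from the candidate list are assumptions made here, not details from the paper.

```python
def top_k_rate(ranks, k):
    """Fraction of compounds whose correct formula is ranked within the top k.

    ranks: 1-based rank of the correct formula for each compound,
           or None if the correct formula was not among the candidates.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)
```

Comparing two scorings then amounts to computing this rate for each of them on the same set of compounds, for example at k = 1 and k = 5.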

Implications and Future Directions

The advances in FT scoring have both practical and theoretical implications for metabolomics. Practically, the method makes workflows for untargeted studies more efficient, supporting drug discovery, biomarker identification, and metabolic pathway exploration. Theoretically, the Bayesian framework gives a principled basis for further methodological improvements, particularly for extracting structural information from MS/MS data.

The authors acknowledge the potential for integrating isotope pattern analysis to further improve identification accuracy, while noting that isotope pattern data are not always available in real-world datasets.

In conclusion, "Fragmentation Trees Reloaded" advances the interpretation of complex MS/MS data in untargeted metabolomics and provides a foundation for future methodological work.