FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time

Published 17 Apr 2026 in cs.LG and q-bio.QM | (2604.16648v1)

Abstract: In this work, we present FRIGID, a framework with a novel diffusion LLM that generates molecular structures conditioned on mass spectra via intermediate fingerprint representations and determined chemical formulae, training at the scale of hundreds of millions of unlabeled structures. We then demonstrate how forward fragmentation models enable inference-time scaling by identifying spectrum-inconsistent fragments and refining them through targeted remasking and denoising. While FRIGID already achieves strong performance with its diffusion base, inference-time scaling significantly improves its accuracy, surpassing 18% Top-1 accuracy on the challenging MassSpecGym benchmark and tripling the Top-1 accuracy of the leading methods on NPLIB1. Further empirical analyses show that FRIGID exhibits log-linear performance scaling with increasing inference-time compute, opening a promising new direction for continued improvements in de novo structural elucidation. FRIGID code is publicly available at https://github.com/coleygroup/FRIGID


Authors (7)

Summary

The paper introduces a diffusion language model that leverages fragment-based representations and ICEBERG-guided iterative correction to enhance molecular generation accuracy from MS/MS spectra.
FRIGID's accuracy scales log-linearly with training and inference compute; it triples the Top-1 accuracy of leading prior methods on NPLIB1 while decoding more than an order of magnitude faster than DiffMS.
The methodology integrates a BERT-style masked diffusion framework with cross-attention conditioning on molecular formulas and fingerprints to advance de novo molecular elucidation.


Introduction and Paradigm Shift in Mass Spectral Elucidation

The structural elucidation of small molecules from tandem mass spectrometry (MS/MS) is a central challenge in analytical chemistry with broad implications in metabolomics, drug discovery, and environmental science. Traditional retrieval-based methods are constrained by the coverage of chemical libraries and fail on truly novel compounds. This has driven the field toward end-to-end de novo molecular generation conditioned directly on spectral data. Autoregressive models and graph diffusion approaches have advanced the state of the art, but they scale poorly at both training and inference time in ultra-large chemical spaces.

FRIGID (Fragment Refinement via ICEBERG-Guided Inference Diffusion) introduces a diffusion language model (DLM) that exploits fragment-based representations (SAFE) and operates as a masked diffusion language model (MDLM), enabling efficient training at scale and fast, accurate inference. FRIGID also closes the loop with a neural fragmentation simulator (ICEBERG) for targeted, iterative correction at inference, reaching state-of-the-art accuracy and log-linear scaling with inference compute on challenging benchmarks.

Diffusion LLM Architecture and Conditioning

FRIGID-base models the conditional distribution over fragment-based SAFE token sequences given a molecular formula and fingerprint, the latter predicted from the spectrum using MIST. The model employs a BERT-style MDLM, parameterizing denoising transitions non-autoregressively, which significantly accelerates inference compared to graph-diffusion competitors.

Conditioning is achieved via cross-attention: the molecular formula is embedded as a count vector over common elements, and the fingerprint is represented as the subset of active Morgan bit indices, which is embedded and attended to in the same way. These conditioning signals constrain generation by both the experimentally determined formula and spectrum-derived structural features.
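The two conditioning inputs described above can be illustrated with a minimal sketch. The element vocabulary `ELEMENTS` and the helper names are assumptions made for illustration; the source only states that the formula becomes a count vector over common elements and the fingerprint becomes a set of active bit indices.

```python
import re

# Hypothetical element vocabulary; the source only says "common elements".
ELEMENTS = ["C", "H", "N", "O", "P", "S", "F", "Cl", "Br", "I"]


def formula_to_counts(formula: str) -> list[int]:
    """Turn a formula string such as 'C6H12O6' into a count vector over ELEMENTS."""
    counts = dict.fromkeys(ELEMENTS, 0)
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in counts:
            raise ValueError(f"element {symbol!r} is outside the assumed vocabulary")
        counts[symbol] += int(number) if number else 1
    return [counts[element] for element in ELEMENTS]


def active_bits(fingerprint: list[int]) -> list[int]:
    """Return the indices of the bits that are set in a binary fingerprint."""
    return [index for index, bit in enumerate(fingerprint) if bit]
```

Representing the fingerprint by its active indices keeps the conditioning sequence short, since Morgan fingerprints are typically sparse.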

To handle variable molecular sizes, a separate NGBoost model (a gradient-boosted regression tree ensemble) predicts the sequence length from the formula, giving the decoder enough positions to explore candidate structures without spending compute on unused ones.

Figure 1: (a) FRIGID-base utilizes a masked diffusion language model (MDLM) for SAFE sequence generation, jointly conditioned on precursor formula and spectrum-derived fingerprint, with candidate ranking by fingerprint similarity. (b) The inference-time scaling pipeline leverages ICEBERG-consistency to selectively denoise inconsistent substructures.
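The candidate ranking step in panel (a) can be sketched as follows. This is a minimal illustration, assuming each fingerprint is stored as a set of active bit indices and that similarity is measured with the Tanimoto coefficient; the function names and the candidate dictionary are hypothetical, and in practice the candidate fingerprints would be computed from the generated structures by a cheminformatics toolkit.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of active fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def rank_candidates(target_bits: set[int], candidates: dict[str, set[int]]) -> list[str]:
    """Sort candidate identifiers by decreasing similarity to the target fingerprint."""
    return sorted(
        candidates,
        key=lambda name: tanimoto(target_bits, candidates[name]),
        reverse=True,
    )


# Toy usage with made-up bit sets.
predicted = {1, 2, 3, 4}
pool = {"cand_a": {1, 2}, "cand_b": {1, 2, 3, 4}, "cand_c": {7, 8}}
ranking = rank_candidates(predicted, pool)
```

Here `ranking` places `cand_b` first because it shares every bit with the predicted fingerprint.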

Cycle-Consistent Inference-Time Scaling

Conventional diffusion models offer only limited avenues for inference-time improvement (e.g., longer denoising trajectories), which become prohibitive for complex graph domains. FRIGID innovates by integrating ICEBERG, a neural forward fragmentation simulator, to assess cycle consistency: simulated spectra from generated candidates are compared to experimental data, identifying "hallucinated" peaks not observed experimentally. This enables granular, atom-wise scoring of substructure (via fragments) as consistent or inconsistent with the spectrum, which is then aggregated to the token level for targeted remasking and re-denoising.
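The scoring pipeline described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the peak dictionaries (with `mz`, `intensity`, and `atoms` keys), the m/z tolerance, the intensity-weighted atom penalty, and the mean-based token aggregation are all hypothetical choices.

```python
def hallucinated_peaks(simulated: list[dict], observed_mz: list[float], tol: float = 0.01) -> list[dict]:
    """Simulated peaks with no observed peak within `tol` m/z units."""
    return [
        peak for peak in simulated
        if all(abs(peak["mz"] - mz) > tol for mz in observed_mz)
    ]


def atom_scores(simulated: list[dict], observed_mz: list[float], n_atoms: int, tol: float = 0.01) -> list[float]:
    """Accumulate an inconsistency score per atom from hallucinated fragments."""
    scores = [0.0] * n_atoms
    for peak in hallucinated_peaks(simulated, observed_mz, tol):
        for atom in peak["atoms"]:
            # Assumption: penalize each atom by the hallucinated peak's intensity.
            scores[atom] += peak["intensity"]
    return scores


def tokens_to_remask(scores: list[float], token_atoms: list[list[int]], threshold: float) -> list[int]:
    """Aggregate atom scores per token (mean) and return the tokens above threshold."""
    selected = []
    for token_index, atoms in enumerate(token_atoms):
        if atoms and sum(scores[a] for a in atoms) / len(atoms) > threshold:
            selected.append(token_index)
    return selected


# Toy usage: one simulated peak matches the observed spectrum, one does not.
simulated = [
    {"mz": 100.0, "intensity": 0.8, "atoms": [0, 1]},
    {"mz": 150.0, "intensity": 0.5, "atoms": [2, 3]},
]
observed = [100.004, 200.0]
scores = atom_scores(simulated, observed, n_atoms=4)
remask = tokens_to_remask(scores, token_atoms=[[0], [1, 2], [3], []], threshold=0.2)
```

Tokens that map to no atoms (for example, purely syntactic tokens) are never selected in this sketch; whether such tokens should be remasked is a design choice the source does not specify.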

This targeted refinement is repeated over multiple rounds: high-scoring candidates are refined further, and fresh samples are interleaved to maintain diversity. The framework applies to both pure sequence DLMs and graph-based models, but FRIGID's sequence backbone decodes more than an order of magnitude faster than DiffMS, which is what makes repeated refinement practical.

Figure 2: ICEBERG-guided inference-time scaling mechanism. Substructure inconsistencies are identified via simulated-observed spectrum discrepancy, assigned as atom-level scores, and used for recurrent targeted denoising.

Empirical Performance: Scaling, Accuracy, and Efficiency

Training Compute Scaling

FRIGID's accuracy scales log-linearly with the volume of pretraining data, with improvements persisting from millions to hundreds of millions of unlabeled structures. Even with modest compute, FRIGID surpasses previous baselines, and scaling both model and data reliably drives further gains.

Figure 3: Top-1 exact-match accuracy as a function of training compute on NPLIB1. FRIGID-base shows strong scaling, further improved by ICEBERG-guided inference-time refinement.

Benchmark Results

On both the NPLIB1 and MassSpecGym datasets, FRIGID achieves state-of-the-art Top-1 and Top-10 exact-match accuracy, along with higher maximum structural similarity (Tanimoto) and lower minimum MCES distance (a graph edit distance). On NPLIB1, FRIGID triples the Top-1 accuracy of leading prior methods. On MassSpecGym, it achieves over 18% Top-1 accuracy, substantially higher than all earlier systems.

Inference runtime is another major strength: FRIGID's base model decodes a spectrum in 6.58 seconds, more than an order of magnitude faster than DiffMS. This speedup is essential to making iterative inference-time scaling feasible without prohibitive compute costs.

Inference-Time Scaling and Ablation

FRIGID's ICEBERG-guided correction provides consistent, compute-efficient improvements at inference. Accuracy grows log-linearly with inference time, with no saturation observed even at the highest tested compute budgets. Ablations show that random or untargeted correction strategies yield only minor improvements, whereas the targeted, cycle-consistent mechanism is critical to FRIGID's performance.

Figure 4: Inference-time Top-1 accuracy scaling on NPLIB1 and MassSpecGym. The blue dashed line shows log-linear improvement with increased refinement.
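A log-linear trend of this kind means that accuracy is approximately a + b * log(compute). The sketch below shows how such a trend could be fitted with ordinary least squares; the data points are synthetic placeholders chosen to lie exactly on a line, not results from the paper, and the function name is hypothetical.

```python
import math


def fit_log_linear(compute: list[float], accuracy: list[float]) -> tuple[float, float]:
    """Least-squares fit of accuracy = a + b * log2(compute); returns (a, b)."""
    xs = [math.log2(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic placeholder values (NOT from the paper): each doubling of
# compute adds the same fixed amount of accuracy.
budgets = [1, 2, 4, 8]
accuracies = [0.10, 0.12, 0.14, 0.16]
a, b = fit_log_linear(budgets, accuracies)
```

With a base-2 logarithm, the slope `b` reads directly as the accuracy gained per doubling of the inference budget.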

Qualitative Analysis and Error Characterization

Qualitative examples indicate two primary classes of FRIGID success: immediate correct generation and success after several rounds of iterative correction, wherein initially hallucinated fragments are identified and removed in later rounds.

Figure 5: NPLIB1 example spectra. Rows 1–2: Immediate success; rows 3–4: correction via iterative inference leads to final discovery of the true molecule.

Failure analysis reveals two patterns: persistent inability to generate the correct structure, and rare cases where inference-time correction worsens the rank order. The model is robust even when upstream fingerprint predictions are noisy, and close, chemically meaningful matches (as measured by substructure similarity) are common even when the exact structure is not recovered.

Figure 6: NPLIB1 error cases. Row 1 shows persistent failure of candidate generation; row 2 demonstrates occasional negative effects of repeated correction.

Analogous results are observed for MassSpecGym, further illustrating FRIGID's capacity for both immediate and eventual success, with failure cases remaining relatively rare.

Figure 7: MassSpecGym examples: rows 1–2 are immediate successes; rows 3–4 are refined by iterative correction; row 5 is a failure case after multiple rounds.

Theoretical and Practical Implications, and AI Perspective

FRIGID demonstrates the power of sequence-based diffusion language modeling, previously validated in text and general molecular design, for the complex, under-constrained problem of spectral structural elucidation. Key theoretical insights include:

Fragment-centric conditioning robustly constrains the search space while supporting tractable large-scale training.
Cycle-consistent inference-time refinement reshapes generative search, utilizing dual neural models for effective semantic correction.
Log-linear scaling trends in both training and inference are reminiscent of LLM scaling laws [Kaplan et al., 2020], reinforcing the transferability of these principles to the molecular domain.

Practically, this makes high-throughput, de novo annotation of untargeted MS/MS spectra feasible at scale, even when the correct structure lies far outside known chemical libraries. The backbone is modular: alternative spectral fingerprint predictors, spectral simulators, or representations (e.g., SELFIES) can be readily integrated. The framework is amenable to further optimizations, including combined test-time adaptation and even more advanced feedback mechanisms integrating other spectral modalities.

Future directions may include training on richer multi-modal data, expanded fragment ontologies for spectral simulation, or combining with retrieval/candidate enumeration pipelines for open-world metabolomics.

Conclusion

FRIGID sets a new benchmark for spectrum-to-molecule generation: it achieves higher accuracy, unmatched practical efficiency, and strong robustness to prediction errors. The combination of scalable masked diffusion language modeling with cycle-consistent, simulator-guided iterative correction constitutes a conceptual advance in molecular generative modeling, with implications for both theory and practice across chemistry and AI.

The approach provides a template for conditional generation in other domains where the generative process can be coupled with powerful forward simulation for error correction, suggesting new directions in AI for science, molecule discovery, and beyond.
