- The paper introduces a diffusion language model that leverages fragment-based representations and ICEBERG-guided iterative correction to enhance molecular generation accuracy from MS/MS spectra.
- FRIGID achieves log-linear scaling with increased training and inference compute, tripling the Top-1 accuracy of leading prior methods on NPLIB1 while decoding more than an order of magnitude faster than DiffMS.
- The methodology integrates a BERT-style masked diffusion framework with cross-attention conditioning on molecular formulas and fingerprints to advance de novo molecular elucidation.
FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time
Introduction and Paradigm Shift in Mass Spectral Elucidation
The structural elucidation of small molecules from tandem mass spectrometry (MS/MS) is a central challenge in analytical chemistry with broad implications in metabolomics, drug discovery, and environmental science. Traditional retrieval-based methods are constrained by the coverage of chemical libraries and fail on truly novel compounds. This has driven the field toward end-to-end de novo molecular generation conditioned directly on spectral data. Autoregressive models and graph diffusion approaches have advanced the state of the art, but their scalability remains inadequate, particularly for training and inference over ultra-large chemical spaces.
FRIGID (Fragment Refinement via ICEBERG-Guided Inference Diffusion) introduces a diffusion language model (DLM) that exploits fragment-based SAFE representations and operates as a masked diffusion language model (MDLM), enabling efficient training at scale and rapid, accurate inference. FRIGID also closes the loop with a neural simulator (ICEBERG) for targeted, iterative correction at inference, yielding state-of-the-art accuracy and log-linear scaling with inference compute on challenging benchmarks.
Diffusion Language Model Architecture and Conditioning
FRIGID-base models the conditional distribution over fragment-based SAFE token sequences given a molecular formula and fingerprint, the latter predicted from the spectrum using MIST. The model employs a BERT-style MDLM, parameterizing denoising transitions non-autoregressively, which significantly accelerates inference compared to graph-diffusion competitors.
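The non-autoregressive decoding idea can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not FRIGID's implementation: `toy_denoiser` is a hypothetical random stand-in for the BERT-style network, the `[MASK]` token name is illustrative, and the confidence-ranked unmasking schedule is one common choice that the source does not specify.

```python
import random

MASK = "[MASK]"  # illustrative mask token, not taken from the paper


def toy_denoiser(tokens, vocab, rng):
    """Hypothetical stand-in for the denoising network: proposes a
    (token, confidence) pair for every masked position in parallel."""
    return {i: (rng.choice(vocab), rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}


def mdlm_decode(length, vocab, steps=4, seed=0):
    """Start from an all-mask sequence and commit a share of positions per step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_denoiser(tokens, vocab, rng)
        if not proposals:
            break  # nothing left to unmask
        # Spread the still-masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        n_commit = -(-len(proposals) // remaining_steps)  # ceiling division
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        # Commit the most confident proposals; the rest stay masked.
        for pos, (tok, _conf) in ranked[:n_commit]:
            tokens[pos] = tok
    return tokens
```

Because every masked position receives a proposal in a single call, the number of network evaluations is set by `steps` rather than by the sequence length, which is the property that makes this style of decoding fast.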
Conditioning is achieved via cross-attention: the molecular formula is embedded as a count vector over common elements, and the fingerprint is represented by the subset of active Morgan bit indices, which are likewise embedded and attended to. These conditioning signals constrain generation by both the experimental formula and the spectral features.
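The two conditioning inputs can be made concrete with a short sketch. The element list, the 0.5 threshold, and the function names below are assumptions chosen for illustration; the source only states that the formula becomes a count vector over common elements and the fingerprint becomes a set of active bit indices.

```python
import re

# Assumed element vocabulary; the paper's actual set is not given here.
ELEMENTS = ["C", "H", "N", "O", "P", "S", "F", "Cl", "Br", "I"]


def formula_to_counts(formula):
    """Turn a formula such as 'C6H12O6' into a count vector over ELEMENTS."""
    counts = dict.fromkeys(ELEMENTS, 0)
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in counts:
            raise ValueError(f"element {symbol!r} is outside the assumed vocabulary")
        counts[symbol] += int(number) if number else 1
    return [counts[element] for element in ELEMENTS]


def active_bits(fingerprint_probs, threshold=0.5):
    """Keep the indices of fingerprint bits predicted as 'on'."""
    return [i for i, p in enumerate(fingerprint_probs) if p >= threshold]
```

For example, `formula_to_counts("C6H12O6")` places 6, 12, and 6 in the C, H, and O slots and zeros elsewhere. Passing only active indices keeps the conditioning sequence short when the fingerprint is sparse.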
To handle variable molecular sizes, the sequence length is predicted by a separate NGBoost regression tree ensemble conditioned on the formula, balancing exploration of plausible lengths against wasteful decoding.
Figure 1: (a) FRIGID-base utilizes a masked diffusion language model (MDLM) for SAFE sequence generation, jointly conditioned on precursor formula and spectrum-derived fingerprint, with candidate ranking by fingerprint similarity. (b) The inference-time scaling pipeline leverages ICEBERG-consistency to selectively denoise inconsistent substructures.
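Candidate ranking by fingerprint similarity can be sketched as follows. Tanimoto similarity over active bit sets is assumed here as the similarity measure; the caption does not name one, and the function and candidate names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of active bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(target_bits, candidates):
    """Order candidate labels by similarity to the predicted fingerprint.

    `candidates` maps a label (e.g. a generated SMILES string) to the
    active bit indices of that candidate's own fingerprint.
    """
    return sorted(candidates,
                  key=lambda name: tanimoto(target_bits, candidates[name]),
                  reverse=True)
```

In practice the candidate fingerprints would come from a cheminformatics toolkit; plain index lists are used here so the sketch stays self-contained.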
Cycle-Consistent Inference-Time Scaling
Conventional diffusion models offer only limited avenues for inference-time improvement (e.g., longer denoising trajectories), which become prohibitive for complex graph domains. FRIGID instead integrates ICEBERG, a neural forward fragmentation simulator, to assess cycle consistency: simulated spectra from generated candidates are compared to experimental data, identifying "hallucinated" peaks not observed experimentally. This enables granular, atom-wise scoring of substructures (via fragments) as consistent or inconsistent with the spectrum, and the scores are then aggregated to the token level for targeted remasking and re-denoising.
This targeted refinement is repeated over multiple rounds: high-scoring candidates are refined further, and fresh samples are interleaved to maintain diversity. The framework generalizes to both pure sequence DLMs and graph-based models, but FRIGID's sequence backbone offers several orders of magnitude gains in practical inference throughput.
Figure 2: ICEBERG-guided inference-time scaling mechanism. Substructure inconsistencies are identified via simulated-observed spectrum discrepancy, assigned as atom-level scores, and used for recurrent targeted denoising.
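The scoring and remasking steps described above can be sketched in a few lines. Everything concrete here is an assumption for illustration: the peak dictionaries with `mz` and `atoms` keys, the 0.01 Da tolerance, the fraction-based atom score, mean aggregation to tokens, and the 0.5 threshold. The source describes the mechanism but not these details, and ICEBERG's real output format is not reproduced.

```python
MASK = "[MASK]"  # illustrative mask token


def atom_inconsistency(sim_peaks, observed_mz, n_atoms, tol=0.01):
    """Per atom: fraction of its simulated fragments with no observed peak.

    `sim_peaks` is a list of dicts {"mz": float, "atoms": [atom indices]},
    a hypothetical simplification of a fragmentation simulator's output.
    """
    bad = [0] * n_atoms
    seen = [0] * n_atoms
    for peak in sim_peaks:
        # A simulated peak counts as hallucinated when nothing observed lies within tol.
        is_hallucinated = not any(abs(peak["mz"] - mz) <= tol for mz in observed_mz)
        for atom in peak["atoms"]:
            seen[atom] += 1
            bad[atom] += int(is_hallucinated)
    return [b / s if s else 0.0 for b, s in zip(bad, seen)]


def remask(tokens, token_atoms, atom_scores, threshold=0.5):
    """Mask tokens whose atoms are, on average, inconsistent with the spectrum."""
    out = []
    for tok, atoms in zip(tokens, token_atoms):
        score = sum(atom_scores[a] for a in atoms) / len(atoms) if atoms else 0.0
        out.append(MASK if score > threshold else tok)
    return out
```

The remasked sequence would then be handed back to the denoiser, so only the flagged substructure is regenerated while the consistent part of the candidate is preserved.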
Training Compute Scaling
FRIGID demonstrates log-linear scaling in accuracy with increasing volume of pretraining data, with improvements persisting from millions to a billion unlabeled structures. Even with modest compute, FRIGID surpasses previous baselines, and scaling both model and data reliably drives further gains.
Figure 3: Top-1 exact-match accuracy as a function of training compute on NPLIB1. FRIGID-base shows strong scaling, further improved by ICEBERG-guided inference-time refinement.
Benchmark Results
On both the NPLIB1 and MassSpecGym datasets, FRIGID achieves state-of-the-art Top-1 and Top-10 exact-match accuracy, along with higher maximum structural similarity (Tanimoto) and lower minimum MCES (a graph edit distance). On NPLIB1, FRIGID triples the Top-1 accuracy of leading prior methods. On MassSpecGym, it achieves over 18% Top-1 accuracy, substantially higher than all earlier systems.
Inference runtime is another major strength: FRIGID's base model decodes a spectrum in 6.58 seconds, more than an order of magnitude faster than DiffMS. This speedup is essential to making iterative inference-time scaling feasible without prohibitive compute costs.
Inference-Time Scaling and Ablation
FRIGID's ICEBERG-guided correction provides consistent, compute-efficient improvements at inference. The increase in accuracy as a function of expended inference time is log-linear, with no saturation observed even at the highest tested compute budgets. Ablations show that random or untargeted correction strategies yield minor improvements, but the targeted, cycle-consistent mechanism is critical to FRIGID's performance.
Figure 4: Inference-time Top-1 accuracy scaling on NPLIB1 and MassSpecGym. The blue dashed line shows log-linear improvement with increased refinement.
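A log-linear trend means accuracy grows roughly as a + b * log10(compute). The helper below is a minimal sketch for checking such a trend on one's own measurements with ordinary least squares; the function name is hypothetical, and the usage values are synthetic points generated from the formula itself, not results from the paper.

```python
import math


def fit_log_linear(compute, accuracy):
    """Least-squares fit of accuracy ~= a + b * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    b = sxy / sxx           # accuracy gained per 10x increase in compute
    a = mean_y - b * mean_x  # accuracy at compute == 1 (in the chosen units)
    return a, b


# Synthetic illustration only: points that lie exactly on a known line.
budgets = [1, 10, 100, 1000]
synthetic = [0.10 + 0.05 * math.log10(c) for c in budgets]
intercept, slope = fit_log_linear(budgets, synthetic)
```

The slope `b` is the quantity of interest: under a log-linear law, each tenfold increase in compute adds a constant increment of accuracy, and saturation would show up as measured points falling below the fitted line at the largest budgets.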
Qualitative Analysis and Error Characterization
Qualitative examples indicate two primary classes of FRIGID success: immediate correct generation and success after several rounds of iterative correction, wherein initially hallucinated fragments are identified and removed in later rounds.
Figure 5: NPLIB1 example spectra. Rows 1–2: Immediate success; rows 3–4: correction via iterative inference leads to final discovery of the true molecule.
Failure analysis reveals both persistent failures to generate the correct structure and rare cases where inference-time correction worsens the rank order. The model is robust even when upstream fingerprint predictions are noisy, and close, meaningful matches (as measured by substructure similarity) are common even when the exact structure is not recovered.
Figure 6: NPLIB1 error cases. Row 1 shows persistent failure of candidate generation; row 2 demonstrates occasional negative effects of repeated correction.
Analogous results are observed for MassSpecGym, which shows the same pattern of immediate successes, successes after iterative refinement, and comparatively rare failures.
Figure 7: MassSpecGym examples: rows 1–2 are immediate successes; rows 3–4 are refined by iterative correction; row 5 is a failure case after multiple rounds.
Theoretical and Practical Implications, and AI Perspective
FRIGID demonstrates the power of sequence-based diffusion language modeling—previously validated in text and general molecular design—for the complex, under-constrained problem of spectral structural elucidation. Key theoretical insights include:
- Fragment-centric conditioning robustly constrains the search space while supporting tractable large-scale training.
- Cycle-consistent inference-time refinement reshapes generative search, utilizing dual neural models for effective semantic correction.
- Log-linear scaling trends in both training and inference are reminiscent of LLM scaling laws [Kaplan et al., 2020], reinforcing the transferability of these principles to the molecular domain.
Practically, this makes high-throughput, de novo annotation of untargeted MS/MS spectra feasible at scale, even when the correct structure lies far outside known chemical libraries. The backbone is modular: alternative spectral fingerprint predictors, spectral simulators, or representations (e.g., SELFIES) can be readily integrated. The framework is amenable to further optimizations, including combined test-time adaptation and even more advanced feedback mechanisms integrating other spectral modalities.
Future directions may include training on richer multi-modal data, expanded fragment ontologies for spectral simulation, or combining with retrieval/candidate enumeration pipelines for open-world metabolomics.
Conclusion
FRIGID sets a new benchmark for spectrum-to-molecule generation: it achieves higher accuracy, unmatched practical efficiency, and strong robustness to prediction errors. The combination of scalable masked diffusion language modeling with cycle-consistent, simulator-guided iterative correction constitutes a conceptual advance in molecular generative modeling, with implications for both theory and practice across chemistry and AI.
The approach provides a template for conditional generation in other domains where the generative process can be coupled with powerful forward simulation for error correction, suggesting new directions in AI for science, molecule discovery, and beyond.