Non-Targeted Analysis (NTA)
- Non-Targeted Analysis (NTA) is a comprehensive strategy that characterizes all molecular features in complex samples without relying on preselected analytes.
- It employs high-resolution mass spectrometry and advanced chemometric deconvolution to extract and resolve thousands of chemical features.
- NTA facilitates unbiased detection in fields like environmental monitoring, metabolomics, and toxicology to uncover novel contaminants and inform regulatory actions.
Non-Targeted Analysis (NTA) is a measurement and computational paradigm in analytical chemistry whose purpose is to comprehensively characterize the molecular composition of complex samples without a priori restriction to a predefined set of analytes. In contrast to targeted analysis, which focuses on the quantitation of known compounds, NTA leverages high-resolution mass spectrometry (HRMS), advanced chemometric deconvolution, and machine learning to detect, annotate, and prioritize both known and previously uncharacterized substances. Its applications span environmental monitoring, metabolomics, exposomics, and emergency toxicology, where regulatory and health challenges involve identification of unexpected contaminants or novel chemical stressors.
1. Foundational Principles and Motivation
NTA arises from the limitations of targeted workflows, which depend on reference standards and precompiled analyte lists. Regulatory crises such as melamine adulteration (Alsubaie et al., 23 Dec 2025) have highlighted the need for analytical strategies capable of unbiased detection and inference. HRMS platforms (GC-MS, LC-MS, GC×GC-MS, IMS-MS, FTICR-MS) now provide the spectral resolution, dynamic range, and multiplexing capacity to generate broad, information-rich chemical fingerprints, capturing thousands to hundreds of thousands of distinct features in a single run (Cairoli et al., 2022).
The essential objectives of NTA are:
- Detection of both known and unknown compounds within complex matrices without reliance on standard inclusion.
- Statistical and/or mechanistic prioritization in the absence of unique reference entities.
- Robustness to the high dimensionality, correlation, and sparsity endemic to MS-derived data (Claggett et al., 2017).
- Reproducibility and portability required for regulatory validation and cross-laboratory concordance (Alsubaie et al., 23 Dec 2025).
2. Core Methodologies in NTA
2.1. Preprocessing and Feature Extraction
Standardized preprocessing pipelines include:
- Peak detection, alignment, adduct and isotope deconvolution (Cairoli et al., 2022, Anand et al., 2022).
- Baseline correction and retention-time alignment (e.g., Correlation Optimized Warping) (Cairoli et al., 2022, Giebelhaus et al., 2022).
- Binning of m/z axes for matrix factorization or machine-learning tasks (Anand et al., 2022).
- Blank subtraction and rigorous QC filtering to ensure sample specificity and data quality.
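As a minimal sketch of two of the steps listed above (m/z binning and blank subtraction), the following assumes each spectrum is a list of (m/z, intensity) pairs. The function names, the 0.5 bin width, and the 3-fold blank threshold are illustrative assumptions, not values taken from the cited pipelines.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=0.5):
    """Sum (m/z, intensity) peaks into fixed-width m/z bins keyed by bin index."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[math.floor(mz / bin_width)] += intensity
    return dict(binned)

def blank_filter(sample, blank, min_fold=3.0):
    """Keep bins whose sample intensity is at least min_fold times the blank."""
    return {b: i for b, i in sample.items() if i >= min_fold * blank.get(b, 0.0)}

# Hypothetical sample and procedural-blank spectra
sample = bin_spectrum([(100.1, 50.0), (100.3, 25.0), (250.7, 10.0)])
blank = bin_spectrum([(250.6, 8.0)])
kept = blank_filter(sample, blank)
print(kept)  # {200: 75.0}
```

The bin near m/z 250.5 is dropped because its intensity (10.0) is less than three times the blank signal (8.0); the bin near m/z 100 has no blank signal and is retained.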
2.2. Automated Region-of-Interest (ROI) Selection
ROI selection is crucial for reducing data volume and directing subsequent deconvolution. The ψFRMV algorithm exemplifies unsupervised, algorithmic ROI selection using a moving-window SVD on chromatograms. For each window, the ratio of the leading to the second singular value is interpreted under a Fisher F-distribution null, yielding a significance score. Thresholding on this score yields ROI masks with high sensitivity at low limits of detection (down to 10 pg on-column), effectively suppressing baseline noise and enabling robust chemometric analysis downstream (Giebelhaus et al., 2022).
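The moving-window idea can be illustrated with a short sketch. It computes only the leading-to-second singular value ratio per window and applies a plain cutoff; the mapping of that ratio to an F-distribution significance score is omitted. The window length, the threshold of 5, and the synthetic data are assumptions for illustration and are not the published ψFRMV settings.

```python
import numpy as np

def singular_value_ratio(data, window=10):
    """Ratio s1/s2 for each window of consecutive scans (rows) in a scans x m/z array."""
    n_windows = data.shape[0] - window + 1
    ratios = np.zeros(n_windows)
    for start in range(n_windows):
        s = np.linalg.svd(data[start:start + window], compute_uv=False)
        ratios[start] = s[0] / max(s[1], np.finfo(float).tiny)
    return ratios

def roi_mask(ratios, window, n_scans, threshold):
    """Mark every scan that belongs to at least one window above the threshold."""
    mask = np.zeros(n_scans, dtype=bool)
    for start in np.flatnonzero(ratios > threshold):
        mask[start:start + window] = True
    return mask

# Synthetic chromatogram: Gaussian noise plus one rank-1 peak centred on scan 50
rng = np.random.default_rng(0)
scans = np.arange(100)
profile = np.exp(-0.5 * ((scans - 50) / 3.0) ** 2)   # elution profile
spectrum = np.linspace(1.0, 2.0, 20)                 # hypothetical mass spectrum
data = rng.normal(0.0, 1.0, (100, 20)) + 100.0 * np.outer(profile, spectrum)

ratios = singular_value_ratio(data, window=10)
mask = roi_mask(ratios, window=10, n_scans=100, threshold=5.0)
print(np.flatnonzero(mask))  # scan indices flagged as ROI
```

A window containing a single dominant analyte is close to rank one, so its first singular value is much larger than its second; pure-noise windows have singular values of similar size and a ratio near one.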
2.3. Chemometric Deconvolution and Factorization
NTA often employs multivariate curve resolution (MCR), PARAFAC, and PARAFAC2 to simultaneously resolve overlapping chromatographic and spectral features. For a three-way data block (samples × elution time × m/z), these tensor decompositions yield relative concentration profiles, mass spectra, and time courses for resolved components. Non-negativity and trilinearity constraints support physical interpretability, while advanced implementations incorporate neural nets to distinguish chemical features from artifacts (Cairoli et al., 2022).
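For reference, the standard forms of the two tensor models can be written per sample slab. The notation here is an assumption chosen for illustration: X_k is the elution-time × m/z matrix of sample k, B holds elution profiles, A holds mass spectra, D_k is a diagonal matrix of the relative concentrations of the R components in sample k, and E_k is the residual.

```latex
% PARAFAC: elution profiles B are shared by all samples (strict trilinearity)
X_k = B \, D_k \, A^{\top} + E_k, \qquad k = 1, \dots, K

% PARAFAC2: each sample has its own elution profiles B_k,
% constrained to a common cross-product matrix \Phi
X_k = B_k \, D_k \, A^{\top} + E_k, \qquad B_k^{\top} B_k = \Phi \quad \text{for all } k
```

The relaxed constraint in PARAFAC2 is what allows peak shapes and retention times to shift between samples while the mass spectra in A remain common to all of them.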
Dimensionality reduction via non-negative matrix factorization (NMF) or similar techniques facilitates tokenization of spectra, enabling interpretable machine-learning workflows downstream (Anand et al., 2022).
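A minimal NMF sketch using multiplicative updates (Lee and Seung) is shown below. It assumes a non-negative samples × m/z-bin matrix; the rank, iteration count, and synthetic data are illustrative assumptions and do not reproduce the cited workflow.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor a non-negative matrix V (n x m) as W (n x rank) @ H (rank x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic data: 6 samples mixing 2 hypothetical component spectra over 50 bins
rng = np.random.default_rng(1)
abundances = rng.random((6, 2))
spectra = rng.random((2, 50))
V = abundances @ spectra

W, H = nmf(V, rank=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(rel_error)  # relative reconstruction error
```

Rows of H act as recovered component spectra and rows of W as per-sample loadings, which is the sense in which the factorization yields interpretable "tokens" for downstream models.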
2.4. Compound Identification and Annotation
Fragmentation-tree approaches (e.g., Dührkop & Böcker, “Fragmentation trees reloaded”) use combinatorial optimization under Bayesian priors and likelihoods to derive MAP estimates of molecular formulas, directly incorporating domain knowledge of gas-phase fragmentation and structural constraints. These provide improved accuracy in de novo formula assignment and structurally guided similarity search over previous scoring heuristics (Dührkop et al., 2014).
Algorithms such as iMet pair measured spectra with libraries of known compounds by matching not full structures but neighbor relations differing by minimal atomic transformations, leveraging statistical models of biochemical mass differences and MS similarity (Aguilar-Mogas et al., 2016). Standalone fragment formula annotation, as realized in ALPINAC, combines combinatorial knapsack enumeration, graph-theoretic pseudo-fragmentation, and isotopic pattern constraints to reconstruct chemical formulas in EI-HRMS even absent a molecular ion (Guillevic et al., 2021).
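The mass-constrained enumeration underlying formula annotation can be illustrated with a brute-force sketch. This is not the ALPINAC algorithm: it omits the graph-based pseudo-fragmentation, isotopic pattern checks, and any valence rules, and simply lists C/H/N/O compositions whose monoisotopic mass falls within a ppm tolerance. The function name, element set, and 5 ppm tolerance are assumptions.

```python
# Monoisotopic masses in Da
MONO_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

def enumerate_formulas(target_mass, tol_ppm=5.0, elements=("C", "H", "N", "O")):
    """List element-count dicts whose monoisotopic mass matches target_mass."""
    tol = target_mass * tol_ppm * 1e-6
    # Heaviest element first so the search space shrinks quickly
    order = sorted(elements, key=lambda e: MONO_MASS[e], reverse=True)
    hits = []

    def recurse(idx, remaining, counts):
        mass = MONO_MASS[order[idx]]
        if idx == len(order) - 1:
            # Lightest element: solve for its count directly
            n = round(remaining / mass)
            if n >= 0 and abs(remaining - n * mass) <= tol:
                hits.append({**counts, order[idx]: n})
            return
        max_n = int((remaining + tol) // mass)
        for n in range(max_n + 1):
            recurse(idx + 1, remaining - n * mass, {**counts, order[idx]: n})

    recurse(0, target_mass, {})
    return hits

# Neutral monoisotopic mass of caffeine (C8H10N4O2) as a worked input
hits = enumerate_formulas(194.08038)
print(len(hits), {"C": 8, "H": 10, "N": 4, "O": 2} in hits)
```

Several compositions typically fall inside the tolerance window, many of them chemically implausible, which is why practical tools add isotopic, fragmentation, and valence constraints to narrow the candidate list.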
Pathway activity inference and annotation can be integrated within fully generative Bayesian models (e.g., PUMA), which jointly infer pathway activities and probabilistic metabolite identities from untargeted spectral counts, bypassing dependency on pre-annotated metabolite lists (Hosseini et al., 2019).
2.5. Standards-Free Multi-Attribute Scoring
In the absence of authentic reference standards, ISiCLE+MAME pipelines combine in silico property predictions (CCS, mass, isotopic envelope, adducts) with experimental measurements. Multi-attribute scores incorporating empirically tuned weights over 11 (initial) criteria rank candidate formulae. High confidence can be achieved (false discovery rate, FDR, of ~10% at stringent cutoffs), though a moderate false negative rate (FNR) persists due to library and platform limitations (Nuñez et al., 2018).
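A weighted multi-attribute score can be sketched as follows. The two attributes, their tolerances, the weights, and the linear error-to-score mapping are all hypothetical choices for illustration; they are not the 11 criteria or the tuned weights of the cited pipeline.

```python
def attribute_score(error, tolerance):
    """Map an absolute error to [0, 1]: 1 at zero error, 0 at or beyond tolerance."""
    return max(0.0, 1.0 - abs(error) / tolerance)

def rank_candidates(candidates, tolerances, weights):
    """Return (name, score) pairs sorted by weighted mean attribute score."""
    total_weight = sum(weights.values())
    scored = []
    for name, errors in candidates.items():
        score = sum(
            weights[attr] * attribute_score(errors[attr], tolerances[attr])
            for attr in weights
        ) / total_weight
        scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical errors between predicted and measured properties
candidates = {
    "cand_A": {"mass_ppm": 1.0, "ccs_pct": 0.5},
    "cand_B": {"mass_ppm": 4.0, "ccs_pct": 2.5},
}
tolerances = {"mass_ppm": 5.0, "ccs_pct": 3.0}
weights = {"mass_ppm": 2.0, "ccs_pct": 1.0}

ranked = rank_candidates(candidates, tolerances, weights)
print(ranked[0][0])  # cand_A
```

Raising the score cutoff trades fewer false discoveries for more missed identifications, which mirrors the FDR/FNR balance described above.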
3. Statistical Analysis and High-Dimensional Inference
Typical NTA datasets combine many features, comparatively few samples, and strong intra-sample correlation, rendering naive univariate testing (plus FDR) unreliable due to inflated false positives. Empirical evaluations show sparse multivariate models (LASSO, Elastic Net, SPLS-DA) provide superior power-to-discovery ratios in this regime. Cross-validated penalized regression and latent variable modeling are recommended, with feature selection interpreted in variable importance rather than strict hypothesis-testing terms (Claggett et al., 2017).
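To make the sparse-regression idea concrete, the sketch below implements LASSO by cyclic coordinate descent on synthetic data with more features than samples. It is a teaching example rather than a substitute for a maintained library; the penalty value, dimensions, and simulated effect sizes are assumptions, and in practice the penalty would be chosen by cross-validation.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink x toward zero by lam, returning 0 when |x| <= lam."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=300):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()          # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]       # remove feature j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]       # add the updated contribution back
    return beta

# Synthetic data: 50 samples, 200 features, two truly associated features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
beta_true = np.zeros(200)
beta_true[3], beta_true[17] = 3.0, -2.0
y = X @ beta_true + 0.1 * rng.normal(size=50)

beta_hat = lasso_cd(X, y, lam=0.3)
selected = np.flatnonzero(beta_hat)
print(selected)  # indices of features with non-zero coefficients
```

Most coefficients are set exactly to zero by the soft-threshold step, so the selected set is read as a ranking of candidate features rather than a list of significance-tested hits.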
NTA is also integrated with knowledge bases for pathway enrichment and network analysis, with probabilistic frameworks enabling uncertainty propagation from detection through to pathway-level inference (Hosseini et al., 2019).
4. Applications and Domain-Specific Innovations
NTA has been deployed for:
- Environmental assessment: non-targeted screening in river water monitoring, using ensemble chemometric and path modeling to detect and trace compounds of known and unknown identity through spatiotemporal networks. The PARAFAC2 + Process PLS framework links chemical fingerprints to hydrological pathways, offering mechanistic prioritization for regulatory risk assessment (Cairoli et al., 2022).
- Metabolomics and exposomics: large-cohort plasma profiling with “directed NTA” strategies combining LC-MS/MS, rigorous feature deconvolution, and chemical-networking (GNPS) to expand characterized eicosanoids from ~150 to > 500, including ≥ 46 putative novel biomolecules (Watrous et al., 2018).
- Atmospheric science: GC-EI-HRMS pipelines for identification of trace halocarbons or hydrocarbons, employing database-free (algorithmic) annotation to circumvent incomplete spectral libraries (Guillevic et al., 2021).
- Drinking water and public health: dimensionality reduction and machine learning yielding reproducible global suspect lists for prioritization in follow-up targeted analysis (Anand et al., 2022).
5. Reproducibility, Portability, and Best Practice
An audit of 103 NTA tools over two decades reveals a divergence between openness (data/code sharing, 87%) and true operability (lab validation, workflow portability, 43%). The six-pillar reproducibility framework (C1–C6: validation, data, code, open formats, database integration, portability) exposes critical shortfalls in portable implementation (C6, 39%) and laboratory validation (C1, 55%). Recommendations include universal adoption of standardized formats (mzML, mzTab, InChI), workflow engine packaging (Nextflow, Snakemake, Docker/Singularity), and explicit cross-laboratory verification. Bridging these architectural gaps is essential to ensure that NTA pipelines are not merely findable and reusable but reproducibly executable and auditable for regulatory and legal endpoints (Alsubaie et al., 23 Dec 2025).
6. Limitations and Future Directions
Principal limitations include:
- Incomplete coverage of chemical and spectral space in current libraries, limiting annotation recall even with advanced deconvolution (Nuñez et al., 2018, Guillevic et al., 2021).
- Sensitivity to instrument- and matrix-specific noise characteristics (e.g., ψFRMV requires adjustment for CI/APCI, ALPINAC is limited by number and intensity of detected fragments) (Giebelhaus et al., 2022, Guillevic et al., 2021).
- Ambiguities in structural assignment due to isomeric or near-isomeric candidates—methodologies relying on neighbor similarity or probabilistic network context mitigate but do not eliminate these ambiguities (Aguilar-Mogas et al., 2016, Hosseini et al., 2019).
- Persistent reproducibility gaps, particularly in the development and deployment of portable, validated, and standardized end-to-end workflows (Alsubaie et al., 23 Dec 2025).
Current research avenues focus on:
- Integration of orthogonal measurement attributes (retention time, CCS, IR/Raman signatures) with advanced in silico prediction for standards-free identification (Nuñez et al., 2018).
- Algorithmic development for scalable, real-time Bayesian inference and deconvolution, including variational and sequential Monte Carlo extensions (Hosseini et al., 2019).
- Expansion of open, containerized workflow libraries and harmonized cross-domain protocols (BP4NTA, FAIR) to facilitate reproducible, regulatory-grade NTA.
7. Summary Table: Key Workflows and Algorithms in NTA
| Workflow/Algorithm | Key Functionality | Reference |
|---|---|---|
| ψFRMV | ROI detection via moving-window SVD/F-ratio | (Giebelhaus et al., 2022) |
| Fragmentation trees | MAP formula assignment, tree-based similarity | (Dührkop et al., 2014) |
| ALPINAC | EI-HRMS fragment annotation w/o libraries | (Guillevic et al., 2021) |
| iMet | Neighbor-based structure annotation | (Aguilar-Mogas et al., 2016) |
| ISiCLE + MAME | Standards-free multi-attribute scoring | (Nuñez et al., 2018) |
| PUMA | Pathway activity/metabolite annotation via Bayesian inference | (Hosseini et al., 2019) |
| Process PLS + PARAFAC2 | Spatiotemporal modeling of chemical profiles | (Cairoli et al., 2022) |
NTA is now the default analytical paradigm for chemical discovery and monitoring in biosciences, environmental chemistry, and regulatory toxicology, but its continued evolution depends critically on robust algorithmic frameworks, reproducibility-enabling best practices, and community adoption of interoperable, validated pipelines.