NOVA Ultra-Processed Food Assessment
- NOVA-based assessment is a data-driven framework that categorizes foods from minimally processed to ultra-processed using comprehensive nutrient profiles and textual features.
- Machine learning models like LGBM and NLP techniques, including transformer embeddings, are applied to improve classification accuracy and address subjectivity in food labeling.
- The approach enables scalable, reproducible assessments that support regulatory, nutritional epidemiology, and consumer-focused applications by mitigating manual classification challenges.
NOVA-based ultra-processed food assessment refers to computational, data-driven methods for classifying foods into discrete categories of processing (NOVA 1–4) according to the NOVA classification framework. These methods leverage nutrient composition data, ingredient and additive lists, and textual product information, employing machine learning (ML) and natural language processing (NLP) to improve reproducibility and scalability beyond traditional manual assessment. The field addresses both the limitations of expert-driven classification (subjectivity, inter-rater variability, and limited coverage) and the need for scalable, transparent labeling for epidemiological, regulatory, and consumer-facing applications (Arora et al., 2024, Arora et al., 19 Dec 2025, Ispirova et al., 20 May 2025).
1. The NOVA Framework: Definitions and Challenges
The NOVA framework partitions foods according to the degree and purpose of processing:
- NOVA 1: Unprocessed or minimally processed foods—simple treatments without substantive alteration (e.g., raw produce, plain milk)
- NOVA 2: Processed culinary ingredients—extractions or refinements from NOVA 1 foods for cooking (e.g., oils, flours, sugars)
- NOVA 3: Processed foods—addition of NOVA 2 ingredients to NOVA 1 foods via preservation or transformation (e.g., cheese, canned vegetables, bread)
- NOVA 4: Ultra-processed foods (UPFs)—industrial formulations with multiple ingredient classes, marker additives (emulsifiers, sweeteners), and processes such as hydrogenation or extrusion (e.g., sodas, packaged snacks) (Ispirova et al., 20 May 2025).
Criteria are qualitative, based on ingredient lists, the presence of industrial marker additives, and knowledge of manufacturing processes. No explicit quantitative nutrient thresholds are defined within NOVA itself. Consensus labelling remains challenging due to subjectivity: evolving definitions, inconsistent handling of borderline items (e.g., fortified yogurts), and inter-rater disagreement are recurrent obstacles. These issues reduce statistical power in nutrition epidemiology and complicate harmonization between cohorts and jurisdictions (Ispirova et al., 20 May 2025).
2. Data Sources and Feature Construction
Data for NOVA-based assessment derive from large food composition databases:
- FNDDS (USDA): 2970 unique items with 102 nutrient fields (macronutrients, vitamins, minerals, fatty acid subtypes, 37 flavonoids) (Arora et al., 2024).
- Open Food Facts (OFF): 875,075 products in the original dump; 681,950 retained after filtering for nutrient completeness. Includes nutrient concentrations per 100 g, ingredient and additive counts, allergen tags, and NOVA labels (Arora et al., 19 Dec 2025, Ispirova et al., 20 May 2025).
Feature sets are constructed in gradations:
- Full panel: up to 102 nutrients (FNDDS); 44 or 11–12 nutrients (OFF) (Arora et al., 2024, Ispirova et al., 20 May 2025).
- Reduced panels: 65 nutrients (“flavonoid-drop”), minimal 13-nutrient FDA panel (protein, carbohydrate, sugars, fiber, minerals, vitamins, energy, etc.) (Arora et al., 2024).
- OFF panels: 7- or 8-nutrient (key nutrients with high completeness), with an extended panel for models using imputation (Arora et al., 19 Dec 2025).
Additional features include numbers of additives (E-numbers present) and declared allergens.
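As a rough illustration of how such count features might be derived from raw label text, the sketch below counts distinct E-numbers in a free-text ingredient list. The regular expression, the function name `count_additives`, and the sample strings are assumptions made for this example; they are not taken from the cited pipelines, which may rely on pre-tagged database fields instead.

```python
import re

# Hypothetical pattern: "E322", "E 330", "E-150d" (case-insensitive).
E_NUMBER = re.compile(r"\bE[\s-]?(\d{3,4}[a-z]?)\b", re.IGNORECASE)

def count_additives(ingredients_text):
    """Count distinct E-numbers mentioned in a free-text ingredient list."""
    codes = {m.group(1).lower() for m in E_NUMBER.finditer(ingredients_text)}
    return len(codes)

# Illustrative label text (not real product data).
label = "sugar, emulsifier (E322), acid (E 330), E322, colour E150d"
features = {
    "additives_n": count_additives(label),
    "allergens_n": len({"milk", "soy"}),  # declared allergens, assumed given
}
```

A text pattern like this will miss additives declared by name only (for example "soy lecithin"), which is one reason a curated additive field is preferable when the database provides one.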
Preprocessing involves standardization to zero mean and unit variance (for FNDDS); per-100 g normalization (inherent in OFF); imputation of missing values where appropriate; and SMOTE (Synthetic Minority Over-sampling Technique) to counter class imbalance during cross-validation (Arora et al., 2024, Arora et al., 19 Dec 2025).
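The two numeric steps, z-score standardization and minority oversampling, can be sketched in plain Python as follows. This is a simplified stand-in, not the cited implementation: the helper names, the toy data, and the choice of `k` are assumptions, and a real pipeline would more likely use an established SMOTE library. The key point it illustrates is that scaling parameters and synthetic samples are derived from the training fold only, so nothing leaks into the validation fold.

```python
import math
import random
import statistics

def fit_standardizer(rows):
    """Per-column (mean, std) estimated from the training rows only."""
    columns = list(zip(*rows))
    # A constant column has std 0; fall back to 1.0 to avoid division by zero.
    return [(statistics.fmean(c), statistics.pstdev(c) or 1.0) for c in columns]

def standardize(rows, params):
    """Apply z-scoring with previously fitted parameters."""
    return [[(x - mean) / std for x, (mean, std) in zip(row, params)]
            for row in rows]

def smote_like(minority_rows, n_new, k=2, seed=0):
    """Interpolate between a minority row and one of its k nearest
    minority neighbours (the core idea of SMOTE, heavily simplified)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        others = [r for r in minority_rows if r is not base]
        neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> partner
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic

# Toy training fold: two hypothetical nutrient columns, labels are NOVA classes.
train_X = [[10.0, 0.1], [12.0, 0.3], [11.0, 0.2], [13.0, 0.4],
           [30.0, 1.2], [28.0, 1.0], [33.0, 1.5]]
train_y = [4, 4, 4, 4, 1, 1, 1]

params = fit_standardizer(train_X)
train_Z = standardize(train_X, params)
minority = [z for z, y in zip(train_Z, train_y) if y == 1]
synthetic = smote_like(minority, n_new=1)  # brings class 1 up to 4 rows
```

Validation rows would be passed through `standardize` with the same `params` and would never be oversampled.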
3. Machine Learning and NLP Methodologies
The core ML task is multi-class classification of food items into NOVA 1–4, modeled as follows:
- Tree Ensembles:
- LightGBM (LGBM): gradient-boosted decision trees (GBDT) with a log-loss objective and complexity regularization; it gives the highest accuracy on full nutrient panels (e.g., F1 = 0.9411, MCC = 0.8691 on FNDDS 102-panel; accuracy ≈ 0.85 on OFF 8-panel) (Arora et al., 2024, Arora et al., 19 Dec 2025).
- Random Forest (RF): Bootstrap aggregation of decision trees split on Gini impurity. RF performs best on the medium-sized nutrient panel (65 features) (Arora et al., 2024, Ispirova et al., 20 May 2025).
- CatBoost: Applied to OFF, with similar hyperparameterization but slightly lower test performance (Arora et al., 19 Dec 2025).
Hyperparameters are tuned via RandomizedSearchCV, with key ranges for n_estimators, max_depth, learning_rate, and regularization terms for LGBM/GB (Arora et al., 2024, Arora et al., 19 Dec 2025).
- Feature importance is evaluated via mean decrease in Gini (RF) or SHAP (Lundberg & Lee 2017), consistently identifying sugars, sodium, total fat, energy, carbohydrate, and dietary fiber as top predictors (Arora et al., 19 Dec 2025, Ispirova et al., 20 May 2025).
- NLP-based Models:
- Transformer embeddings (BERT, BioBERT, DistilBERT, LegalBERT, XLM-RoBERTa, GPT-2) encode food description, category, and macro class fields into 768-dim vectors.
- Embeddings are concatenated with nutrient panels and used as input to classifiers (LGBM, RF, GB).
- No fine-tuning of LLMs is conducted; the best performance is observed with NLP augmentation (e.g., 13-nutrient + GPT-2 + LGBM: F1 = 0.9583, MCC = 0.9091) (Arora et al., 2024, Ispirova et al., 20 May 2025).
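The concatenation step described in this list can be sketched as below. The three nutrient names are a shortened, hypothetical panel, and the zero-filled vector merely stands in for the output of a frozen transformer encoder; only the 768-dimensional embedding size comes from the text above.

```python
EMBEDDING_DIM = 768  # size of the text embedding, as stated above

def build_feature_vector(nutrients, embedding, nutrient_order):
    """Concatenate an ordered nutrient panel with a fixed text embedding."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError("unexpected embedding size")
    panel = [float(nutrients[name]) for name in nutrient_order]
    return panel + list(embedding)

# Hypothetical, shortened panel; a real panel would list all 13 nutrients.
nutrient_order = ["protein_g", "sugars_g", "sodium_mg"]
nutrients = {"protein_g": 3.2, "sugars_g": 11.0, "sodium_mg": 45.0}

# Placeholder for the embedding of the food description text.
embedding = [0.0] * EMBEDDING_DIM

row = build_feature_vector(nutrients, embedding, nutrient_order)
```

Fixing the nutrient order explicitly keeps column positions stable between training and inference, which matters because tree ensembles identify features by position.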
- Continuous FPro Score (Editor's term):
Some approaches, such as FoodProX, deliver not only discrete NOVA predictions but also a continuous scale of processing intensity:
$$\mathrm{FPro} = \frac{1 - p_1 + p_4}{2}$$
where $p_1$ and $p_4$ are model probabilities for NOVA 1 and NOVA 4, respectively. FPro ≈ 0 indicates unprocessed, FPro ≈ 1 strongly ultra-processed (Ispirova et al., 20 May 2025).
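A minimal sketch of the score, assuming the FoodProX definition FPro = (1 − p1 + p4) / 2 and a probability vector ordered NOVA 1 to NOVA 4; the function name and the example vectors are illustrative only.

```python
def fpro(probabilities):
    """Continuous processing score from a 4-class NOVA probability vector.

    Assumes the order [p_nova1, p_nova2, p_nova3, p_nova4] and the
    definition FPro = (1 - p1 + p4) / 2.
    """
    p1, _, _, p4 = probabilities
    return (1.0 - p1 + p4) / 2.0

unprocessed = fpro([1.0, 0.0, 0.0, 0.0])      # confident NOVA 1
ultra = fpro([0.0, 0.0, 0.0, 1.0])            # confident NOVA 4
uncertain = fpro([0.25, 0.25, 0.25, 0.25])    # uniform probabilities
```

Because only p1 and p4 enter the formula, any item whose probability mass sits entirely on NOVA 2 or NOVA 3 receives the midpoint value of 0.5.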
4. Model Performance, Evaluation, and Interpretability
Model performance is assessed using:
- Precision, Recall, F1-score, Accuracy, and MCC:
- FNDDS: Nutrient-only LGBM achieves F1 = 0.9411 (102-panel), RF F1 = 0.9388 (65-panel), Gradient Boost F1 = 0.9284 (13-panel) (Arora et al., 2024).
- OFF: LGBM achieves ≈0.84–0.85 F1 for 7/8-nutrient panels, declining for extended (44-nutrient) models with higher missingness/imputation (Arora et al., 19 Dec 2025).
- Class-averaged ROC-AUC and AUPRC show high discriminatory power for both unprocessed and ultra-processed classes (AUCs ≳ 0.98; AUPRC ≳ 0.90 in best models) (Arora et al., 2024, Ispirova et al., 20 May 2025).
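For reference, two of the listed metrics can be computed from label vectors as sketched below. The MCC uses the standard multi-class generalization; the F1 shown is the macro average, which is an assumption, since the text above does not state which averaging the reported F1 values use. The label vectors are toy data.

```python
import math
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def multiclass_mcc(y_true, y_pred, labels):
    """Multi-class Matthews correlation coefficient."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_n = Counter(y_true)   # samples per true class
    pred_n = Counter(y_pred)   # samples per predicted class
    numerator = correct * n - sum(true_n[c] * pred_n[c] for c in labels)
    var_true = n * n - sum(true_n[c] ** 2 for c in labels)
    var_pred = n * n - sum(pred_n[c] ** 2 for c in labels)
    denom = math.sqrt(var_true * var_pred)
    return numerator / denom if denom else 0.0

labels = [1, 2, 3, 4]
y_true = [1, 1, 2, 3, 4, 4, 4, 4]
y_pred = [1, 1, 2, 4, 4, 4, 4, 3]   # two NOVA 3/4 confusions

f1_value = macro_f1(y_true, y_pred, labels)
mcc_value = multiclass_mcc(y_true, y_pred, labels)
```

MCC is useful alongside F1 here because NOVA 4 typically dominates these datasets, and MCC remains informative when class sizes are very uneven.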
Feature importance evaluations confirm that high sugar and sodium content drives ultra-processed predictions, while low sugar and sodium content favors minimally processed labels. Additive count serves as a simple but less robust proxy for NOVA 4 (Arora et al., 19 Dec 2025, Ispirova et al., 20 May 2025).
NLP-augmented models consistently outperform nutrient-only models, particularly for higher-resolution ingredient-driven distinctions (Arora et al., 2024, Ispirova et al., 20 May 2025).
5. Large-Scale Application and Cross-Domain Associations
Deployment of NOVA-based ML models on large-scale datasets enables cross-domain analyses:
- Associations with nutritional indices:
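- Higher NOVA classes correlate with poorer Nutri-Score grades (Spearman ρ ≈ +0.40 in OFF; the positive sign is consistent with grades being encoded so that larger values denote poorer quality), reflecting poorer nutritional quality in ultra-processed foods (Arora et al., 19 Dec 2025).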
- Increasing NOVA class associates with higher product carbon footprint (ρ ≈ +0.12) and lower Eco-Scores (ρ ≈ −0.06) (Arora et al., 19 Dec 2025).
- Additive and Allergen Burden:
- Explicit quantification shows NOVA 4 products contain more additives; χ² tests confirm significant association (Arora et al., 19 Dec 2025).
- Allergen analysis identifies gluten and milk as prevalent in ultra-processed products; prevalence and effect sizes are quantified via chi-square and Cramér’s V (Arora et al., 19 Dec 2025).
- Product Category Patterns:
- Community detection in OFF data shows NOVA 3/4 foods are dominated by “Snacks,” “Biscuits and cakes,” “Sweets,” and “Prepared meals,” confirming ingredient and processing-driven clustering (Arora et al., 19 Dec 2025).
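The association statistics named in this section (Spearman's ρ, the χ² statistic, and Cramér's V) can be sketched in plain Python as follows. The helper names and all numbers are illustrative; they do not reproduce the OFF results, and a statistics library would normally be used in practice.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def chi2_and_cramers_v(table):
    """Chi-square statistic and Cramér's V for an r x c contingency table."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    k = min(len(rows), len(cols)) - 1
    return chi2, math.sqrt(chi2 / (n * k))

# Toy data: NOVA class vs. a numerically encoded grade (higher = poorer).
rho = spearman([1, 2, 3, 4, 4], [1, 2, 3, 5, 4])
# Toy 2 x 2 table: rows = NOVA 4 yes/no, columns = allergen present/absent.
chi2, v = chi2_and_cramers_v([[30, 10], [15, 45]])
```

With very large samples such as OFF, χ² tests reach significance even for weak associations, so an effect size such as Cramér's V is the more informative quantity to report.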
6. Tool Deployment and Reproducibility
User-facing tools operationalize ML-based NOVA assessment:
- https://cosylab.iiitd.edu.in/food-processing/ accepts nutrient panels (13, 65, 102) and optional text fields, and outputs the predicted NOVA class and probability vector; inference is served by serialized models via a Python Flask-based REST API (Arora et al., 2024).
- https://cosylab.iiitd.edu.in/foodlabel/ is the OFF-based tool: it accepts 7–8 key nutrients and optional additive/allergen counts, and returns the NOVA class with SHAP-driven attribution for interpretability (Arora et al., 19 Dec 2025).
- All tools support batch input, allow API integration, and provide downloadable reports or JSON outputs. Probabilistic vectors support decision-threshold adjustment (e.g., for ultra-processed), and promote downstream uses in nutrition tracking or risk stratification (Arora et al., 2024, Arora et al., 19 Dec 2025).
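Decision-threshold adjustment on a probability vector might look like the sketch below. It does not call either tool: the function name, the 0.30 threshold, the probability values, and the JSON field names are all hypothetical and chosen only to show how a screening-oriented rule can differ from the default argmax.

```python
import json

def decide(probabilities, upf_threshold=None):
    """Return a NOVA class (1-4) from [p1, p2, p3, p4].

    With upf_threshold set, an item is labelled NOVA 4 whenever p4 reaches
    the threshold, even if another class has the highest probability.
    """
    if upf_threshold is not None and probabilities[3] >= upf_threshold:
        return 4
    return max(range(4), key=lambda i: probabilities[i]) + 1

# Hypothetical batch of model outputs.
batch = [
    [0.10, 0.10, 0.45, 0.35],   # borderline NOVA 3 / NOVA 4 item
    [0.80, 0.10, 0.05, 0.05],   # clearly minimally processed
]

report = [
    {
        "probabilities": p,
        "argmax_class": decide(p),
        "screening_class": decide(p, upf_threshold=0.30),
    }
    for p in batch
]
report_json = json.dumps(report, indent=2)
```

Lowering the threshold trades precision for recall on the ultra-processed class, which may suit screening uses where missing a NOVA 4 item is costlier than a false alarm.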
Model and tool development prioritizes transparency (open data, code, documented splits), reproducibility (stratified cross-validation, published hyperparameters), and scalability to diverse jurisdictions and food typologies (Ispirova et al., 20 May 2025).
7. Implications and Best Practices
By integrating standardized nutrient panels, ingredient/additive features, and NLP-driven textual features, NOVA-based ultra-processed food assessment frameworks achieve high accuracy, scalability, and interpretability. The use of continuous scales such as FPro supports nuanced epidemiological analysis and risk assessment beyond categorical boundaries.
Best practices from recent literature include:
- Use regulated panels for reproducibility (e.g., FDA/EFSA); supplement with engineered features as available.
- Report probabilistic as well as discrete outputs to account for uncertainty and ambiguous cases.
- Choose models aligned with resource constraints (tree ensembles for low-resource classification, LLM-augmented models for maximal resolution).
- Maintain rigorous cross-validation and transparency in code, data selection, and labelling.
- Support continual model refinement as new mechanisms and data become available, and advocate open access for benchmarking and community comparison (Ispirova et al., 20 May 2025, Arora et al., 2024, Arora et al., 19 Dec 2025).
Taken together, these practices suggest that future progress in NOVA-based assessment will depend on harmonizing reference panels, incorporating richer food process metadata, and balancing interpretability against model complexity in deployment.