FoodProX (RF): NOVA Food Processing Inference
- FoodProX (RF) is a computational system that infers NOVA food processing levels from nutrient composition data using an ensemble of decision trees.
- The model preprocesses nutrient panels (imputation, log-transformation, standardization), uses SMOTE to counter class imbalance, and tunes hyperparameters under stratified cross-validation.
- It achieves high performance with F1-scores up to 0.9388 and leverages SHAP analysis for transparent feature attribution in public health applications.
FoodProX, in its Random Forest implementation, is a computational system for the inference of food processing level using only nutrient composition data. Designed to map foods to the NOVA classification (1–4), FoodProX deploys an ensemble-based model that yields both categorical predictions (NOVA classes) and continuous indices of processing, relying strictly on machine-readable nutrient vectors. The model addresses epidemiological reproducibility and public health relevance by providing consistent, scalable inference using nutrient panels routinely available from datasets such as the USDA FNDDS and Open Food Facts (Arora et al., 2024, Ispirova et al., 20 May 2025, Arora et al., 19 Dec 2025).
1. Input Data and Feature Engineering
FoodProX ingests as input a nutrient composition vector $x \in \mathbb{R}^{p}$ for each food item, where $p$ ranges from 13 (minimally mandated by regulatory panels) to 102 (comprehensive analytic panels). The system draws heavily from the FNDDS 2009–2010 “model foods” database and Open Food Facts (OFF), encompassing tens of thousands of foods with labeled NOVA classes (Arora et al., 2024, Ispirova et al., 20 May 2025, Arora et al., 19 Dec 2025).
Nutrient features are selected and preprocessed as follows:
- Full panel (102 nutrients): Macronutrients (energy, protein, total fat, carbohydrate, alcohol, water), a full set of vitamins and minerals, 37 flavonoids, fatty acid subclasses (saturated, MUFA, PUFA, individual chains), cholesterol, caffeine, and theobromine.
- Coarse-grained panel (65 nutrients): 37 flavonoids omitted; retains all other macros, vitamins, minerals, fatty-acid totals, and specialty analytes.
- Minimal panel (13 nutrients / FDA-mandated): Energy, protein, total fat, carbohydrate, sugars, dietary fiber, calcium, iron, sodium, cholesterol, saturated fat, potassium, and vitamin D.
For datasets with missingness, imputations may be performed (mean value imputation for certain OFF fields), and values are log-transformed and z-standardized to control for skew and range (Ispirova et al., 20 May 2025). In the case of branded-product studies, additive count features and missing-data masking may be integrated, but classic FoodProX strictly applies to complete nutrient vectors.
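The log-transform and z-standardization step can be sketched as follows. This is a minimal illustration, not the authors' code: the use of `log1p` (so that zero values stay finite), the function name `log_zscore`, and the sample values are all assumptions.

```python
import math
from statistics import mean, pstdev

def log_zscore(columns):
    """Log-transform, then z-standardize, each nutrient column.

    `columns` maps a nutrient name to a list of non-negative per-100 g values.
    log1p is an assumed choice; the source only says "log-transformed".
    """
    out = {}
    for name, values in columns.items():
        logged = [math.log1p(v) for v in values]
        mu, sigma = mean(logged), pstdev(logged)
        # A constant column has zero spread; map it to 0 rather than divide by 0.
        out[name] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in logged]
    return out

# Hypothetical per-100 g values for three foods.
panel = {"sodium_mg": [2.0, 350.0, 1200.0], "protein_g": [1.0, 8.0, 12.0]}
scaled = log_zscore(panel)
```

After this step every column has zero mean and unit variance on the log scale, which keeps heavily skewed nutrients such as sodium from dominating by range alone.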
2. Random Forest Model Formulation
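FoodProX adopts a Random Forest (RF) classifier architecture, denoted $\{h_t\}_{t=1}^{T}$, constructed as an ensemble of $T$ binary decision trees. Each tree partitions feature space on threshold splits of single nutrients, using the Gini impurity criterion by default:

$$G(S) = 1 - \sum_{c=1}^{4} p_c^{2}$$

where $p_c$ is the fraction of samples in node $S$ that belong to NOVA class $c$. Optimal splits are selected at each node to minimize weighted child-node impurity. Tree votes are aggregated as:

$$P(c \mid x) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid x)$$

where $P_t(c \mid x)$ is the class-$c$ estimate of tree $h_t$, yielding a probability vector $\big(P(1 \mid x), \dots, P(4 \mid x)\big)$ for the NOVA classes. The final categorical label is assigned as $\hat{y} = \arg\max_{c} P(c \mid x)$.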
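To make the split criterion and the vote aggregation concrete, here is a small sketch. It assumes the standard Gini impurity (one minus the sum of squared class fractions) and simple averaging of per-tree class distributions. The helper names and the toy sodium values are illustrative only.

```python
from collections import Counter

NOVA_CLASSES = (1, 2, 3, 4)

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted child-node impurity for the split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def forest_proba(tree_probas):
    """Average per-tree class distributions into one probability vector."""
    t = len(tree_probas)
    return [sum(p[i] for p in tree_probas) / t for i in range(len(NOVA_CLASSES))]

# Hypothetical sodium values (mg/100 g) with NOVA labels.
sodium = [5, 10, 400, 900]
nova = [1, 1, 4, 4]
best = split_impurity(sodium, nova, threshold=10)  # both children are pure

votes = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]]  # three trees, hard votes
proba = forest_proba(votes)
label = NOVA_CLASSES[proba.index(max(proba))]
```

A tree builder would evaluate `split_impurity` over candidate thresholds for each sampled nutrient and keep the minimum. The same averaging applies whether each tree contributes a hard vote or a leaf class frequency.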
Hyperparameters for RF (examples from (Arora et al., 2024, Ispirova et al., 20 May 2025, Arora et al., 19 Dec 2025)):
- Number of trees, $T$: 200–500
- Maximum depth: 20
- Minimum samples per leaf: 2–5
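- Features per split: a random subset of the $p$ input nutrients (e.g. $\sqrt{p}$, the scikit-learn default for classifiers)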
- Split criterion: “gini” (primary) or “entropy”
The model is implemented in scikit-learn and trained using a randomized hyperparameter grid.
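A randomized search draws candidate configurations from ranges like those listed above, rather than enumerating every combination. The sketch below shows only the sampling step. The keys follow scikit-learn's `RandomForestClassifier` parameter names, while the step of 50 trees, the number of draws, and the helper name are assumptions.

```python
import random

# Ranges taken from the hyperparameter list above; the step of 50 is an assumption.
SEARCH_SPACE = {
    "n_estimators": range(200, 501, 50),
    "max_depth": [20],
    "min_samples_leaf": range(2, 6),
    "criterion": ["gini", "entropy"],
}

def sample_configs(space, n_iter, seed=0):
    """Draw n_iter random configurations, one value per hyperparameter."""
    rng = random.Random(seed)
    return [{k: rng.choice(list(v)) for k, v in space.items()} for _ in range(n_iter)]

candidates = sample_configs(SEARCH_SPACE, n_iter=5)
```

Each candidate could then be scored by cross-validation. scikit-learn's `RandomizedSearchCV` bundles this sampling with the cross-validated scoring.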
3. Training Protocol and Validation
Training employs stratified $k$-fold cross-validation to preserve NOVA class proportions in all folds (the fold count $k$ is set per dataset, FNDDS or OFF). Class imbalance is addressed using SMOTE (Synthetic Minority Over-Sampling Technique), applied solely to training folds so that no synthetic samples leak into validation data. Cross-validation is nested within randomized hyperparameter search to optimize macro F1-score or accuracy (Arora et al., 2024, Arora et al., 19 Dec 2025).
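The core pipeline (cf. Arora et al., 2024) proceeds as follows:
- Log-transform and z-standardize the nutrient vectors.
- Partition the data into stratified folds.
- Within each split, apply SMOTE to the training folds only.
- Fit the Random Forest under randomized hyperparameter search.
- Score the held-out fold and aggregate metrics across folds.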
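The leakage-safe ordering of the protocol (split first, oversample the training portion only) can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the published code. The fold assignment, the simplified SMOTE (interpolation toward one of the nearest same-class neighbours), the toy data, and all function names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def smote(X, y, n_neighbors=2, seed=0):
    """Oversample each minority class up to the majority count by interpolating
    between a sample and one of its nearest same-class neighbours."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        if len(rows) < 2:
            continue  # a single sample has no neighbour to interpolate with
        for _ in range(target - len(rows)):
            a = rng.choice(rows)
            near = sorted((r for r in rows if r is not a),
                          key=lambda r: sum((u - v) ** 2 for u, v in zip(a, r)))
            b = rng.choice(near[:n_neighbors])
            gap = rng.random()
            X_out.append([u + gap * (v - u) for u, v in zip(a, b)])
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy data: 2 nutrients, imbalanced NOVA 1 vs NOVA 4.
X = [[float(i), float(i % 3)] for i in range(12)]
y = [1] * 3 + [4] * 9
for test_idx in stratified_folds(y, k=3):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(y)) if i not in held_out]
    X_tr, y_tr = smote([X[i] for i in train_idx], [y[i] for i in train_idx])
    # ... fit the forest on (X_tr, y_tr), then score on the untouched held-out fold ...
```

Because the synthetic points are generated after the split, the held-out fold contains only real foods. In practice, scikit-learn's `StratifiedKFold` and imbalanced-learn's `SMOTE` provide these two steps.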
4. Model Performance and Evaluation Metrics
FoodProX performs strongly on the multi-class NOVA assignment task. On the FNDDS dataset with the 65-nutrient panel, the Random Forest classifier achieves:
- F1-score: 0.9388
- Matthews Correlation Coefficient (MCC): 0.8648
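Performance formulas, with $P$ and $R$ the precision and recall and $TP$, $TN$, $FP$, $FN$ the confusion-matrix counts for a given class:

$$F_1 = \frac{2PR}{P + R}$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$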
The confusion matrix (aggregated over CV folds, 2970 samples) reveals excellent discrimination for NOVA 1 and 4, with modest confusion between adjacent classes:
| True ↓ / Predicted → | NOVA 1 | NOVA 2 | NOVA 3 | NOVA 4 |
|---|---|---|---|---|
| NOVA 1 | 180 | 5 | 8 | 2 |
| NOVA 2 | 4 | 40 | 7 | 2 |
| NOVA 3 | 12 | 6 | 330 | 14 |
| NOVA 4 | 3 | 4 | 15 | 2310 |
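
As a worked illustration of how such metrics are derived, the sketch below computes per-class F1, macro F1, and a multi-class MCC directly from the cells of the table above. The multi-class MCC shown is the standard generalization (Gorodkin's $R_K$), which reduces to the binary formula for two classes. Whether the reported scores used this exact averaging is an assumption, and values computed from the cells as printed need not coincide with the cross-validated figures quoted earlier.

```python
import math

# Rows are true classes, columns predicted classes (NOVA 1-4), as in the table above.
CM = [
    [180, 5, 8, 2],
    [4, 40, 7, 2],
    [12, 6, 330, 14],
    [3, 4, 15, 2310],
]

def per_class_f1(cm):
    """F1 = 2*TP / (2*TP + FP + FN) for each class."""
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                         # rest of the true-class row
        fp = sum(cm[r][c] for r in range(n)) - tp    # rest of the predicted column
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def multiclass_mcc(cm):
    """Multi-class generalization of MCC (Gorodkin's R_K statistic)."""
    n = len(cm)
    s = sum(map(sum, cm))                                    # total samples
    c = sum(cm[i][i] for i in range(n))                      # correctly classified
    t = [sum(cm[i]) for i in range(n)]                       # true count per class
    p = [sum(cm[r][i] for r in range(n)) for i in range(n)]  # predicted count per class
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t)))
    return num / den if den else 0.0

f1 = per_class_f1(CM)
macro_f1 = sum(f1) / len(f1)
mcc = multiclass_mcc(CM)
```

Macro averaging weights the small NOVA 2 class as heavily as the dominant NOVA 4 class, so it is more sensitive to minority-class errors than accuracy.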
Evaluations on the OFF dataset with reduced nutrient panels corroborate the robustness of RF to panel size, attaining balanced macro-F1 ≈ 0.83–0.84, trailing LGBM by ≈1 percentage point (Arora et al., 19 Dec 2025).
5. Feature Attribution and Interpretation
Model interpretability is addressed via SHAP (SHapley Additive exPlanations) and mean decrease in Gini impurity. On the 65-nutrient panel, the eight most salient predictors by mean absolute SHAP value are (Arora et al., 2024):
| Feature | Mean \|SHAP\| |
|---|---|
| Sodium | 0.150 |
| Energy | 0.120 |
| Folic acid | 0.110 |
| Water | 0.100 |
| Total Fat | 0.090 |
| Carbohydrate | 0.085 |
| Potassium | 0.080 |
| Fiber, dietary | 0.075 |
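
A ranking of this kind is obtained by averaging the absolute SHAP value of each feature over all samples. The sketch below shows only that aggregation step. The SHAP matrix is made-up illustrative data, and in practice the values would come from an explainer such as the `shap` package's `TreeExplainer`.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| over samples (rows = samples, columns = features)."""
    n = len(shap_values)
    means = [sum(abs(row[j]) for row in shap_values) / n
             for j in range(len(feature_names))]
    return sorted(zip(feature_names, means), key=lambda fm: fm[1], reverse=True)

# Hypothetical SHAP values for three foods and three nutrients.
names = ["Sodium", "Energy", "Water"]
values = [[0.20, -0.10, 0.05], [-0.10, 0.15, -0.05], [0.15, -0.05, 0.10]]
ranking = rank_by_mean_abs_shap(values, names)
```

Taking absolute values before averaging matters: a nutrient that pushes some foods toward NOVA 1 and others toward NOVA 4 would otherwise cancel out and appear unimportant.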
Across alternate datasets (e.g. OFF, reduced to 7–8 nutrients), sugars, sodium, total fat, energy, and saturated fat consistently emerge as drivers of processing discrimination, especially between NOVA 1 and NOVA 4 (Arora et al., 19 Dec 2025, Ispirova et al., 20 May 2025).
6. Continuous Processing Score (FPro) and Web Deployment
FoodProX introduces a continuous FPro score:

$$\mathrm{FPro} = \frac{1 - p_1 + p_4}{2}$$

where $p_1$ and $p_4$ are the predicted probabilities for NOVA 1 and 4, respectively. This scalar projects the model output onto the conceptual axis from minimally to maximally processed. FPro ≈ 0 indicates unprocessed foods, FPro ≈ 1 designates ultra-processed products, and intermediate values capture ambiguous cases (Ispirova et al., 20 May 2025).
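A short sketch of the score, assuming the form FPro = (1 − p1 + p4) / 2 and a probability vector ordered NOVA 1 to NOVA 4; the function name is illustrative.

```python
def fpro(proba):
    """FPro = (1 - p1 + p4) / 2 from a NOVA 1-4 probability vector."""
    p1, p4 = proba[0], proba[3]
    return (1.0 - p1 + p4) / 2.0

fpro([1.0, 0.0, 0.0, 0.0])  # 0.0: confidently unprocessed
fpro([0.0, 0.0, 0.0, 1.0])  # 1.0: confidently ultra-processed
fpro([0.0, 0.5, 0.5, 0.0])  # 0.5: all mass on the intermediate classes
```

The probabilities for NOVA 2 and 3 enter only implicitly: any mass placed on them lowers both $p_1$ and $p_4$ and pulls the score toward 0.5.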
For dissemination and public use, the trained model is serialized (with preprocessing, typically StandardScaler) and deployed as a Flask-backed web service. End users provide per-100 g nutrient vectors, which are transformed and input to the RF pipeline. The interface returns NOVA class probabilities, class assignment, and the FPro-derived “processing score” (Arora et al., 2024).
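The request handling described above might look roughly like the following. This is a framework-free sketch: the field names in `PANEL`, the response keys, and the `predict_proba` callable (standing in for the serialized scaler and forest) are all hypothetical, and in a Flask deployment this logic would sit inside a route handler.

```python
import json

PANEL = ["energy", "protein", "total_fat"]  # hypothetical names; real panels have 13-102 nutrients

def build_response(payload, predict_proba):
    """Turn a per-100 g nutrient dict into the JSON body a service could return.

    `predict_proba` is a stand-in for the trained pipeline and must return
    four probabilities ordered NOVA 1 to NOVA 4.
    """
    missing = [name for name in PANEL if name not in payload]
    if missing:
        return json.dumps({"error": f"missing nutrients: {missing}"})
    vector = [float(payload[name]) for name in PANEL]  # fixed feature order
    proba = predict_proba(vector)
    return json.dumps({
        "probabilities": dict(zip(["NOVA1", "NOVA2", "NOVA3", "NOVA4"], proba)),
        "nova_class": proba.index(max(proba)) + 1,
        "fpro": (1.0 - proba[0] + proba[3]) / 2.0,
    })

# Stub model used only to exercise the function.
stub = lambda vector: [0.1, 0.1, 0.2, 0.6]
body = build_response({"energy": 250, "protein": 4, "total_fat": 12}, stub)
```

Rejecting incomplete payloads mirrors the requirement that classic FoodProX operates on complete nutrient vectors.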
7. Context, Comparison, and Limitations
FoodProX demonstrates that RF classifiers, given suitably preprocessed and complete nutrient data, reproduce expert NOVA assignments with high fidelity. Other models, such as LightGBM (LGBM) and other gradient-boosting methods, slightly exceed RF on larger or minimal panels, but RF provides a transparent interpretability pipeline with SHAP- and Gini-derived importances. In comparative evaluation, confusion is largely restricted to the NOVA 2/3 and 3/4 class boundaries, mirroring the ambiguity and subjectivity inherent to manual NOVA curation (Arora et al., 19 Dec 2025).
FoodProX explicitly sidesteps subjectivity and non-reproducibility by anchoring predictions to quantifiable nutrient signatures alone. A plausible implication is that further model integration with ingredient-list NLP, as in later multimodal systems, may augment performance in edge cases or when nutrient data is incomplete (Ispirova et al., 20 May 2025). Nonetheless, FoodProX (RF) establishes a reproducible, robust computational baseline for automated food processing assessment.