Open Food Facts: Global Packaged Food Database

Updated 24 April 2026

Open Food Facts is an open-access database containing over 900,000 food products, each annotated with detailed nutritional and processing metadata including NOVA classifications.
The dataset undergoes rigorous preprocessing such as nutrient panel stratification, mean and autoencoder imputation, and z-score normalization to support robust machine-learning applications.
Tree-ensemble models (LightGBM, Random Forest, and CatBoost) are trained on OFF nutrient data to predict NOVA class, and accompanying analyses link ultra-processed foods to poorer nutritional profiles, higher additive counts, and increased environmental burdens.

Open Food Facts (OFF) is an open-access, global database encompassing over 900,000 packaged food products, systematically annotated with structured metadata, including the NOVA classification for degrees of food processing. OFF serves as a primary resource for large-scale computational studies addressing the intersection of food composition, health implications, and environmental sustainability. Researchers have leveraged OFF to construct scalable machine-learning pipelines for automated NOVA classification and to analyze associations between food processing, nutritional quality, environmental impact, and allergenicity (Arora et al., 19 Dec 2025).

1. OFF Dataset Characteristics and Preprocessing

The OFF dataset represents a comprehensive registry of packaged food products, including extensive nutrient panels and categorical labels. For machine learning applications targeting NOVA classification, data preprocessing conforms to the following protocol:

NOVA Label Filtering: Selection is restricted to entries bearing explicit NOVA class labels (1–4), yielding an initial experimental cohort of 875,075 products.
Nutrient Panel Stratification: Three nutrient panels were constructed:
- 44-nutrient panel, harmonized with USDA FNDDS 2009–10, with missing values addressed via mean or autoencoder (AEC) imputation.
- 8-nutrient core panel: Energy (kcal), Protein (g), Total Fat (g), Carbohydrate (g), Sugars (g), Saturated Fat (g), Sodium (mg), Dietary Fiber (g). Entries with >41% missingness were excluded.
- 7-nutrient reduced panel: identical to the 8-nutrient panel, excluding fiber. Here, a 15% missingness threshold was applied.
Feature Cleaning and Normalization: Nutrient features are z-score normalized:

$x' = \frac{x - \mu}{\sigma}$

where $\mu$ and $\sigma$ are the sample mean and standard deviation, applied per nutrient.
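The filtering and normalization steps above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the column names, the `max_missing` parameter, and the row-wise reading of the missingness rule are assumptions, and the sample standard deviation (n − 1 denominator) is used where other tools may default to the population form.

```python
from statistics import mean, stdev


def missing_fraction(row, columns):
    """Fraction of panel nutrients that are absent (None) in one product row."""
    return sum(row.get(c) is None for c in columns) / len(columns)


def drop_incomplete(rows, columns, max_missing=0.0):
    """Keep rows whose missing fraction does not exceed max_missing (hypothetical knob)."""
    return [r for r in rows if missing_fraction(r, columns) <= max_missing]


def zscore_columns(rows, columns):
    """Return new rows with each nutrient scaled to x' = (x - mu) / sigma."""
    stats = {}
    for c in columns:
        values = [r[c] for r in rows]
        stats[c] = (mean(values), stdev(values))  # sample std, n - 1 denominator
    scaled = []
    for r in rows:
        scaled.append({
            # A constant column has sigma == 0; map it to 0.0 instead of dividing.
            c: (r[c] - stats[c][0]) / stats[c][1] if stats[c][1] else 0.0
            for c in columns
        })
    return scaled


# Toy data with hypothetical column names; real OFF exports use different fields.
panel = ["energy_kcal", "protein_g"]
rows = [
    {"energy_kcal": 100.0, "protein_g": 1.0},
    {"energy_kcal": 200.0, "protein_g": 2.0},
    {"energy_kcal": 300.0, "protein_g": None},  # dropped: incomplete panel
    {"energy_kcal": 300.0, "protein_g": 3.0},
]
complete = drop_incomplete(rows, panel)
scaled = zscore_columns(complete, panel)
print(scaled)
```

Normalizing after the drop step means $\mu$ and $\sigma$ are estimated only from products that actually enter the model.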

Final Class Distribution in 8-Nutrient Panel:

| NOVA class | Product count |
|------------------------------|---------------|
| NOVA 1 (minimally processed) | 81,298 |
| NOVA 2 (processed culinary) | 28,865 |
| NOVA 3 (processed) | 151,229 |
| NOVA 4 (ultra-processed) | 415,181 |

This curation enables robust downstream analysis by ensuring a large, well-characterized, and minimally imputed dataset for each processing class (Arora et al., 19 Dec 2025).

2. Machine-Learning Methodologies for NOVA Classification

State-of-the-art supervised learning algorithms are applied to OFF nutrient data for multi-class NOVA prediction:

Random Forest (RF): Ensemble of decision trees, advantageous for high-dimensional, noisy datasets.
LightGBM (LGBM): Gradient-boosted decision tree framework optimized for computational and memory efficiency; minimizes the regularized objective:

$\mathcal{L} = \sum_{i=1}^N \ell(y_i, \hat y_i) + \sum_{k=1}^K \Omega(f_k)$

where $\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_j w_j^2$ (with $T$ as number of leaves, $w_j$ as leaf weights).

CatBoost (CB): Gradient boosting with specialized categorical encoding and ordered boosting to limit overfitting.
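To make the regularized objective given for LightGBM concrete, the following sketch evaluates $\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_j w_j^2$ for a toy ensemble. The function names, leaf weights, and loss values are invented for illustration; a real library computes these quantities internally during training.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum_j w_j^2 for a single tree."""
    n_leaves = len(leaf_weights)  # T in the formula
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def regularized_objective(losses, trees, gamma, lam):
    """Sum of per-sample losses plus the penalty of every tree in the ensemble."""
    return sum(losses) + sum(tree_penalty(t, gamma, lam) for t in trees)


# Hypothetical numbers: three per-sample losses and two small trees.
losses = [0.2, 0.1, 0.4]
trees = [[0.5, -0.5], [1.0, 0.0, -1.0]]
objective = regularized_objective(losses, trees, gamma=1.0, lam=2.0)
print(objective)
```

Larger $\gamma$ penalizes trees with many leaves, while larger $\lambda$ shrinks leaf weights toward zero.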

Hyperparameter optimization utilizes scikit-learn’s RandomizedSearchCV over 50–100 configurations, using stratified 5-fold cross-validation, with the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance. Representative optimal parameters for LGBM in the 8-nutrient model include a boosting type of ‘gbdt’, 31 max leaves, unlimited tree depth, learning rate 0.05, 500 estimators, subsample and colsample ratios of 0.8, and regularization coefficients set to 0.1.
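The representative LightGBM settings listed above could be written as a parameter dictionary for the scikit-learn style `LGBMClassifier`. This is a sketch under assumptions: the source does not say which regularization coefficients were set to 0.1, so mapping them to both `reg_alpha` (L1) and `reg_lambda` (L2) is a guess, and `build_classifier` is a hypothetical helper.

```python
# Representative settings from the text, expressed with LGBMClassifier parameter names.
LGBM_PARAMS = {
    "boosting_type": "gbdt",
    "num_leaves": 31,
    "max_depth": -1,          # -1 means no depth limit in LightGBM
    "learning_rate": 0.05,
    "n_estimators": 500,
    "subsample": 0.8,         # row sampling ratio
    "colsample_bytree": 0.8,  # feature sampling ratio per tree
    "reg_alpha": 0.1,         # assumption: L1 coefficient
    "reg_lambda": 0.1,        # assumption: L2 coefficient
}


def build_classifier(params=None):
    """Instantiate the model if lightgbm is installed; otherwise return None."""
    try:
        from lightgbm import LGBMClassifier
    except ImportError:
        return None
    return LGBMClassifier(**(params or LGBM_PARAMS))


model = build_classifier()
print("lightgbm available:", model is not None)
```

When combining such a model with SMOTE and cross-validation, oversampling is normally applied inside each training fold only, so that synthetic samples never leak into the validation split.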

3. Model Evaluation: Metrics and Performance

Performance is reported using multi-class metrics:

Accuracy: $\displaystyle \frac{TP + TN}{TP+TN+FP+FN}$
Precision: $\displaystyle \frac{TP}{TP+FP}$
Recall: $\displaystyle \frac{TP}{TP+FN}$
F1-score: $\displaystyle \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Matthews Correlation Coefficient (MCC): A correlation between predicted and true labels, ranging from −1 to +1, that remains informative when classes are imbalanced.
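As an illustration of how these metrics follow from a confusion matrix, the sketch below computes per-class precision, recall, and F1 (the harmonic mean of precision and recall), overall accuracy, and the multi-class generalization of MCC. The matrix values are invented, and the orientation `cm[true][pred]` is an assumption of this example.

```python
import math


def per_class_scores(cm):
    """Precision, recall and F1 per class from a confusion matrix cm[true][pred]."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp  # predicted c, truly another class
        fn = sum(cm[c]) - tp                       # truly c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def accuracy(cm):
    """Share of all samples that lie on the diagonal."""
    total = sum(map(sum, cm))
    return sum(cm[i][i] for i in range(len(cm))) / total


def multiclass_mcc(cm):
    """Multi-class MCC computed from row, column and diagonal totals."""
    k = len(cm)
    s = sum(map(sum, cm))                                    # total samples
    c = sum(cm[i][i] for i in range(k))                      # correctly classified
    t = [sum(cm[i]) for i in range(k)]                       # true count per class
    p = [sum(cm[r][j] for r in range(k)) for j in range(k)]  # predicted count per class
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(x * x for x in p)) * (s * s - sum(x * x for x in t)))
    return num / den if den else 0.0


# Hypothetical two-class matrix, only to exercise the functions.
cm = [[2, 1],
      [1, 2]]
print(per_class_scores(cm), accuracy(cm), multiclass_mcc(cm))
```

For a four-class NOVA problem the same functions apply to a 4×4 matrix; macro-averaged F1 is then the mean of the per-class F1 values.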

Sensitivity analysis across nutrient panels is shown below:

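| Panel (nutrients) | Model | Accuracy | F1-score | MCC |
|----------------------|-------|----------|----------|------|
| 7 (missing dropped) | LGBM | 0.84 | 0.84 | 0.69 |
| 7 (missing dropped) | RF | 0.84 | 0.83 | 0.68 |
| 7 (missing dropped) | CB | 0.80 | 0.79 | 0.61 |
| 8 (missing dropped) | LGBM | 0.85 | 0.84 | 0.69 |
| 8 (missing dropped) | RF | 0.84 | 0.83 | 0.68 |
| 8 (missing dropped) | CB | 0.81 | 0.80 | 0.63 |
| 44 (mean-imputed) | LGBM | 0.80 | 0.79 | 0.62 |
| 44 (mean-imputed) | RF | 0.80 | 0.79 | 0.63 |
| 44 (mean-imputed) | CB | 0.78 | 0.76 | 0.58 |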

The highest accuracy and F1-score (0.85 and 0.84) are achieved by LightGBM on the 8-nutrient, drop-missing dataset, with strong generalization (train–validation accuracy/F1 gap <2% over 10 folds). The confusion matrix shows that errors are concentrated between adjacent NOVA classes, particularly NOVA 2 and 3, while discrimination between NOVA 1 and 4 is robust (Arora et al., 19 Dec 2025).

4. Exploratory Analysis: Nutritional, Environmental, and Allergenic Correlates

OFF’s curation allows multifactorial analysis of processed foods. The major findings are:

NOVA vs. Nutri-Score: Among 677,073 products, a moderate association is observed (Cramér’s V = 0.289, Spearman’s ρ = +0.40). The prevalence of poor Nutri-Grades (D/E) is much higher in NOVA 4 (33.11% and 23.84%) than in NOVA 1 (4.98% and 3.30%), signifying that ultra-processed products are disproportionately associated with poor nutritional quality.
NOVA vs. Eco-Score: Analyzed across 415,502 products, there is a weak association (Cramér’s V = 0.112, Spearman’s ρ = –0.06), with NOVA 4 skewed toward less favorable Eco-Scores (C/D).
Carbon Footprint: For 385 products with life-cycle assessment (LCA) data, NOVA class shows a weak positive correlation with CO₂e emissions.
Allergen Prevalence: Out of 270,088 products, the prevalence of allergens (notably gluten and milk) is substantial in NOVA 4 (gluten 25%, milk 34%), with a small-to-moderate association (Cramér’s V = 0.134).
Additives and Food Categories: Additive count is positively correlated with processing level (Spearman’s ρ = +0.42, Kruskal-Wallis H= 375,971.8). NOVA 4 is disproportionately represented by categories “Cakes,” “Snacks,” and “Prepared meals,” forming hubs in the food-category network.
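Several of the findings above are reported as Cramér's V, which rescales the chi-square statistic of a contingency table to the range 0 to 1. The sketch below shows the uncorrected form, $V = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}$; whether the source applied a bias correction is not stated, and the tables used here are invented.

```python
import math


def cramers_v(table):
    """Uncorrected Cramér's V for a contingency table given as a list of rows."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected:
                chi2 += (observed - expected) ** 2 / expected
    dof = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * dof))


# Hypothetical counts: rows could be NOVA classes, columns a binary attribute.
perfect = [[10, 0], [0, 10]]  # attribute fully determined by class
independent = [[5, 5], [5, 5]]  # attribute unrelated to class
print(cramers_v(perfect), cramers_v(independent))
```

On this scale, the reported values of roughly 0.11 to 0.29 sit much closer to independence than to a deterministic relationship, which is why the text describes them as weak to moderate.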

Taken together, these analyses associate higher processing levels with poorer nutritional quality and higher additive counts (moderate associations), with allergen prevalence (small-to-moderate), and with less favorable environmental indicators (weak associations) (Arora et al., 19 Dec 2025).

5. Key Insights and Implications

The application of machine learning to OFF nutritional data demonstrates that nutrient concentration alone suffices for robust, scalable NOVA classification (accuracy ~85%). Ultra-processed foods (NOVA 4) are linked to impaired Nutri-Scores, greater additive loads, higher carbon footprints, and elevated prevalence of major allergens. These findings underscore the epidemiological and environmental implications of food processing degree, with broad relevance to regulatory, public health, and sustainability domains.

A plausible implication is that nutrient-based automated classification can serve as a proxy for identifying ultra-processed foods at scale, informative for both researchers and policy-makers. Furthermore, the correlations highlighted between processing, nutrition, environmental footprint, and allergenicity suggest that interventions in food system design could yield compounded health and environmental benefits (Arora et al., 19 Dec 2025).

6. Web Tool for NOVA Prediction and Stakeholder Access

A web-accessible NOVA classification tool, available at https://cosylab.iiitd.edu.in/foodlabel/, operationalizes these findings. It accepts per-100 g nutrient profiles (mandatory: Energy, Protein, Total Fat, Carbohydrate, Sugars, Saturated Fat, Sodium; optional: Fiber), returning predicted NOVA class (1–4), associated class-probability vector, and, where desired, SHAP-based feature importance explanations. By leveraging the largest NOVA-annotated dataset available, the tool provides industry, researchers, and consumers with standardized, transparent, and scalable food processing assessment (Arora et al., 19 Dec 2025).
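Before entering values into such a tool, it can help to check that a per-100 g profile is complete. The helper below is a hypothetical pre-check only: the field names are invented for this sketch, and it does not call or reproduce the tool's actual interface, which the source does not describe.

```python
# Hypothetical field names for the mandatory and optional per-100 g inputs.
MANDATORY = (
    "energy_kcal", "protein_g", "fat_g", "carbohydrate_g",
    "sugars_g", "saturated_fat_g", "sodium_mg",
)
OPTIONAL = ("fiber_g",)


def check_profile(profile):
    """Return a list of problems; an empty list means the profile looks usable."""
    problems = [f"missing: {name}" for name in MANDATORY if profile.get(name) is None]
    for name in MANDATORY + OPTIONAL:
        value = profile.get(name)
        if value is not None and value < 0:
            problems.append(f"negative: {name}")
    return problems


# Invented example values, not taken from any real product.
example = {
    "energy_kcal": 250.0, "protein_g": 6.0, "fat_g": 10.0, "carbohydrate_g": 34.0,
    "sugars_g": 12.0, "saturated_fat_g": 4.0, "sodium_mg": 300.0,
}
print(check_profile(example))
```

Fiber is left out of the example on purpose, since the tool treats it as optional.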


References (1)

Application of machine learning to predict food processing level using Open Food Facts (2025)
