Open Food Facts Dataset Overview
- Open Food Facts is a global, crowd-sourced repository aggregating detailed information on packaged foods from over 150 countries.
- Its schema includes structured nutrient data, NOVA classifications, allergen flags, and environmental scores for comprehensive food analysis.
- The dataset supports advanced ML pipelines using imputation and normalization techniques to achieve high-accuracy food processing classification.
The Open Food Facts (OFF) dataset is a globally crowd-sourced, open-access repository of packaged food products, providing granular compositional, processing, environmental, and allergenic metadata for domain-scale research in food science, nutrition, and computational modeling. It underpins large-scale studies on food processing classification, including applications of machine learning for health and environmental risk stratification, most notably in NOVA-based ultra-processed food assessment (Arora et al., 19 Dec 2025, Ispirova et al., 20 May 2025).
1. Structure, Scope, and Metadata
OFF aggregates barcoded packaged foods from over 150 countries, with database snapshots as of late 2025 containing approximately 4 million entries and supporting >40 interface languages (Ispirova et al., 20 May 2025). Focused analyses commonly rely on filtered, quality-controlled subsets; for instance, a recent large-scale ML study employed up to 900,000 products (NOVA-labeled, completeness-filtered) (Arora et al., 19 Dec 2025). Product categorization employs a multi-level (parent:child:sub-child) hierarchy: major categories include Beverages (~160,000), Snacks (~120,000), Cereals & Potato Dishes (~100,000), with other large groups in Dairy, Bakery, Confectionery, and Prepared Meals.
The OFF schema is defined by:
- Structured fields: Nutrient panel (“nutriments”, per 100g), NOVA class (“nova_group” 1–4), Nutri-Score, Eco-Score, number and tags of declared additives, declared allergens (14 EU-regulated).
- Unstructured fields: Product name, free-text ingredient list, user comments, and image URLs.
- Environmental metadata: Carbon footprint (g CO₂e/100g), Eco-Score (A–E).
Product inclusion for analytic pipelines mandates completeness of selected fields (e.g., non-null ingredient lists, NOVA labels) and compliance with standardization (all nutrients to grams/100g; categorical fields normalized to English) (Ispirova et al., 20 May 2025, Arora et al., 19 Dec 2025).
2. Compositional and Classification Fields
All nutrients are harmonized to a reference per 100 g edible portion. The main structured nutritional features include: energy (kcal), total fat, saturated fat, carbohydrate, total sugars, protein, dietary fiber, and sodium. For more detailed modeling, extended panels (up to 44 nutrients) are curated through imputation strategies.
Key categorical fields:
- NOVA class: Integer (1–4), reflecting food processing per Monteiro et al. (2019): NOVA 1 (unprocessed/minimally processed), NOVA 2 (culinary ingredients), NOVA 3 (processed), NOVA 4 (ultra-processed).
- Nutri-Score: Discrete A–E mapping, nutritional quality (A best, E worst).
- Eco-Score: Environmental impact (A least, E greatest).
Allergens are provided as binary vectors (one presence flag for each of the 14 regulated allergens), and additives as integer counts of “E-number” tagged additives.
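The allergen and additive encoding described above can be sketched as follows. This is a minimal illustration, not OFF's own code: the tag spellings in `EU_ALLERGENS` and the `en:` language prefix are assumptions about how the `allergens` and `additives_tags` fields are typically serialized, and both helper functions are hypothetical.

```python
from typing import List

# Assumed tag names for the 14 EU-regulated allergens (illustrative spellings).
EU_ALLERGENS = (
    "gluten", "crustaceans", "eggs", "fish", "peanuts", "soybeans", "milk",
    "nuts", "celery", "mustard", "sesame-seeds",
    "sulphur-dioxide-and-sulphites", "lupin", "molluscs",
)


def allergen_vector(allergens_field: str) -> List[int]:
    """Turn a comma-separated allergen string into a 14-element 0/1 vector."""
    present = set()
    for tag in allergens_field.split(","):
        tag = tag.strip().lower()
        if tag:
            # Drop a language prefix such as "en:" if one is present.
            present.add(tag.split(":", 1)[-1])
    return [1 if name in present else 0 for name in EU_ALLERGENS]


def additive_count(additives_tags: List[str]) -> int:
    """Count distinct additive tags, ignoring blanks and duplicates."""
    return len({t.strip().lower() for t in additives_tags if t.strip()})
```

Keeping the allergen order fixed in one tuple means every product maps to the same 14 columns, which is what a downstream tabular model needs.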
3. Data Cleaning, Imputation, and Preprocessing
Quality control and preprocessing protocols follow multilayered filtering and transformation:
- Completeness filtering: Only products with valid NOVA group, minimum nutrient panel, and standard-formatted categorical fields are retained.
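- Imputation: For nutrient data with missingness, imputation modes include global mean fill (for heavily incomplete wide panels) and feed-forward autoencoder reconstruction, minimizing the mean squared error $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert_2^2$, where $x_i$ is the observed nutrient vector of product $i$ and $\hat{x}_i$ its reconstruction.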
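- Scaling: Nutrient values undergo per-feature z-score normalization, $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that feature.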
- Categorical encoding: Categorical variables are one-hot encoded for most models (except CatBoost, which supports categorical input natively).
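The filtering, imputation, scaling, and encoding steps listed above can be sketched in plain Python. This is an illustrative pipeline under stated assumptions, not the cited studies' implementation: the column names in `NUTRIENTS` and the `category` field are hypothetical, only global mean fill is shown (the autoencoder variant is omitted), and a zero standard deviation is replaced by 1.0 to avoid division by zero.

```python
from statistics import fmean, pstdev

NUTRIENTS = ("energy_kcal", "sugars_g", "fat_g")  # hypothetical column names


def preprocess(rows, nutrients=NUTRIENTS, cat_field="category"):
    """Return (feature_vectors, labels, feature_names) for a list of dict rows."""
    # Completeness filtering: keep rows with a valid NOVA label and a category.
    kept = [r for r in rows
            if r.get("nova_group") in (1, 2, 3, 4) and r.get(cat_field)]
    if not kept:
        raise ValueError("no rows pass the completeness filter")

    columns = {}
    for name in nutrients:
        observed = [r[name] for r in kept if r.get(name) is not None]
        if not observed:
            raise ValueError(f"no observed values for {name!r}")
        mean = fmean(observed)
        # Global mean fill for missing values.
        filled = [r[name] if r.get(name) is not None else mean for r in kept]
        # Per-feature z-score: (x - mu) / sigma.
        mu = fmean(filled)
        sigma = pstdev(filled) or 1.0
        columns[name] = [(v - mu) / sigma for v in filled]

    # One-hot encode the categorical field with a stable level order.
    levels = sorted({r[cat_field] for r in kept})
    names = list(nutrients) + [f"{cat_field}={lv}" for lv in levels]
    features = []
    for i, r in enumerate(kept):
        numeric = [columns[n][i] for n in nutrients]
        onehot = [1.0 if r[cat_field] == lv else 0.0 for lv in levels]
        features.append(numeric + onehot)
    labels = [r["nova_group"] for r in kept]
    return features, labels, names
```

In a real train/test setting the means, standard deviations, and category levels would be computed on the training split only and then reused on held-out data.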
In text-based pipelines, free-text ingredients and product names are lowercased, punctuation-stripped, and normalized for tokenization (Ispirova et al., 20 May 2025).
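A minimal sketch of this text normalization, assuming ASCII punctuation only and whitespace tokenization (the cited pipelines may use a different tokenizer; `normalize_text` and `tokenize` are hypothetical helper names):

```python
import re
import string

# Map every ASCII punctuation character to a space so that adjacent
# words are not glued together when punctuation is removed.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_text(text):
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()


def tokenize(text):
    """Split normalized text into whitespace-delimited tokens."""
    return normalize_text(text).split()
```

For example, `normalize_text("Sugar, Palm Oil (20%); Cocoa!")` yields `"sugar palm oil 20 cocoa"`.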
4. Food Processing Classification and Modeling
OFF’s value is particularly prominent in ML-based food processing classification. NOVA labels, the dominant framework, are assigned by community contributors via ingredient-list parsing and manual rules. These labels appear consistent with academic reference data: for example, a model based on a 7-nutrient panel achieved 81.2% accuracy when validated on the FNDDS reference (Arora et al., 19 Dec 2025).
Recent research pipelines rely on multiple architectures:
- Random Forests (FoodProX): Trained on 11-nutrient panels, sometimes augmented by number_of_additives, delivering AUCs of up to 0.99 for NOVA 1.
- Gradient boosting (LightGBM, CatBoost) and plain Random Forests: Used for large-scale tabular panels (Arora et al., 19 Dec 2025).
- LLM pipelines (BERT/BioBERT): Product data are fused into a single textual prompt embedding nutrients and ingredients, transformed into a [CLS] token vector (dimension 768), and then classified by downstream tree-based or neural network models (Ispirova et al., 20 May 2025).
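The prompt-fusion step of the text-based pipelines can be illustrated as below. The template wording is an assumption made for illustration (the source does not give the exact prompt), `build_prompt` is a hypothetical helper, and the subsequent embedding into a [CLS] vector is not shown.

```python
def build_prompt(product, nutrient_fields):
    """Fuse name, nutrient values, and ingredients into one text string."""
    parts = []
    name = (product.get("product_name") or "").strip()
    if name:
        parts.append(f"Product: {name}.")
    # Skip nutrients that are missing rather than writing placeholder values.
    nutrient_bits = [
        f"{field} {product[field]:g}"
        for field in nutrient_fields
        if product.get(field) is not None
    ]
    if nutrient_bits:
        parts.append("Nutrients per 100 g: " + ", ".join(nutrient_bits) + ".")
    ingredients = (product.get("ingredients_text") or "").strip()
    if ingredients:
        parts.append(f"Ingredients: {ingredients}.")
    return " ".join(parts)
```

The resulting string would then be passed to a sentence encoder; leaving missing fields out keeps the prompt from carrying fabricated values.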
Model evaluation uses standard metrics: accuracy, precision, recall, F1-score, and the Matthews correlation coefficient (MCC), assessed with 5-fold cross-validation and stratified hold-outs. Measured by area under the ROC curve (AUC) and area under the precision-recall curve (AUP), FoodProX (with additives) achieves AUCs near 0.99 for NOVA 1 and NOVA 4 and 0.97 for NOVA 3 (Ispirova et al., 20 May 2025). LLM-based fusion slightly improves performance on borderline cases, at higher computational cost.
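Two of these metrics, accuracy and the multiclass MCC, can be computed directly from label lists. The sketch below uses the standard multiclass generalization of MCC (in terms of the number of correct predictions and the per-class true and predicted counts); it is an illustration, not the evaluation code of the cited studies.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient in [-1, 1]."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    s = len(y_true)                                      # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))      # correct predictions
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    numerator = c * s - sum(true_counts[k] * pred_counts[k] for k in true_counts)
    denominator = sqrt(
        (s * s - sum(v * v for v in pred_counts.values()))
        * (s * s - sum(v * v for v in true_counts.values()))
    )
    # By convention, return 0.0 when the denominator vanishes
    # (for example, when every prediction falls in a single class).
    return numerator / denominator if denominator else 0.0
```

MCC is useful here because NOVA classes are imbalanced: a model that always predicts the majority class can reach a non-trivial accuracy while its MCC stays at 0.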
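A continuous “FPro” processing score is defined as:

$$\mathrm{FPro} = \frac{1 - p_1 + p_4}{2},$$

with $p_1$ and $p_4$ the model-predicted probabilities for NOVA 1 and NOVA 4, respectively (Ispirova et al., 20 May 2025). The score ranges from 0 (confidently unprocessed) to 1 (confidently ultra-processed).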
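As a worked illustration, assuming the FPro score takes the form (1 − p1 + p4)/2 used in the FoodProX literature, with p1 and p4 the predicted probabilities of NOVA 1 and NOVA 4, the computation reduces to a one-line function (`fpro` is a hypothetical helper name):

```python
def fpro(p_nova1, p_nova4):
    """Continuous processing score in [0, 1] from two class probabilities."""
    for p in (p_nova1, p_nova4):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    if p_nova1 + p_nova4 > 1.0 + 1e-9:
        raise ValueError("p_nova1 + p_nova4 cannot exceed 1")
    return (1.0 - p_nova1 + p_nova4) / 2.0
```

Under this form, a product predicted as NOVA 1 with certainty scores 0, one predicted as NOVA 4 with certainty scores 1, and a product whose probability mass sits entirely on NOVA 2 and 3 scores 0.5.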
5. Statistical Distributions and Label Prevalence
OFF’s nutrient distributions across NOVA classes reveal systematic gradients:
- Energy (kcal/100g, mean ± SD): rises from 320 ± 150 (NOVA 1) to 460 ± 220 (NOVA 4).
- Total sugars: 4.5 ± 6.0 (NOVA 1), 18.0 ± 10.0 (NOVA 4).
- Total fat: 2.5 ± 3.0 (NOVA 1), 14.0 ± 8.0 (NOVA 4).
- Fiber: declines with processing, 3.5 ± 2.5 (NOVA 1), 1.2 ± 1.8 (NOVA 4).
Class distribution in quality-controlled subsets (e.g., 8-nutrient, n=681,950): NOVA 1 (26.5%), NOVA 2 (8.8%), NOVA 3 (22.0%), NOVA 4 (42.8%) (Arora et al., 19 Dec 2025). A plausible implication is that ultra-processed foods dominate the variety of catalogued packaged products; because these shares count database entries, they do not by themselves measure sales or consumption volume.
6. Environmental and Allergen Attributes
OFF enables linkage of food processing with environmental and allergen risk:
- Eco-Score: Composite index derived from carbon footprint, water use, biodiversity, country of origin, and packaging. For example, Eco-Score grade E products (>160 g CO₂e/100g) are overrepresented in NOVA 4.
- Carbon Footprint: Higher in more processed foods (NOVA 1 median ≈250 g CO₂e/100g, NOVA 4 median ≈350 g CO₂e/100g), although the correlation with processing level is weak (ρ≈0.12).
- Allergens: Prevalence increases with processing (mean 1.3 allergens per NOVA 4 product, 0.4 in NOVA 1). Most common in NOVA 4: milk (34%), gluten (25%), nuts (14%), peanuts (9%) (Arora et al., 19 Dec 2025).
7. Data Access, Licensing, and Update Model
OFF is distributed via full database dumps (JSON/CSV), SQLite, and RESTful API, publicly accessible and maintained under the Open Database License (ODbL v1.0) (Ispirova et al., 20 May 2025). The database is continuously updated, with daily propagation of new product submissions and error corrections. All code, derived models (e.g., FoodProX), and preprocessing scripts cited in the machine learning literature are openly available, exemplified by the Menichetti Lab repositories (Ispirova et al., 20 May 2025).
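A single-product lookup through the REST API might look like the sketch below. The endpoint path, the `fields` query parameter, and the `status`/`product` keys of the response reflect the public v2 API as commonly documented, but should be checked against the current API reference; the `User-Agent` string and all function names are placeholders.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

API_ROOT = "https://world.openfoodfacts.org/api/v2/product"
FIELDS = ("product_name", "nova_group", "nutrition_grades",
          "additives_n", "allergens", "ingredients_text")


def product_url(barcode, fields=FIELDS):
    """Build the lookup URL for one barcode, restricted to selected fields."""
    query = urlencode({"fields": ",".join(fields)})
    return f"{API_ROOT}/{quote(str(barcode))}.json?{query}"


def extract_fields(payload, fields=FIELDS):
    """Pick the requested fields from a decoded response, or None if not found."""
    if payload.get("status") != 1:
        return None
    product = payload.get("product", {})
    return {field: product.get(field) for field in fields}


def fetch_product(barcode, fields=FIELDS, timeout=10):
    """Perform the HTTP request (needs network access; not run on import)."""
    request = Request(
        product_url(barcode, fields),
        headers={"User-Agent": "off-example/0.1 (contact@example.org)"},
    )
    with urlopen(request, timeout=timeout) as response:
        return extract_fields(json.load(response), fields)
```

For bulk analyses of the kind described in this article, the full database dumps are the more appropriate route; the per-product endpoint suits spot checks and small samples.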
Table: Representative Data Fields in OFF
| Field Type | Example Field Name(s) | Description/Unit |
|---|---|---|
| Nutrient | protein_100g | g per 100 g edible portion |
| Processing Label | nova_group | 1–4 (NOVA class) |
| Additives | additives_n, additives_tags | Integer count; list of E-number tags |
| Allergen Flags | allergens | 14 binary indicators |
| Front-of-pack Score | nutrition_grades | Nutri-Score (A–E) |
| Environmental Metric | eco_score, carbon_footprint_100g | A–E, g CO₂e/100 g |
| Free-text | product_name, ingredients_text | Unstructured string |
The Open Food Facts dataset thus functions as a foundational resource for global food informatics, supporting scalable, transparent studies of nutrition, processing, and associated health and climate risks through integrative machine learning and data science pipelines (Arora et al., 19 Dec 2025, Ispirova et al., 20 May 2025).