
Open Food Facts Dataset

Updated 27 February 2026
  • Open Food Facts is a global open-access food database that aggregates nutritional profiles, ingredient lists, and product metadata across 80+ languages.
  • It supports research with multimodal data that, once standardized and preprocessed, feed machine learning pipelines built on random forests and transformer-based models.
  • The dataset drives studies in nutritional epidemiology, food processing classification, and sustainability assessment with reproducible metrics.

Open Food Facts (OFF) is a global, open-access database of branded food products, developed and maintained by a decentralized network of volunteers. Leveraged extensively in food informatics, OFF provides a rich, multimodal resource for large-scale, machine-learning–driven classification tasks and research on food processing, nutritional profiling, allergenic risk, sustainability labeling, and computational public health (Ispirova et al., 20 May 2025, Arora et al., 19 Dec 2025). OFF underpins state-of-the-art AI pipelines for food processing classification, including the training and benchmarking of models such as FoodProX (random forest), LightGBM, CatBoost, and transformer-based approaches using BERT/BioBERT. Its combination of structured label-derived nutritional records, ingredient lists, NOVA labels, additive flags, and associated metadata supports reproducible, scalable research across quantitative nutrition, data science, and epidemiology.

1. Database Structure and Content

OFF comprises more than three million uniquely labeled food records, collected via scan applications and web uploads, with participation spanning over 80 languages (Ispirova et al., 20 May 2025). Each entry contains:

  • Identifiers: Barcode (GTIN), OFF UUID.
  • Metadata: Product name, brand, hierarchical category structure, and region of sale.
  • Nutritional composition: Up to 150 attributes per entry, including energy (kcal), protein, total fat, saturated fat, trans fat, carbohydrates, total sugars, fiber, sodium, cholesterol, calcium, and iron—typically per 100 g or 100 mL.
  • Ingredient information: Free-text ingredient lists and additive tags (E-numbers, INS codes).
  • Labels and risk scores: NOVA class (1–4, denoting minimally processed to ultra-processed foods), Nutri-Score (A–E), Eco-Scores, and 14 EFSA allergen flags.
  • Media: Front-of-pack and packaging images.
  • Provenance: All records distributed under CC BY-SA 3.0.

Access is facilitated through CSV/JSON dumps (monthly per-country or global), a real-time REST API, and a GraphQL endpoint. The system’s crowd-sourced nature results in variable data quality, including missing/malformed fields (such as decimal notation), inconsistent additive nomenclature, and occasional misassignment of NOVA labels.
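As a rough illustration of working with such records, the sketch below pulls a fixed-order nutrient vector out of one product dictionary and maps missing or unparseable fields to `None`. The key names (`nutriments`, `energy-kcal_100g`, and so on) follow the naming commonly seen in OFF exports but are assumptions here and should be checked against the dump in use; the sample product is invented for the example.

```python
from typing import Optional

# Assumed OFF-style per-100 g nutriment keys; verify against the actual export.
NUTRIENT_KEYS = [
    "energy-kcal_100g", "fat_100g", "saturated-fat_100g",
    "carbohydrates_100g", "sugars_100g", "proteins_100g", "sodium_100g",
]


def to_float(value) -> Optional[float]:
    """Return the value as a float, or None when it is missing or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def nutrient_vector(product: dict) -> list:
    """Extract nutrients in NUTRIENT_KEYS order; absent fields become None."""
    nutriments = product.get("nutriments") or {}
    return [to_float(nutriments.get(key)) for key in NUTRIENT_KEYS]


# Invented record for illustration only.
sample = {
    "code": "0000000000000",
    "product_name": "Example biscuit",
    "nova_group": 4,
    "nutriments": {"energy-kcal_100g": 480, "fat_100g": 21.5, "sugars_100g": ""},
}
vector = nutrient_vector(sample)
missing_count = sum(1 for v in vector if v is None)
```

Keeping missing values explicit as `None`, rather than coercing them to zero, leaves the later decision to drop or impute a record open.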

2. Data Preprocessing and Quality Control

OFF data require rigorous standardization and curation prior to use in ML pipelines (Ispirova et al., 20 May 2025, Arora et al., 19 Dec 2025). Preprocessing includes:

  • Nutrient Standardization: All concentrations mapped to grams or milligrams per 100 g; decimal formats unified (e.g., “1,2 g” to “1.2 g”); and extreme values (>10× 99th percentile) dropped.
  • Ingredient List Cleaning: Punctuation, trademark markers, and clarifying parentheticals are removed; ingredients lower-cased and synonyms mapped through lookup tables (e.g., “sodium chloride” ↔ “salt”).
  • Additive and Allergen Detection: Tagging of E-numbers/INS codes; allergen declarations split into binary flags.
  • Unit Harmonization: Verification of kcal/kJ conversions.
  • Handling Missingness and Duplicates: Records missing >2 key nutrients (for 11- or 8-nutrient panels) are dropped; duplicates removed using a combination of barcode, brand, and product name.
  • NOVA Label Cleaning: Entries with conflicting NOVA class assignments are flagged or excluded.
  • Imputation Strategies: For broader panels (44 nutrients), missingness addressed via global mean or autoencoder-based imputation ($x' = (x-\mu)/\sigma$), with hyperparameters optimized via Bayesian search and training for 500 epochs with Adam optimization (Arora et al., 19 Dec 2025).
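Several of the steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the cited authors' code: the record keys (`barcode`, `brand`, `name`), the 5% energy tolerance, the choice to express amounts in grams, and the exact form of the duplicate key are all hypothetical.

```python
import re

KJ_PER_KCAL = 4.184


def parse_amount(text):
    """Parse strings such as '1,2 g' or '350 mg' into grams; None if unparseable."""
    match = re.fullmatch(r"\s*(\d+(?:[.,]\d+)?)\s*(mg|g)?\s*", str(text))
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))  # unify decimal notation
    return value / 1000.0 if match.group(2) == "mg" else value


def energy_consistent(kcal, kj, tolerance=0.05):
    """True when the declared kJ matches kcal * 4.184 within a relative tolerance."""
    expected = kcal * KJ_PER_KCAL
    return abs(kj - expected) <= tolerance * expected


def clean(records, panel, max_missing=2):
    """Drop records missing more than max_missing panel nutrients, then deduplicate."""
    seen, kept = set(), []
    for rec in records:
        missing = sum(1 for key in panel if rec.get(key) is None)
        if missing > max_missing:
            continue
        # Hypothetical composite key built from barcode, brand, and product name.
        dedup_key = (
            rec.get("barcode"),
            (rec.get("brand") or "").strip().lower(),
            (rec.get("name") or "").strip().lower(),
        )
        if dedup_key in seen:
            continue
        seen.add(dedup_key)
        kept.append(rec)
    return kept


panel = ["energy", "fat", "sugars", "protein"]
records = [
    {"barcode": "1", "brand": "Acme", "name": "Oat Bar", "energy": 400, "fat": 10, "sugars": 20, "protein": 8},
    {"barcode": "1", "brand": "ACME", "name": "oat bar", "energy": 400, "fat": 10, "sugars": 20, "protein": 8},
    {"barcode": "2", "brand": "Acme", "name": "Mystery", "energy": 100},
]
cleaned = clean(records, panel)
```

In this toy input the second record is a case-insensitive duplicate of the first, and the third is missing three of the four panel nutrients, so only one record survives.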

Outliers are typically retained when representing physiologically plausible extremes, preserving informative class boundaries.
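The extreme-value rule, global-mean imputation, and z-score standardization can be sketched per nutrient column as follows. The nearest-rank percentile, the choice to blank an extreme value rather than drop the whole record, and the order of imputation before standardization are assumptions made for the example.

```python
import statistics


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100 over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q*n/100
    return ordered[rank - 1]


def drop_extremes(column, factor=10):
    """Set values above factor * 99th percentile to None (treated as missing)."""
    observed = [v for v in column if v is not None]
    limit = factor * percentile(observed, 99)
    return [v if v is None or v <= limit else None for v in column]


def impute_and_standardize(column):
    """Fill None with the column mean, then apply x' = (x - mu) / sigma."""
    observed = [v for v in column if v is not None]
    mu = statistics.fmean(observed)
    filled = [mu if v is None else v for v in column]
    sigma = statistics.pstdev(filled)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in filled]


# 199 ordinary values and one implausible entry (for example a unit error).
sodium = [1.0] * 199 + [500.0]
trimmed = drop_extremes(sodium)
z_scores = impute_and_standardize([1.0, None, 3.0])
```

A value that is large but within ten times the 99th percentile passes through unchanged, which is in line with retaining plausible extremes.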

3. Feature Representation and Model Inputs

OFF enables multiple representational modes, targeting both structured tabular and unstructured text domains (Ispirova et al., 20 May 2025):

  • Nutrient Vectors: For models such as FoodProX and LightGBM, the input is $x\in\mathbb{R}^d$, with $d$ ranging from 7 (energy, total fat, saturated fat, carbohydrates, sugars, protein, sodium) to 44 nutrients, optionally including additive counts.
  • Textual Features: Ingredient lists and descriptions are templated and tokenized (BERT WordPiece), producing embedding vectors ($h_{CLS}\in\mathbb{R}^{768}$) via pretrained BERT and BioBERT, without fine-tuning of the transformer backbone.
  • Metadata Encoding: Category hierarchies one-hot–encoded over the 50 most prevalent labels; Nutri-Score/Eco-Score encoded ordinally for correlation analysis.
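The metadata encoding can be illustrated with a short sketch: a one-hot vector over the most frequent categories and an ordinal code for Nutri-Score. The A=1 to E=5 mapping and the all-zero vector for categories outside the top k are assumptions made for the example, not details given in the cited work.

```python
from collections import Counter

# Assumed ordinal mapping: a higher number means a poorer grade.
NUTRI_SCORE_ORDER = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}


def top_categories(categories, k=50):
    """Return the k most frequent category labels, most frequent first."""
    return [name for name, _ in Counter(categories).most_common(k)]


def one_hot(category, vocabulary):
    """One-hot encode a category; labels outside the vocabulary map to all zeros."""
    return [1 if category == name else 0 for name in vocabulary]


def encode_nutri_score(grade):
    """Map a Nutri-Score letter to its ordinal code, or None if unrecognized."""
    return NUTRI_SCORE_ORDER.get(str(grade).strip().lower())


observed = ["snacks", "snacks", "dairy", "fruits", "snacks", "dairy"]
vocabulary = top_categories(observed, k=2)
dairy_vector = one_hot("dairy", vocabulary)
rare_vector = one_hot("fruits", vocabulary)
```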

Dimensionality reduction for visualization employs UMAP to project embedding spaces to three dimensions.

4. Machine Learning Pipelines and Scoring Systems

Machine learning workflows using OFF data fall primarily into two paradigms: nutrient-driven tabular models and transformer-based text classifiers.

  • FoodProX (Random Forest): Receives 11- or 12-dimensional nutrient vectors; stratified 5-fold cross-validation; main hyperparameters: n_estimators, max_depth, criterion (gini), and class_weight (balanced). Outputs a probability vector $p=[p_1,p_2,p_3,p_4]$ over the NOVA classes. The continuous score FPro is defined as:

$$FPro_k = \frac{p_{4_k}}{p_{1_k} + p_{4_k}},\quad FPro_k\in[0,1]$$

with $FPro\approx 0$ denoting minimal processing and $FPro\approx 1$ indicating ultra-processing (Ispirova et al., 20 May 2025).
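A small sketch of the score as defined above, taking the four class probabilities in NOVA order. The function name and the guard against a zero denominator are additions for the example.

```python
def fpro(probabilities):
    """FPro = p4 / (p1 + p4) from a NOVA probability vector [p1, p2, p3, p4]."""
    if len(probabilities) != 4:
        raise ValueError("expected four NOVA class probabilities")
    p1, p4 = probabilities[0], probabilities[3]
    if p1 + p4 == 0:
        raise ValueError("FPro is undefined when p1 + p4 is zero")
    return p4 / (p1 + p4)


mostly_unprocessed = fpro([0.7, 0.1, 0.1, 0.1])   # close to 0
mostly_ultra = fpro([0.0, 0.2, 0.3, 0.5])          # exactly 1
balanced = fpro([0.5, 0.0, 0.0, 0.5])              # exactly 0.5
```

Because only $p_1$ and $p_4$ enter the ratio, two products with very different $p_2$ and $p_3$ can receive the same score.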

  • Boosted Trees (LightGBM, CatBoost): Input $\mathbf{x}\in\mathbb{R}^d$ (7–44 nutrients); hyperparameters (e.g., number of leaves, L2 penalty) tuned via RandomizedSearchCV. The multiclass log-loss objective includes regularization:

$$\mathcal{L} = -\sum_{i=1}^N \sum_{k=1}^4 y_{ik}\ln p_{ik} + \lambda\sum_{t=1}^T \|w_t\|^2 + \gamma T$$
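To make the terms of the objective concrete, the sketch below evaluates it for a toy case. It reads $T$ as the number of trees and $w_t$ as the leaf-weight vector of tree $t$, which is one interpretation of the notation rather than something the source states; with one-hot labels the inner sum reduces to the log-probability of the true class.

```python
import math


def regularized_log_loss(y_true, probs, leaf_weights, lam, gamma):
    """Multiclass log-loss plus an L2 penalty on leaf weights and a per-tree cost.

    y_true: true class indices (0..3), one per sample.
    probs: one probability row per sample.
    leaf_weights: one list of leaf weights per tree (assumed meaning of w_t).
    """
    data_term = -sum(math.log(row[label]) for label, row in zip(y_true, probs))
    l2_term = lam * sum(sum(w * w for w in tree) for tree in leaf_weights)
    complexity_term = gamma * len(leaf_weights)
    return data_term + l2_term + complexity_term


labels = [0, 3]
predicted = [[0.5, 0.2, 0.2, 0.1], [0.1, 0.1, 0.3, 0.5]]
trees = [[1.0, -1.0], [0.5]]
loss = regularized_log_loss(labels, predicted, trees, lam=0.1, gamma=0.01)
# data term = 2 * ln 2, L2 term = 0.1 * 2.25, complexity term = 0.01 * 2
```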

  • Transformer-based Models: Product texts are encoded via pretrained BERT/BioBERT, embeddings input to a random forest, XGBoost, or a shallow NN (1 hidden layer, ReLU, dropout 0.5, softmax output).
  • Multimodal Fusion (Proposed): Tabular and embedding features concatenated and classified via MLP with cross-entropy loss and L2/dropout regularization.

5. Evaluation Metrics and Performance Benchmarks

Classification is evaluated via stratified k-fold cross-validation and external validation on independent datasets (e.g., FNDDS) (Arora et al., 19 Dec 2025). Key metrics include:

  • Accuracy: $Acc = \frac{1}{N}\sum_{i=1}^N \mathbf{1}(\hat y_i = y_i)$
  • Precision, Recall, F1 per class for comprehensive classwise assessment.
  • Matthews correlation coefficient (MCC):

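$$MCC = \frac{c\,s - \sum_{k} p_k t_k}{\sqrt{\left(s^2 - \sum_{k} p_k^2\right)\left(s^2 - \sum_{k} t_k^2\right)}}$$

where $C$ is the confusion matrix, $c=\sum_k C_{kk}$ is the number of correct predictions, $s=\sum_{k,\ell} C_{k\ell}$ is the number of samples, $t_k=\sum_{\ell} C_{k\ell}$ is the number of true instances of class $k$, and $p_k=\sum_{\ell} C_{\ell k}$ is the number of predictions of class $k$.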
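A minimal sketch of both metrics computed from a confusion matrix. It uses the standard multiclass form of MCC (the one scikit-learn documents for `matthews_corrcoef`), written in terms of the number of correct predictions `c`, the sample total `s`, and the per-class true and predicted counts `t` and `p`. Rows are assumed to be true classes and columns predicted classes; the example matrix is invented.

```python
import math


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def multiclass_mcc(confusion):
    """Multiclass Matthews correlation coefficient from a square confusion matrix."""
    n = len(confusion)
    s = sum(sum(row) for row in confusion)                 # samples
    c = sum(confusion[k][k] for k in range(n))             # correct predictions
    t = [sum(confusion[k]) for k in range(n)]              # true count per class
    p = [sum(confusion[r][k] for r in range(n)) for k in range(n)]  # predicted count
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    return numerator / denominator if denominator else 0.0


toy = [[3, 1], [2, 4]]
toy_accuracy = accuracy(toy)
toy_mcc = multiclass_mcc(toy)
perfect_mcc = multiclass_mcc([[5, 0, 0], [0, 5, 0], [0, 0, 5]])
```

For two classes this expression reduces to the familiar binary MCC, which makes it easy to check against a hand calculation.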

Performance highlights:

Model | AUC (NOVA 1–4) | Accuracy / F1 / MCC (LightGBM)
FoodProX (11 nutrients) | 0.988–0.948 | -
FoodProX (+additives) | 0.993–0.980 | -
LightGBM (8 nutrients) | - | 0.85
LightGBM (44 nutrients, mean impute) | - | 0.80
LightGBM (44 nutrients, AE impute) | - | 0.68
BERT+RF | 0.994–0.978 | -
BioBERT+NN | 0.995–0.978 | -

Nutrient-based tree models with as few as 7–8 nutrients reach 80–85% accuracy, and both the random forest and the transformer-based pipelines discriminate strongly between minimally and ultra-processed foods.

6. Exploratory Analyses and Research Applications

OFF enables large-scale investigations into the intersection of food composition, processing, health, and sustainability (Arora et al., 19 Dec 2025):

  • Associations with Nutrition and Environment: Higher NOVA classes are associated with poorer Nutri-Score grades (Spearman’s ρ = +0.40), correlate weakly with lower Eco-Scores (ρ = –0.06), and show higher carbon footprints (ρ = +0.12, Kruskal–Wallis p < 0.05).
  • Allergen Distribution: NOVA 4 (ultra-processed) items are more likely to contain gluten (25%) and milk (34%), affecting sensitive populations.
  • Category Structure: Ultra-processed products are dominated by cakes, snacks, and biscuits; minimally processed foods cluster in fruits, dairy, and fermented foods. The number of additives and the number of ingredients are both strongly explanatory for NOVA status (AUC > 0.9 in some cases).
  • Semantic Embedding Trajectories: LLM-based embeddings reveal categorical progressions in processing—e.g., from raw to processed onion preparations via UMAP visualizations.
  • Continuous Processing Scoring: FPro distributions show wide intra-category variance (e.g., breakfast cereals), enabling finer stratification for epidemiological or pricing analysis.
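Rank correlations such as those reported above can be computed with a short routine: Spearman's ρ is the Pearson correlation of average ranks, which handles the heavy ties that ordinal labels produce. The toy data and the A=1 to E=5 coding of Nutri-Score are invented for the example and do not reproduce the reported values.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for index in order[i:j + 1]:
            ranks[index] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


nova_class = [1, 1, 2, 3, 4, 4]       # toy NOVA labels
nutri_ordinal = [1, 2, 2, 3, 4, 5]    # toy Nutri-Score codes, A=1 .. E=5
rho = spearman(nova_class, nutri_ordinal)
```

The sign of ρ depends on how the grades are coded, so the encoding direction should be stated alongside any reported coefficient.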

A deployed web tool (cosylab.iiitd.edu.in/foodlabel/) enables on-the-fly NOVA prediction from nutrient entries, leveraging the LGBM classifier and providing explanatory output for users (Arora et al., 19 Dec 2025).

7. Limitations, Challenges, and Prospective Directions

Notwithstanding its breadth, OFF’s open schema introduces challenges: inconsistent user-submitted data, non-standardized ingredient and additive entries, and occasional NOVA mislabeling all necessitate extensive curation (Ispirova et al., 20 May 2025). Nutrient records sometimes mix per-serving and per-100 g units, and language tags and product naming are inconsistent across localizations. Missingness for micronutrients is severe (up to 99%). Current published work excludes image data, although OFF’s inventory contains packaging and product photos suitable for future multimodal classifier fusion via convolutional neural networks.

Future research directions include deployment of fully multimodal AI models, refinement of continuous processing metrics (such as FPro), and international harmonization of NOVA labeling practices. The scale and diversity of OFF position it uniquely for research at the interface of nutritional epidemiology, computational public health, environmental impact assessment, and food system policy (Ispirova et al., 20 May 2025, Arora et al., 19 Dec 2025).
