Uni-Food: Multimodal Food Analysis Dataset

Updated 24 November 2025
  • Uni-Food dataset is a comprehensive multimodal food dataset with 100K curated samples annotated for classification, ingredient recognition, recipe generation, and nutrition estimation.
  • It implements rigorous quality control using ChatGPT-4V and USDA spot-checking to ensure precise, balanced annotations across food categories and nutritional metrics.
  • The dataset supports advanced tasks such as visual question answering, large multi-modal model pretraining, and continual learning for robust food analysis research.

The Uni-Food dataset is a large-scale, multimodal food dataset that provides unified, comprehensive annotations for core food analysis tasks, including category classification, ingredient recognition, recipe generation, and ingredient-level nutrition estimation. Developed to address the limitations of prior food datasets—which either lack nutritional annotations or suffer from imbalanced task coverage—Uni-Food delivers nearly 100,000 curated samples, each integrating high-quality imagery, precise compositional labels, and detailed cooking instructions. It has become a central benchmark supporting advanced learning paradigms, notably visual question answering (VQA), large multi-modal model (LMM) pretraining, and multi-task or continual food analysis research (Jiao et al., 17 Jul 2024, Wu et al., 17 Nov 2025).

1. Composition, Annotation Protocols, and Dataset Structure

Uni-Food consists of approximately 100,000 food images, each sample paired with four aligned modalities (a sketch of a per-sample record follows the list):

  • A discrete food category label representing one of over 100 balanced classes, where no single category exceeds 5% of the total data (a cap that limits long-tailed class imbalance).
  • A complete ingredient list with quantities.
  • Free-form, full-text recipe instructions.
  • Ingredient-level nutritional values (mass, energy/kcal, carbohydrates, protein, fat), with dish-level aggregations.
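For orientation, a per-sample record might be represented as follows; the field names and types are illustrative assumptions rather than the dataset's released schema.

```python
# Illustrative Uni-Food record: one image aligned with four annotation
# modalities. Field names are assumptions, not the official schema.
from dataclasses import dataclass, field

@dataclass
class IngredientEntry:
    name: str             # e.g. "olive oil"
    quantity: str         # free-text quantity, e.g. "2 tbsp"
    mass_g: float         # ingredient-level nutrition values
    energy_kcal: float
    carbs_g: float
    protein_g: float
    fat_g: float

@dataclass
class UniFoodSample:
    image_path: str                                   # dish image
    category: str                                     # one of >100 balanced classes
    ingredients: list[IngredientEntry] = field(default_factory=list)
    recipe_instructions: str = ""                     # free-form cooking steps
    # dish-level nutrition is obtained by aggregating the ingredient rows
```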

Source images and text are extracted from Recipe1M+, ensuring data consistency and avoiding synthetic merges. Initial text annotations are passed through ChatGPT-4V, which incorporates image and quantity cues to generate per-ingredient macro-nutrient and energy values; these are then post-processed and verified (spot-checked against USDA FoodData Central) to achieve a labeling error rate below 5%. A manually curated “gold standard” test subset (~10,000 images) is filtered for annotation completeness and class uniformity (Jiao et al., 17 Jul 2024).
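The released annotation pipeline is not reproduced here; as a hedged sketch, a USDA spot-check step could look like the following, where the reference lookup table, name-based matching, and tolerance are all hypothetical.

```python
# Hypothetical spot-check: flag generated per-ingredient energy values that
# deviate from a USDA-derived reference by more than a relative tolerance.
# The lookup table and the 20% tolerance are illustrative assumptions.
def spot_check_energy(generated_kcal: dict[str, float],
                      usda_reference_kcal: dict[str, float],
                      rel_tolerance: float = 0.20) -> list[str]:
    flagged = []
    for ingredient, kcal in generated_kcal.items():
        ref = usda_reference_kcal.get(ingredient)
        if ref is None or ref <= 0:
            continue  # no usable reference entry; skip rather than guess
        if abs(kcal - ref) / ref > rel_tolerance:
            flagged.append(ingredient)
    return flagged
```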

Each recipe averages 7.4 ingredients. Dish-level nutrition (mean ± σ) is approximately 550 ± 120 g mass, 430 ± 85 kcal energy, 52 ± 18 g carbohydrate, 20 ± 8 g protein, and 18 ± 9 g fat per sample.
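The dish-level figures above are aggregations of the ingredient-level rows; a minimal sketch of that aggregation, reusing the illustrative field names from the earlier record sketch, is:

```python
# Sum ingredient-level nutrition rows into dish-level totals
# (field names follow the illustrative IngredientEntry sketch above).
NUTRIENT_FIELDS = ("mass_g", "energy_kcal", "carbs_g", "protein_g", "fat_g")

def aggregate_dish_nutrition(ingredients) -> dict[str, float]:
    totals = {k: 0.0 for k in NUTRIENT_FIELDS}
    for ing in ingredients:
        for k in NUTRIENT_FIELDS:
            totals[k] += float(getattr(ing, k))
    return totals
```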

2. Dataset Splits, Scale, and Quality Control

The canonical split is as follows:

  • Training set: 80,000 samples (80%)
  • Validation set: 10,000 samples (10%)
  • Gold-standard test set: 10,000 samples (10%, expertly curated) (Jiao et al., 17 Jul 2024)

A secondary split protocol used in continual learning features 96,270 images for training and 3,254 for testing, with sequential task exposures (ingredient recognition, recipe generation, nutrition estimation). The data structure supports VQA, with each image paired with appropriate context prompts and ground-truth answers for targeted tasks (Wu et al., 17 Nov 2025).
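A hedged sketch of how the two protocols might be wired up in code follows; the split file names, JSON layout, and task-order container are assumptions, not released artifacts.

```python
# Hypothetical split loading plus the continual-learning task order.
# File names and JSON layout are assumptions, not released files.
import json

SPLIT_FILES = {
    "train": "splits/train.json",      # 80,000 samples (canonical split)
    "val": "splits/val.json",          # 10,000 samples
    "test": "splits/test_gold.json",   # 10,000 gold-standard samples
}

CONTINUAL_TASK_ORDER = (
    "ingredient_recognition",
    "recipe_generation",
    "nutrition_estimation",
)

def load_split(name: str) -> list[dict]:
    with open(SPLIT_FILES[name], "r", encoding="utf-8") as f:
        return json.load(f)
```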

Images have native resolutions ranging from 480×640 up to 1024×1024. Ingredient lists are typically 5–15 tokens long; recipe instructions span 50–200 tokens.

Quality control leverages parser-based post-processing and USDA spot-checking, with an explicit focus on both nutrition accuracy and distributional balancing.

3. Supported Tasks and VQA Framework

Uni-Food enables a breadth of multimodal tasks, unified under a VQA paradigm (a conversion sketch follows the list):

  • Category Classification: Map an image (with optional prompt) to a single food category.
  • Ingredient Recognition: Extract the complete set of ingredients for a dish given an image and the prompt “List the ingredients in this dish.”
  • Recipe Generation: Generate cooking instructions in response to an image and prompt “Give me the full cooking instructions.”
  • Nutrition Estimation: Predict numerical nutrition labels (e.g., calories, carbs, protein, fat) from a dish image, answering “What is the nutritional breakdown?” (Wu et al., 17 Nov 2025).
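A minimal sketch of turning one record into task-specific VQA triples is shown below, reusing the illustrative schema and aggregation helper from the earlier sketches; the classification prompt wording and answer formatting are assumptions, while the other prompts follow the list above.

```python
# Convert one sample into (image, question, answer) VQA triples per task.
# The classification prompt and answer formatting are assumptions; the
# ingredient, recipe, and nutrition prompts mirror those quoted above.
def to_vqa_triples(sample: "UniFoodSample") -> list[dict]:
    ingredients = ", ".join(ing.name for ing in sample.ingredients)
    n = aggregate_dish_nutrition(sample.ingredients)  # earlier sketch
    return [
        {"image": sample.image_path,
         "question": "What category of food is shown in this dish?",  # assumed wording
         "answer": sample.category},
        {"image": sample.image_path,
         "question": "List the ingredients in this dish.",
         "answer": ingredients},
        {"image": sample.image_path,
         "question": "Give me the full cooking instructions.",
         "answer": sample.recipe_instructions},
        {"image": sample.image_path,
         "question": "What is the nutritional breakdown?",
         "answer": (f"{n['mass_g']:.0f} g, {n['energy_kcal']:.0f} kcal, "
                    f"{n['carbs_g']:.0f} g carbs, {n['protein_g']:.0f} g protein, "
                    f"{n['fat_g']:.0f} g fat")},
    ]
```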

Comprehensive annotations further allow for tasks such as segmentation and fine-grained class analysis, though these are not specifically detailed in current evaluation protocols. The dataset has served as the core evaluation resource for cross-task learning, multitask vision-language models, and continual learning frameworks (Jiao et al., 17 Jul 2024, Wu et al., 17 Nov 2025).

4. Positioning Relative to Prior Datasets

Uni-Food addresses critical gaps left by established food datasets. The comparison below gives the number of annotated samples per task (adapted from Table 1 of Jiao et al., 17 Jul 2024):

Dataset          Category    Ingredient   Recipe    Nutrition
Food-101         101 K       0            0         0
Food2K           1,036 K     0            0         0
Vireo Food-251   169 K       169 K        0         0
Nutrition5K      0           20 K         0         20 K
Recipe1M         887 K       887 K        887 K     0
Recipe1M+        13.7 M      13.7 M       887 K     42 K
Uni-Food         100 K       100 K        100 K     100 K

Whereas Recipe1M/Recipe1M+ contain abundant images and text but only limited, sparse nutrition annotation, Uni-Food provides roughly 100K samples that are complete across all four annotation types, directly supporting balanced multi-task supervision and reducing task imbalance.

5. Scientific and Technical Utility

Uni-Food is designed expressly for holistically benchmarking large multi-modal models (LMMs) on food understanding, supporting explicit evaluation of classification, detection, text generation, and regression tasks in a unified protocol. This enables rigorous cross-task model validation and provides a platform for research into catastrophic forgetting and task interference, as demonstrated in (Wu et al., 17 Nov 2025).

Experiments using Uni-Food validate advanced architectures such as RoDE (Linear Rectified Mixture of Diverse Experts), enabling:

  • Improved multi-task performance over single-LoRA and softmax-MoE baselines: e.g., ingredient IoU 26.86 (vs. 22.59 base), recipe BLEU 9.66 (vs. 7.59), nutrition pMAE 52.58 (vs. 53.11 base; lower is better) (Jiao et al., 17 Jul 2024); these metrics are sketched after this list.
  • Task-specific VQA prompts and protocol standardization.
  • Application as a continual learning benchmark, leveraging sequential exposure to core food-related tasks (Wu et al., 17 Nov 2025).
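The ingredient IoU and nutrition pMAE figures quoted above can be read against the following hedged metric sketches; these are common definitions and may differ in detail from the exact formulations used in the cited papers (pMAE is taken here as mean absolute error expressed as a percentage of the ground-truth value).

```python
# Hedged metric sketches; common definitions, not necessarily the exact
# formulations used in the cited papers.
def ingredient_iou(predicted: set[str], ground_truth: set[str]) -> float:
    """Set-level intersection-over-union of predicted vs. true ingredients."""
    if not predicted and not ground_truth:
        return 1.0
    return len(predicted & ground_truth) / len(predicted | ground_truth)

def pmae(pred: list[float], true: list[float]) -> float:
    """Percentage MAE: absolute error relative to the ground truth, in %."""
    errs = [abs(p - t) / t for p, t in zip(pred, true) if t > 0]
    return 100.0 * sum(errs) / len(errs) if errs else 0.0
```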

Error analysis reveals predominant challenges in visually similar ingredient discrimination (e.g., “Swiss cheese” vs. “Provolone”) and estimating nutrition in compositionally complex or mixed dishes.

6. Access, Licensing, and Best Practices

Upon publication, Uni-Food will be openly released at https://github.com/YourOrg/Uni-Food under a CC BY-NC 4.0 license, accompanied by pretrained RoDE weights for LLaVA-7B (Jiao et al., 17 Jul 2024). Standardized task prompts and code for evaluation will be provided; best practices recommend freezing backbone encoders and finetuning only the LoRA expert layers and routers. Modelers are cautioned to monitor validation metrics for all tasks to prevent over-specialization.
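As a minimal sketch of that recommendation, assuming a PyTorch model whose LoRA expert and router parameters are identifiable by name (an assumption about the implementation, not the released RoDE code), one could freeze the backbone as follows.

```python
# Freeze everything except LoRA expert layers and routing modules, assuming
# their parameter names contain "lora" or "router".
import torch

def freeze_backbone_keep_adapters(model: torch.nn.Module) -> list[str]:
    trainable = []
    for name, param in model.named_parameters():
        keep = "lora" in name.lower() or "router" in name.lower()
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable
```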

No additional data augmentation is mandated beyond default model preprocessing, and canonical splits are provided for both standard and continual learning regimes. Annotation and data generation protocols follow those in the source data and specified pipeline, ensuring verifiability and consistency.

7. Summary and Outlook

Uni-Food establishes a new standard for multimodal food datasets by delivering unified category, ingredient, recipe, and nutrition annotations for 100,000 natural food images, with robust quality control and balanced distribution. It directly addresses the limitations of Recipe1M-series and Nutrition5K/UMDFood-90k in annotation completeness and task coverage. Uni-Food serves as the empirical backbone for state-of-the-art visual-language modeling, especially for the food domain’s multitask, cross-modal, and continual learning challenges. Its release is expected to support further advances in automated nutritional profiling, health-related behavior modeling, and the generalizability of multi-task multi-modal learning frameworks (Jiao et al., 17 Jul 2024, Wu et al., 17 Nov 2025).
