Essay on "Inverse Cooking: Recipe Generation from Food Images"
The paper "Inverse Cooking: Recipe Generation from Food Images" by Amaia Salvador et al. introduces a novel approach to generating recipes from food images. This method involves predicting ingredients and subsequently generating cooking instructions using both the image and the predicted ingredients. The research presents a significant progression from prior methods which primarily focused on image-to-recipe retrieval. The image-to-recipe problem is recast as a conditional generation task, which potentially overcomes the limitations of dataset constraints inherent in retrieval-based approaches.
The authors' model consists of two main components: an ingredient predictor that treats the ingredients as an unordered set, and an instruction generator that uses a transformer to produce cooking steps conditioned on both the image and the predicted ingredients. The set representation lets the model exploit dependencies between ingredients, such as co-occurrence patterns, without enforcing any particular order, an improvement over approaches that either impose an arbitrary sequence on the ingredient list or predict each ingredient independently and thus fail to capture these dependencies.
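To make the two-stage design concrete, here is a minimal PyTorch-style sketch of the pipeline. It is illustrative, not the authors' code: the module names, dimensions, and vocabulary sizes are placeholders, and the ingredient predictor is simplified to independent sigmoid scores, whereas the paper instead decodes the ingredient set with a transformer to capture co-occurrence.

```python
# Minimal sketch of the two-stage pipeline (illustrative, not the authors' code).
import torch
import torch.nn as nn

class IngredientPredictor(nn.Module):
    """Predicts an unordered ingredient set from pooled image features.
    Simplified here to independent per-ingredient sigmoid scores; the paper
    decodes the set with a transformer to model ingredient dependencies."""
    def __init__(self, img_dim=512, num_ingredients=1488):  # sizes are placeholders
        super().__init__()
        self.classifier = nn.Linear(img_dim, num_ingredients)

    def forward(self, img_feats):
        # img_feats: (batch, img_dim) pooled CNN features
        return torch.sigmoid(self.classifier(img_feats))

class InstructionGenerator(nn.Module):
    """Transformer decoder that generates instruction tokens conditioned on
    a memory of image and ingredient embeddings."""
    def __init__(self, vocab_size=23231, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, cond):
        # tokens: (batch, seq) previously generated word ids
        # cond:   (batch, n_cond, d_model) image + ingredient embeddings
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.out(self.decoder(x, cond, tgt_mask=mask))

# Example forward pass with random features (batch of 2 images)
img_feats = torch.randn(2, 512)
ingr_probs = IngredientPredictor()(img_feats)   # (2, 1488) ingredient scores
cond = torch.randn(2, 12, 512)                  # fused conditioning memory
tokens = torch.randint(0, 23231, (2, 20))
logits = InstructionGenerator()(tokens, cond)   # (2, 20, 23231) token logits
```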
Evaluated on the large-scale Recipe1M dataset, the paper reports several key findings. First, the proposed approach outperforms retrieval-based systems by a significant margin under human judgment, with gains in both ingredient prediction accuracy and the quality of generated recipes. Specifically, the set-transformer model achieved superior ingredient prediction, reaching an Intersection over Union (IoU) of 32.11% and an F1 score of 48.61%. This is a notable improvement over the retrieval baseline and highlights the value of modeling ingredient dependencies while avoiding the bias that order-specific models impose. In terms of generating coherent recipes that human raters prefer, the proposed system likewise advances noticeably over previous methods.
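For concreteness, both reported metrics compare the predicted ingredient set against the ground-truth set: IoU is the size of the intersection divided by the size of the union, and F1 is the harmonic mean of precision and recall. A small self-contained sketch, with ingredient names invented purely for illustration:

```python
# Set-level IoU and F1 for ingredient prediction (ingredient names are made up).
def set_iou(pred, true):
    pred, true = set(pred), set(true)
    return len(pred & true) / len(pred | true) if pred | true else 1.0

def set_f1(pred, true):
    pred, true = set(pred), set(true)
    tp = len(pred & true)  # true positives: correctly predicted ingredients
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

pred = {"flour", "sugar", "egg", "butter"}
true = {"flour", "sugar", "egg", "milk"}
print(set_iou(pred, true))  # 3/5 = 0.6
print(set_f1(pred, true))   # precision = recall = 3/4, so F1 = 0.75
```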
Several attention strategies for jointly incorporating the image and the ingredients into recipe generation were explored. Among these, concatenated attention, which attends over the image and ingredient representations together, performed best, as it gives the model the most flexibility in deciding how to fuse the visual and ingredient modalities.
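A rough sketch of what such concatenated conditioning can look like: the decoder's cross-attention is computed over the concatenation of image features and ingredient embeddings, so a single attention step decides how much to draw from each modality. The shapes and the use of nn.MultiheadAttention here are assumptions for illustration, not the paper's exact implementation.

```python
# Concatenated conditioning: one cross-attention over both modalities at once.
import torch
import torch.nn as nn

d_model = 512
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

batch = 2
img_feats = torch.randn(batch, 49, d_model)   # e.g., a 7x7 CNN feature map
ingr_embs = torch.randn(batch, 10, d_model)   # embeddings of predicted ingredients
queries   = torch.randn(batch, 20, d_model)   # decoder states for instruction tokens

# Concatenate the two modalities along the sequence dimension; the attention
# weights then determine per-token how much each modality contributes.
memory = torch.cat([img_feats, ingr_embs], dim=1)
fused, weights = attn(queries, memory, memory)
print(fused.shape)  # torch.Size([2, 20, 512])
```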
The paper argues that the inverse cooking task introduces new challenges, such as the high intra-class variability of food images and the substantial visual transformations that ingredients undergo during cooking. By leveraging large-scale datasets and focusing on structured learning of ingredients and cooking procedures, this work paves the way for more robust computational approaches to food understanding.
Although the presented model is trained and evaluated on a single dataset with its own preprocessing pipeline, the potential implications for the broader field of AI-driven culinary applications are far-reaching. Future work could explore generalization to unseen data, integrate additional modalities such as the textual descriptions accompanying social media food posts, or refine the ingredient model to account for regional differences and more nuanced taste profiles.
Overall, "Inverse Cooking: Recipe Generation from Food Images" represents a well-executed research effort that integrates computer vision and natural language processing to tackle a practical problem with sophisticated solutions. Its implications for the development of automated cooking assistant systems and augmented reality applications in culinary arts are plausible and potentially transformative for the field.