FoodLMM: A Versatile Food Assistant using Large Multi-modal Model (2312.14991v2)
Abstract: Large Multi-modal Models (LMMs) have made impressive progress in many vision-language tasks. Nevertheless, the performance of general LMMs in specific domains is still far from satisfactory. This paper proposes FoodLMM, a versatile food assistant based on LMMs with various capabilities, including food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation, and multi-round conversation. To enable FoodLMM to handle tasks beyond pure text output, we introduce a series of novel task-specific tokens and heads, which allow the model to predict food nutritional values and multiple segmentation masks. We adopt a two-stage training strategy. In the first stage, we use multiple public food benchmarks for multi-task learning under the instruction-following paradigm. In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain. Our fine-tuned FoodLMM achieves state-of-the-art results across several food benchmarks. We will make our code, models and datasets publicly available.
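The idea of task-specific tokens paired with heads can be illustrated with a minimal sketch. It is not the paper's implementation: the token names (`<calories>`, `<fat>`), the two-dimensional hidden states, and the linear head weights below are all hypothetical placeholders chosen for illustration. The sketch only shows the general pattern, in which the hidden state at the position of a special token is passed to a small head that regresses a number instead of emitting text.

```python
from typing import Dict, List, Sequence, Tuple

# A head is a (weights, bias) pair for one linear layer. Hypothetical values only.
Head = Tuple[Sequence[float], float]


def linear_head(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Map one hidden-state vector to a scalar with a single linear layer."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def decode_nutrition(
    tokens: List[str],
    hidden_states: List[Sequence[float]],
    heads: Dict[str, Head],
) -> Dict[str, float]:
    """For every task token in the output, regress a value from its hidden state."""
    values: Dict[str, float] = {}
    for token, hidden in zip(tokens, hidden_states):
        if token in heads:  # ordinary text tokens are skipped
            weights, bias = heads[token]
            values[token] = linear_head(hidden, weights, bias)
    return values


# Toy output sequence: the token names are assumptions, not taken from the paper.
tokens = ["The", "dish", "has", "<calories>", "and", "<fat>"]
hidden_states = [
    [0.0, 0.0], [0.0, 0.0], [0.0, 0.0],  # text tokens
    [2.0, 1.0],                          # hidden state at <calories>
    [0.0, 0.0],                          # text token
    [0.5, 0.5],                          # hidden state at <fat>
]
heads = {
    "<calories>": ([100.0, 50.0], 10.0),
    "<fat>": ([4.0, 6.0], 0.0),
}

result = decode_nutrition(tokens, hidden_states, heads)
print(result)
```

In a real model the hidden states would come from the language model's final layer and the heads would be trained jointly with it; a segmentation head would follow the same pattern but decode a mask rather than a scalar.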