Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning (2402.15761v3)
Abstract: Food classification is the foundation for developing food vision tasks and plays a key role in the burgeoning field of computational nutrition. Because food images are complex and require fine-grained classification, recent academic research mainly modifies Convolutional Neural Networks (CNNs) and/or Vision Transformers (ViTs) to perform food category classification. However, to learn fine-grained features, a CNN backbone needs additional structural design, whereas a ViT, which contains the self-attention module, has increased computational complexity. In recent months, a new Structured State Space sequence (S4) model extended with a selection mechanism and scan-based computation (S6), colloquially termed Mamba, has demonstrated superior performance and computational efficiency compared to the Transformer architecture. The VMamba model, which incorporates the Mamba mechanism into image tasks (such as classification), currently establishes the state-of-the-art (SOTA) on the ImageNet dataset. In this research, we introduce an academically underestimated food dataset, CNFOOD-241, and pioneer the integration of a residual learning framework within the VMamba model to concurrently harness both the global and local state features inherent in the original VMamba architectural design. The results show that VMamba surpasses current SOTA models in fine-grained and food classification. The proposed Res-VMamba further improves the classification accuracy to 79.54% without pretrained weights. Our findings show that the proposed methodology establishes a new benchmark for SOTA performance in food recognition on the CNFOOD-241 dataset. The code can be obtained on GitHub: https://github.com/ChiShengChen/ResVMamba.
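To make the two ideas named in the abstract concrete, the following is a minimal toy sketch, not the authors' Res-VMamba architecture. It assumes a scalar input sequence and a scalar hidden state, and it shows (a) a selective scan in which the step size depends on the current input and (b) a residual connection that adds the input back onto the scan output. The function names `selective_scan` and `residual_block` and the weights `a`, `w_delta`, `w_b`, and `w_c` are hypothetical, chosen only for illustration.

```python
import math


def selective_scan(xs, a=-1.0, w_delta=0.5, w_b=1.0, w_c=1.0):
    """Toy 1-D selective state space scan (scalar state, scalar input).

    The step size delta is computed from the input, which is the
    "selection" idea. All weights are made-up illustrative constants.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = math.log1p(math.exp(w_delta * x))  # softplus keeps delta > 0
        a_bar = math.exp(delta * a)                # discretized state decay
        b_bar = delta * w_b                        # simplified input scaling
        h = a_bar * h + b_bar * x                  # recurrent state update
        ys.append(w_c * h)                         # readout of the state
    return ys


def residual_block(xs, **kw):
    """Residual learning: output = input + F(input), with F the toy scan."""
    return [x + y for x, y in zip(xs, selective_scan(xs, **kw))]


if __name__ == "__main__":
    print(residual_block([1.0, 2.0, 3.0]))
```

With the readout weight set to zero (`w_c=0.0`), the block reduces to the identity mapping, which is the property that makes residual paths easy to optimize. In an image model the scan would run over flattened patch sequences with vector-valued states, which this sketch does not attempt.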