Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization (2403.08239v1)
Abstract: Robotic state recognition of the environment and objects is generally formulated as a classification problem that judges the current state at a single point in time. In cooking, however, the state of food changes continuously, and these changes must be captured over time rather than at isolated moments. Moreover, such state changes are complex and difficult to describe through manual programming. We therefore propose a method that recognizes continuous state changes of food for cooking robots through spoken language, using pre-trained large-scale vision-language models. Because these models can compute the similarity between images and texts continuously over time, they can track the state of food throughout cooking. We further show that weighting each text prompt, fitting the resulting similarity changes to a sigmoid function, and tuning the weights by black-box optimization yields more accurate and robust continuous state recognition. We demonstrate the effectiveness and limitations of this method on recognizing water boiling, butter melting, egg cooking, and onion stir-frying.
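The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-prompt image-text similarity series are synthetic stand-ins for CLIP-style scores, the prompt names and noise levels are invented, and a plain random search stands in for the genetic algorithm the authors run via DEAP. The core idea is shown as stated: combine per-prompt similarities with learned weights, normalize, fit a sigmoid over time, and score weight candidates by the fit residual.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
T = np.linspace(0.0, 1.0, 100)  # normalized cooking time

def sigmoid(t, a, b):
    """Sigmoid in time with slope a and transition point b."""
    return 1.0 / (1.0 + np.exp(-a * (t - b)))

# Synthetic similarity time series for two hypothetical text prompts,
# e.g. "boiling water" vs. "a pot": prompt 0 tracks the state change
# (sigmoid-shaped + noise), prompt 1 is uninformative noise.
sims = np.stack([
    0.20 + 0.10 * sigmoid(T, 20.0, 0.5) + 0.005 * rng.standard_normal(T.size),
    0.25 + 0.005 * rng.standard_normal(T.size),
])

def fit_residual(w):
    """Weight the prompt similarities, min-max normalize, and score the
    candidate weights by the mean squared residual of a sigmoid fit."""
    s = w @ sims
    s = (s - s.min()) / (s.max() - s.min() + 1e-9)
    try:
        popt, _ = curve_fit(sigmoid, T, s, p0=[10.0, 0.5], maxfev=2000)
    except RuntimeError:
        return np.inf
    return float(np.mean((s - sigmoid(T, *popt)) ** 2))

# Black-box optimization of prompt weights: lower residual means the
# weighted similarity follows a sigmoid-shaped state change more cleanly.
best_w, best_r = None, np.inf
for _ in range(300):
    w = rng.uniform(-1.0, 1.0, size=2)
    r = fit_residual(w)
    if r < best_r:
        best_w, best_r = w, r
```

Under these assumptions, the optimization should assign most of the weight to the informative prompt, since the uninformative one only adds noise to the sigmoid fit; the fitted transition point `b` then localizes the state change (e.g. when the water starts boiling) in time.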
- R. T. Chin et al., “Model-Based Recognition in Robot Vision,” ACM Computing Surveys, vol. 18, no. 1, pp. 67–108, 1986.
- B. Quintana et al., “Door detection in 3D coloured point clouds of indoor environments,” Automation in Construction, vol. 85, pp. 146–166, 2018.
- K. Kawaharazuka et al., “VQA-based Robotic State Recognition Optimized with Genetic Algorithm,” in Proceedings of the 2023 IEEE International Conference on Robotics and Automation, 2023, pp. 8306–8311.
- M. Beetz et al., “Robotic roommates making pancakes,” in Proceedings of the 2011 IEEE-RAS International Conference on Humanoid Robots, 2011, pp. 529–536.
- K. Junge et al., “Improving Robotic Cooking Using Batch Bayesian Optimization,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 760–765, 2020.
- R. Paul, “Classifying cooking object’s state using a tuned VGG convolutional neural network,” arXiv preprint arXiv:1805.09391, 2018.
- A. B. Jelodar et al., “Identifying Object States in Cooking-Related Images,” arXiv preprint arXiv:1805.06956, 2018.
- M. S. Sakib, “Cooking Object’s State Identification Without Using Pretrained Model,” arXiv preprint arXiv:2103.02305, 2021.
- K. Takata et al., “Efficient Task/Motion Planning for a Dual-arm Robot from Language Instructions and Cooking Images,” in Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2022, pp. 12058–12065.
- F. Li et al., “Vision-Language Intelligence: Tasks, Representation Learning, and Large Models,” arXiv preprint arXiv:2203.01922, 2022.
- K. Kawaharazuka et al., “Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors (in press),” in Proceedings of the 2023 IEEE-RAS International Conference on Humanoid Robots, 2023.
- A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” arXiv preprint arXiv:2103.00020, 2021.
- R. Girdhar et al., “ImageBind: One Embedding Space To Bind Them All,” in Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- P. Wang et al., “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework,” arXiv preprint arXiv:2202.03052, 2022.
- N. Kanazawa et al., “Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot (in press),” in Proceedings of the 18th International Conference on Intelligent Autonomous Systems, 2023.
- F. Fortin et al., “DEAP: Evolutionary Algorithms Made Easy,” Journal of Machine Learning Research, vol. 13, pp. 2171–2175, 2012.
- T. B. Brown et al., “Language Models are Few-Shot Learners,” arXiv preprint arXiv:2005.14165, 2020.
- Y. Yuan et al., “An Adaptive Divergence-Based Non-Negative Latent Factor Model,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 53, no. 10, pp. 6475–6487, 2023.