Learning-To-Rank Approach for Identifying Everyday Objects Using a Physical-World Search Engine (2312.15844v1)

Published 26 Dec 2023 in cs.RO, cs.CL, and cs.CV

Abstract: Domestic service robots offer a solution to the increasing demand for daily care and support. A human-in-the-loop approach that combines automation and operator intervention is considered a realistic path to their use in society. We therefore focus on the task of retrieving target objects from open-vocabulary user instructions in a human-in-the-loop setting, which we define as the learning-to-rank physical objects (LTRPO) task. For example, given the instruction "Please go to the dining room which has a round table. Pick up the bottle on it," the model is required to output a ranked list of candidate target objects from which the operator/user can select. In this paper, we propose MultiRankIt, a novel approach for the LTRPO task. MultiRankIt introduces the Crossmodal Noun Phrase Encoder to model the relationship between phrases that contain referring expressions and the target bounding box, and the Crossmodal Region Feature Encoder to model the relationship between the target object and multiple images of its surrounding contextual environment. Additionally, we built a new dataset for the LTRPO task that consists of instructions with complex referring expressions accompanied by real indoor environmental images featuring various target objects. We validated our model on this dataset, and it outperformed the baseline method in terms of mean reciprocal rank and recall@k. Furthermore, we conducted physical experiments in which a domestic service robot retrieved everyday objects in a standardized domestic environment, based on users' instructions in a human-in-the-loop setting. The experimental results demonstrate that the success rate for object retrieval reached 80%. Our code is available at https://github.com/keio-smilab23/MultiRankIt.
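To make the evaluation protocol concrete, the following is a minimal, illustrative Python sketch (not the authors' implementation) of the two metrics the abstract reports for the LTRPO task, mean reciprocal rank (MRR) and recall@k, computed over ranked lists of candidate objects. The candidate box IDs and rankings below are hypothetical.

from typing import Iterable, List, Set, Tuple


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    # 1 / (rank of the first relevant candidate); 0.0 if no relevant candidate is ranked.
    for rank, candidate in enumerate(ranked_ids, start=1):
        if candidate in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    # Fraction of relevant candidates that appear in the top-k of the ranking.
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def mean_reciprocal_rank(samples: Iterable[Tuple[List[str], Set[str]]]) -> float:
    # Average reciprocal rank over (ranked_ids, relevant_ids) pairs.
    rr_values = [reciprocal_rank(ranked, relevant) for ranked, relevant in samples]
    return sum(rr_values) / len(rr_values) if rr_values else 0.0


if __name__ == "__main__":
    # Hypothetical example: for one instruction ("Pick up the bottle on the round table"),
    # the model returns candidate bounding boxes in ranked order; "box_03" is the true target.
    ranked = ["box_07", "box_03", "box_12", "box_01"]
    relevant = {"box_03"}
    print(reciprocal_rank(ranked, relevant))   # 0.5  (first hit at rank 2)
    print(recall_at_k(ranked, relevant, k=1))  # 0.0
    print(recall_at_k(ranked, relevant, k=3))  # 1.0

In the human-in-the-loop setting described above, a higher MRR means the operator typically finds the correct object nearer the top of the ranked list the model presents.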
