
VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications (2405.11537v3)

Published 19 May 2024 in cs.RO, cs.AI, and cs.ET

Abstract: The advent of immersive Virtual Reality applications has transformed various domains, yet their integration with advanced artificial intelligence technologies like Visual Language Models (VLMs) remains underexplored. This study introduces a pioneering approach utilizing VLMs within VR environments to enhance user interaction and task efficiency. Leveraging the Unity engine and a custom-developed VLM, our system facilitates real-time, intuitive user interactions through natural language processing, without relying on visual text instructions. The incorporation of speech-to-text and text-to-speech technologies allows for seamless communication between the user and the VLM, enabling the system to guide users through complex tasks effectively. Preliminary experimental results indicate that utilizing VLMs not only reduces task completion times but also improves user comfort and task engagement compared to traditional VR interaction methods.
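The abstract describes a speech-to-text → VLM → text-to-speech interaction loop. The Python sketch below shows one way such a loop could be wired together; it is not the authors' implementation. `capture_vr_frame` and `query_vlm` are hypothetical stand-ins for the Unity frame capture and the custom VLM described in the paper, and Whisper plus pyttsx3 are assumed choices for the speech components.

```python
# Minimal sketch of a speech -> VLM -> speech assistant loop (assumptions noted above).
import whisper   # pip install openai-whisper (speech-to-text)
import pyttsx3   # pip install pyttsx3 (offline text-to-speech)

stt_model = whisper.load_model("base")   # lightweight Whisper checkpoint
tts_engine = pyttsx3.init()


def capture_vr_frame():
    """Hypothetical hook: return the user's current VR view as an image."""
    raise NotImplementedError("connect this to the Unity render stream")


def query_vlm(image, user_text: str) -> str:
    """Hypothetical hook: send the frame plus transcribed request to the deployed VLM."""
    raise NotImplementedError("call the vision-language model here")


def assistant_step(audio_path: str) -> str:
    # 1. Speech-to-text: transcribe the user's spoken request.
    user_text = stt_model.transcribe(audio_path)["text"]
    # 2. VLM: ground the request in what the user currently sees.
    reply = query_vlm(capture_vr_frame(), user_text)
    # 3. Text-to-speech: speak the guidance back to the user.
    tts_engine.say(reply)
    tts_engine.runAndWait()
    return reply
```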

Authors (4)
  1. Mikhail Konenkov (8 papers)
  2. Artem Lykov (22 papers)
  3. Daria Trinitatova (14 papers)
  4. Dzmitry Tsetserukou (144 papers)
Citations (4)