
VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments (2404.02508v1)

Published 3 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Individuals with partial or total difficulties in visual perception are referred to as visually impaired (VI) people; an estimated 2.2 billion individuals worldwide are affected by visual impairments. Recent advancements in multi-modal large language models (MLLMs) have showcased extraordinary capabilities across various domains, and it is desirable to bring their visual understanding and reasoning abilities to VI individuals. However, it is challenging for VI people to use MLLMs because of the difficulty of capturing suitable images for their daily requests; for example, the target object may be only partially captured or missing from the image entirely. This paper explores how to leverage MLLMs to provide visual question answering for VI individuals. The proposed system, VIAssist, can identify undesired images and provide detailed actions for retaking them, and it can then provide reliable answers to users' queries based on the images. Our results show that VIAssist achieves +0.21 and +0.31 higher BERTScore and ROUGE scores than the baseline, respectively.
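
As context for the reported gains, the sketch below shows how BERTScore and ROUGE-L are commonly computed between model answers and reference answers. It is a minimal illustration assuming the third-party bert-score and rouge-score Python packages; the candidate and reference strings are hypothetical placeholders, and this is not the paper's own evaluation code.

```python
# Minimal sketch of BERTScore / ROUGE-L scoring of model answers against
# reference answers. Candidate and reference texts are hypothetical.
# Requires: pip install bert-score rouge-score
from bert_score import score as bert_score
from rouge_score import rouge_scorer

candidates = ["The expiry date printed on the carton is June 12."]
references = ["The milk carton's expiration date is June 12."]

# BERTScore: semantic similarity via contextual embeddings (returns P, R, F1 tensors).
P, R, F1 = bert_score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")

# ROUGE-L: longest-common-subsequence overlap between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for cand, ref in zip(candidates, references):
    rouge_l = scorer.score(ref, cand)["rougeL"].fmeasure
    print(f"ROUGE-L F1: {rouge_l:.3f}")
```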

References (28)
  1. D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham, “Vizwiz grand challenge: Answering visual questions from blind people,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3608–3617, 2018.
  2. WHO, “Blindness and vision impairment,” Aug 2023.
  3. Be My Eyes, “Be My Eyes integrates Be My AI into its first contact center with stunning results,” Mar 2024.
  4. Be My Eyes, “The story about Be My Eyes,” Mar 2024.
  5. B. Kuriakose, R. Shrestha, and F. E. Sandnes, “Deepnavi: A deep learning based smartphone navigation assistant for people with visual impairments,” Expert Systems with Applications, vol. 212, p. 118720, 2023.
  6. P.-J. Duh, Y.-C. Sung, L.-Y. F. Chiang, Y.-J. Chang, and K.-W. Chen, “V-eye: A vision-based navigation system for the visually impaired,” IEEE Transactions on Multimedia, vol. 23, pp. 1567–1580, 2020.
  7. H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, 2023.
  8. D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023.
  9. J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” arXiv preprint arXiv:2308.12966, 2023.
  10. J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  11. W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning, pp. 5583–5594, PMLR, 2021.
  12. Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913, 2017.
  13. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
  14. W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  15. E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., “Lora: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022.
  16. Y. Zhao, Y. Zhang, R. Xiang, J. Li, and H. Li, “Vialm: A survey and benchmark of visually impaired assistance with large models,” arXiv preprint arXiv:2402.01735, 2024.
  17. T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” arXiv preprint arXiv:1904.09675, 2019.
  18. C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, pp. 74–81, 2004.
  19. B. Yang, L. He, N. Ling, Z. Yan, G. Xing, X. Shuai, X. Ren, and X. Jiang, “Edgefm: Leveraging foundation model for open-set learning on the edge,” arXiv preprint arXiv:2311.10986, 2023.
  20. S. Ma, H. Wang, L. Ma, L. Wang, W. Wang, S. Huang, L. Dong, R. Wang, J. Xue, and F. Wei, “The era of 1-bit llms: All large language models are in 1.58 bits,” arXiv preprint arXiv:2402.17764, 2024.
  21. T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16344–16359, 2022.
  22. P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
  23. B. Yang, W. Wu, Y. Liu, and H. Liu, “A novel sleep stage contextual refinement algorithm leveraging conditional random fields,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–13, 2022.
  24. B. Yang, X. Zhu, Y. Liu, and H. Liu, “A single-channel eeg based automatic sleep stage classification method leveraging deep one-dimensional convolutional neural network and hidden markov model,” Biomedical Signal Processing and Control, vol. 68, p. 102581, 2021.
  25. D. Jain, K. Huynh Anh Nguyen, S. M. Goodman, R. Grossman-Kahn, H. Ngo, A. Kusupati, R. Du, A. Olwal, L. Findlater, and J. E. Froehlich, “Protosound: A personalized and scalable sound recognition system for deaf and hard-of-hearing users,” in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–16, 2022.
  26. Y. Lin, K. Wang, W. Yi, and S. Lian, “Deep learning based wearable assistive system for visually impaired people,” in Proceedings of the IEEE/CVF international conference on computer vision workshops, pp. 0–0, 2019.
  27. R. Liu, J. Zhang, K. Peng, J. Zheng, K. Cao, Y. Chen, K. Yang, and R. Stiefelhagen, “Open scene understanding: Grounded situation recognition meets segment anything for helping people with visual impairments,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1857–1867, 2023.
  28. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
Authors (4)
  1. Bufang Yang (9 papers)
  2. Lixing He (5 papers)
  3. Kaiwei Liu (12 papers)
  4. Zhenyu Yan (30 papers)
Citations (3)

