A Dataset for Crucial Object Recognition in Blind and Low-Vision Individuals' Navigation (2407.16777v1)

Published 23 Jul 2024 in cs.CV and cs.HC

Abstract: This paper introduces a dataset for improving real-time object recognition systems to aid blind and low-vision (BLV) individuals in navigation tasks. The dataset comprises 21 videos of BLV individuals navigating outdoor spaces, and a taxonomy of 90 objects crucial for BLV navigation, refined through a focus group study. We also provide object labeling for the 90 objects across 31 video segments created from the 21 videos. A deeper analysis reveals that most contemporary datasets used in training computer vision models contain only a small subset of the taxonomy in our dataset. Preliminary evaluation of state-of-the-art computer vision models on our dataset highlights shortcomings in accurately detecting key objects relevant to BLV navigation, emphasizing the need for specialized datasets. We make our dataset publicly available, offering valuable resources for developing more inclusive navigation systems for BLV individuals.
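The abstract's deeper analysis compares the 90-object navigation taxonomy against the label sets of contemporary training datasets. A minimal sketch of that kind of coverage check is below; the object names and the `coverage` helper are illustrative placeholders, not the paper's actual taxonomy or method.

```python
# Sketch: estimating how much of a navigation-object taxonomy a generic
# dataset's label set covers. The taxonomy terms here are hypothetical
# examples, not the paper's actual 90-item list.

TAXONOMY = {"crosswalk", "tactile paving", "curb", "pole", "bench",
            "trash can", "stairs", "fire hydrant", "bus stop", "bicycle"}

# A handful of COCO-style class names (COCO has 80 classes; this is a subset).
COCO_CLASSES = {"person", "bicycle", "car", "bench", "fire hydrant",
                "stop sign", "traffic light", "backpack", "umbrella"}

def coverage(taxonomy, dataset_labels):
    """Return the taxonomy objects found verbatim in a dataset's label set,
    plus the covered fraction of the taxonomy."""
    covered = taxonomy & dataset_labels
    return covered, len(covered) / len(taxonomy)

covered, frac = coverage(TAXONOMY, COCO_CLASSES)
print(sorted(covered), frac)  # → ['bench', 'bicycle', 'fire hydrant'] 0.3
```

Exact string matching understates real overlap (label vocabularies differ, e.g. "trash can" vs. "garbage bin"), which is one reason such an analysis would typically also involve manual label alignment.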

