VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation (2403.12415v1)
Abstract: This paper explores the potential of large language models (LLMs) in zero-shot anomaly detection for safe visual navigation. With the assistance of the state-of-the-art real-time open-world object detection model Yolo-World and specialized prompts, the proposed framework can identify anomalies within camera-captured frames, covering any possible obstacles, and then generate concise, audio-delivered descriptions emphasizing the abnormalities to assist safe visual navigation in complex circumstances. Moreover, the framework leverages the advantages of LLMs and the open-vocabulary object detection model to achieve dynamic scenario switching, which allows users to transition smoothly from scene to scene and addresses a limitation of traditional visual navigation. Furthermore, the paper examines the performance contribution of different prompt components, offers a vision for future improvements in visual accessibility, and paves the way for LLMs in video anomaly detection and vision-language understanding.
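The pipeline the abstract describes (camera frame → open-vocabulary detection → LLM-generated warning → audio output) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: it assumes the Ultralytics YOLO-World and OpenAI Python APIs, and the class list, prompt wording, confidence threshold, and model names are placeholders.

```python
# Minimal sketch of the frame -> detector -> LLM -> audio pipeline from the
# abstract. Library calls assume the Ultralytics YOLO-World and OpenAI Python
# APIs; the prompt, class list, and threshold below are illustrative only.
import cv2
from ultralytics import YOLOWorld
from openai import OpenAI

detector = YOLOWorld("yolov8s-world.pt")  # open-vocabulary detector
# Hypothetical vocabulary; the paper's scenario-switching would swap this list.
detector.set_classes(["pothole", "bicycle", "pedestrian", "construction sign"])
llm = OpenAI()

def describe_anomalies(frame):
    """Detect obstacles in one frame and ask the LLM for a concise warning."""
    results = detector.predict(frame, conf=0.35, verbose=False)[0]
    labels = [results.names[int(c)] for c in results.boxes.cls]
    if not labels:
        return None
    reply = llm.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": ("You assist a visually impaired pedestrian. Objects "
                        f"detected ahead: {', '.join(labels)}. In one short "
                        "sentence, warn about any abnormality or obstacle."),
        }],
    )
    return reply.choices[0].message.content  # text to hand off to a TTS engine

cap = cv2.VideoCapture(0)  # camera stream
ok, frame = cap.read()
if ok:
    warning = describe_anomalies(frame)
    if warning:
        print(warning)  # in the described system this would be spoken aloud
cap.release()
```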
Authors: Hao Wang, Jiayou Qin, Ashish Bastola, Xiwen Chen, John Suchanek, Zihao Gong, Abolfazl Razi