Explaining CLIP's performance disparities on data from blind/low vision users (2311.17315v3)
Abstract: Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet, these models have not been systematically evaluated on data captured by BLV users. We address this by empirically assessing CLIP, a widely-used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than for web-crawled images. This disparity stems from CLIP's sensitivities to 1) image content (e.g. not recognizing disability objects as well as other objects); 2) image quality (e.g. not being robust to lighting variation); and 3) text content (e.g. not recognizing objects described by tactile adjectives as well as visual ones). We delve deeper with a textual analysis of three common pre-training datasets: LAION-400M, LAION-2B and DataComp-1B, showing that disability content is rarely mentioned. We then illustrate how these performance disparities extend to three downstream models underpinned by CLIP: OWL-ViT, CLIPSeg and DALL-E 2. We find that few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios, which we discuss alongside a set of other possible mitigations.
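To make the evaluation setup concrete, here is a minimal sketch of zero-shot object classification with an openly available CLIP checkpoint via OpenCLIP. The checkpoint name, image path, and candidate labels are illustrative assumptions, not the paper's exact configuration (the paper evaluates 25 CLIP variants).

```python
import torch
import open_clip
from PIL import Image

# Load one openly available CLIP variant (illustrative choice of checkpoint).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Candidate object labels; "guide cane" stands in for a disability object
# (illustrative labels, not the paper's label set).
labels = ["a photo of a guide cane", "a photo of a mug", "a photo of a door"]
text_tokens = tokenizer(labels)

# Hypothetical path to an image captured by a BLV user.
image = preprocess(Image.open("blv_photo.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between image and label embeddings -> softmax over labels.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({label: round(float(p), 3) for label, p in zip(labels, probs[0])})
```

The abstract also reports that few-shot learning with as few as 5 images can mitigate the disparity in some scenarios. Below is a minimal sketch of one such approach, assuming a nearest-centroid ("prototype") classifier over CLIP image embeddings; the paper's exact few-shot method may differ.

```python
import torch

def build_prototypes(support_features: torch.Tensor, support_labels: torch.Tensor) -> torch.Tensor:
    """Average the normalized CLIP embeddings of each class's ~5 support images."""
    classes = support_labels.unique()
    protos = torch.stack([support_features[support_labels == c].mean(dim=0) for c in classes])
    return protos / protos.norm(dim=-1, keepdim=True)

def classify(query_features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Assign each query embedding to its most similar class prototype."""
    return (query_features @ prototypes.T).argmax(dim=-1)
```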
- Be My Eyes. https://www.bemyeyes.com. Accessed: 2023-10-11.
- Be My Eyes uses GPT-4 to transform visual accessibility. https://openai.com/customer-stories/be-my-eyes. Accessed: 2023-10-11.
- Google Lookout. https://play.google.com/store/apps/details?id=com.google.android.apps.accessibility.reveal. Accessed: 2023-11-06.
- IncluSet. https://incluset.com/. Accessed: 2023-11-09.
- Seeing AI. https://www.microsoft.com/en-us/ai/seeing-ai. Accessed: 2023-10-11.
- Evaluating CLIP: Towards characterization of broader capabilities and downstream implications. arXiv preprint arXiv:2108.02818, 2021.
- Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022.
- A New Dataset Based on Images Taken by Blind People for Testing the Robustness of Image Classification Models Trained for ImageNet Categories. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- Designing disaggregated evaluations of AI systems: Choices, considerations, and tradeoffs. In AAAI/ACM Conference on AI, Ethics, and Society, 2021.
- The unreasonable effectiveness of CLIP features for image captioning: an experimental analysis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- Strong Baselines for Parameter Efficient Few-Shot Fine-tuning. arXiv preprint arXiv:2304.01917, 2023.
- VizWiz: Nearly real-time answers to visual questions. In Annual ACM Symposium on User Interface Software and Technology, 2010.
- Fairlearn: A toolkit for assessing and improving fairness in AI. Technical Report MSR-TR-2020-32, Microsoft, 2020.
- Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91. PMLR, 2018.
- A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning. PMLR, 2020.
- PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- Assessing image quality issues for real-world problems. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- The measure and mismeasure of fairness. Journal of Machine Learning Research, 2023.
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv preprint arXiv:2305.06500, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- CLIP2Video: Mastering video-text retrieval via image CLIP. arXiv preprint arXiv:2106.11097, 2021.
- Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
- DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
- Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
- VizWiz Grand Challenge: Answering visual questions from blind people. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
- CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Social biases in NLP models as barriers for persons with disabilities. In Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2020.
- OpenCLIP, 2021.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 2021.
- People with visual impairment training personal object recognizers: Feasibility and challenges. In CHI Conference on Human Factors in Computing Systems, 2017.
- The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7), 2020.
- Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 2014.
- The CLEAR benchmark: Continual learning on real-world imagery. In Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
- Frozen CLIP models are efficient video learners. In European Conference on Computer Vision, 2022.
- Stable Bias: Analyzing societal representations in diffusion models. arXiv preprint arXiv:2303.11408, 2023.
- Image segmentation using text and image prompts. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- ORBIT: A real-world few-shot dataset for teachable object recognition. In IEEE/CVF International Conference on Computer Vision, 2021.
- George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
- Simple open-vocabulary object detection. In European Conference on Computer Vision, 2022.
- Scaling open-vocabulary object detection. arXiv preprint arXiv:2306.09683, 2023.
- ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
- Understanding Personalized Accessibility through Teachable AI: Designing and Evaluating Find My Things for People who are Blind or Low Vision. In International ACM SIGACCESS Conference on Computers and Accessibility, 2023.
- Social biases through the text-to-image generation lens. arXiv preprint arXiv:2304.06034, 2023.
- GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- Towards Accountable AI: Hybrid human-machine analyses for characterizing system failure. In AAAI Conference on Human Computation and Crowdsourcing, 2018.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- OpenAI. GPT-4V(ision) System Card, 2023.
- Benchmark for compositional text-to-image synthesis. In Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 2021.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- ImageNet-21K pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
- High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
- Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
- CLIP models are few-shot learners: Empirical studies on VQA and visual entailment. arXiv preprint arXiv:2203.07190, 2022.
- Language models can see: Plugging visual controls in text generation. arXiv preprint arXiv:2205.02655, 2022.
- CLIP4caption: CLIP for video caption. In ACM International Conference on Multimedia, 2021.
- ZeroCap: Zero-shot image-to-text generation for visual-semantic arithmetic. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
- Modeling techniques for machine learning fairness: A survey. arXiv preprint arXiv:2111.03015, 2021.
- Measuring representational harms in image captioning. In ACM Conference on Fairness, Accountability, and Transparency, 2022.
- CRIS: CLIP-driven referring image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- Daniela Massiceti
- Camilla Longden
- Agnieszka Słowik
- Samuel Wills
- Martin Grayson
- Cecily Morrison