
Alt-Text with Context: Improving Accessibility for Images on Twitter (2305.14779v3)

Published 24 May 2023 in cs.CV, cs.CL, and cs.LG

Abstract: In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. More than just a special case of image captioning, alt-text is both more literally descriptive and context-specific. Critically, images posted to Twitter are often accompanied by user-written text that, despite not necessarily describing the image, may provide useful context if properly leveraged. We address this task with a multimodal model that conditions on both textual information from the associated social media post and visual signal from the image, and demonstrate that the utility of these two information sources stacks. We put forward a new dataset of 371k images paired with alt-text and tweets scraped from Twitter and evaluate on it across a variety of automated metrics as well as human evaluation. We show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work, by more than 2x on BLEU@4.
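For readers unfamiliar with the BLEU@4 metric named in the abstract, the following is a minimal sketch of a sentence-level BLEU-4 score against a single reference. It is an illustration only, not the paper's evaluation code: the function names `ngrams` and `bleu4`, the whitespace tokenization, the lowercasing, and the absence of smoothing are all assumptions made here for brevity.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 with one reference (illustrative only)."""
    # Assumption: simple lowercased whitespace tokenization.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            # Without smoothing, any zero precision makes the score zero.
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the 1- to 4-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / 4)


# Hypothetical alt-text strings, not taken from the paper's dataset.
generated = "a dog running on the beach at sunset"
human_written = "a brown dog running on the beach at sunset"
score = bleu4(generated, human_written)
```

Corpus-level BLEU, which is what papers usually report, aggregates n-gram counts over all examples before taking the geometric mean, so per-sentence scores from a sketch like this one are not directly comparable to published numbers.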
