
Making Short-Form Videos Accessible with Hierarchical Video Summaries (2402.10382v1)

Published 16 Feb 2024 in cs.HC

Abstract: Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts (i.e., short-form videos) have become a primary source of information and entertainment. Many short-form videos are inaccessible to blind and low vision (BLV) viewers due to their rapid visual changes, on-screen text, and music or meme-audio overlays. In our formative study, 7 BLV viewers who regularly watched short-form videos reported frequently skipping such inaccessible content. We present ShortScribe, a system that provides hierarchical visual summaries of short-form videos at three levels of detail to support BLV viewers in selecting and understanding short-form videos. ShortScribe allows BLV users to navigate between levels of video description based on their level of interest. To evaluate ShortScribe, we assessed description accuracy and conducted a user study with 10 BLV participants comparing ShortScribe to a baseline interface. When using ShortScribe, participants reported higher comprehension and provided more accurate summaries of video content.
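The abstract describes descriptions at three levels of detail that a viewer can move between according to interest. The following is a minimal sketch of that navigation pattern, assuming each level is a single text description ordered from least to most detailed. The class name `HierarchicalSummary` and the level labels (`brief`, `detailed`, `shot-by-shot`) are hypothetical illustrations, not ShortScribe's actual data model or level names.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical level labels: the abstract only states that there are three levels of detail.
LEVELS = ("brief", "detailed", "shot-by-shot")


@dataclass
class HierarchicalSummary:
    """Three descriptions of one video, ordered from least to most detailed (illustrative only)."""

    descriptions: List[str]
    level: int = 0  # start at the least detailed level

    def __post_init__(self) -> None:
        if len(self.descriptions) != len(LEVELS):
            raise ValueError(f"expected {len(LEVELS)} descriptions, got {len(self.descriptions)}")

    def current(self) -> str:
        # Prefix the text with its level so a screen reader announces where the user is.
        return f"[{LEVELS[self.level]}] {self.descriptions[self.level]}"

    def more_detail(self) -> str:
        # Move one level deeper, stopping at the most detailed level.
        self.level = min(self.level + 1, len(LEVELS) - 1)
        return self.current()

    def less_detail(self) -> str:
        # Move one level up, stopping at the least detailed level.
        self.level = max(self.level - 1, 0)
        return self.current()


if __name__ == "__main__":
    # Made-up example descriptions, for illustration only.
    summary = HierarchicalSummary([
        "A cooking video.",
        "Someone chops vegetables and mixes them into a salad.",
        "Shot 1: close-up of a cutting board. Shot 2: a bowl of salad.",
    ])
    print(summary.current())      # [brief] A cooking video.
    print(summary.more_detail())  # [detailed] Someone chops vegetables and mixes them into a salad.
```

Keeping the current level as state lets a viewer skim the brief description first and request more detail only for videos that interest them, which is the selection behavior the abstract motivates.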
