
TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks (2403.08049v1)

Published 12 Mar 2024 in cs.HC, cs.AI, and cs.LG

Abstract: Mixed-media tutorials, which integrate videos, images, text, and diagrams to teach procedural skills, offer more browsable alternatives than timeline-based videos. However, manually creating such tutorials is tedious, and existing automated solutions are often restricted to a particular domain. While AI models hold promise, it is unclear how to effectively harness their powers, given the multi-modal data involved and the vast landscape of models. We present TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks. First, we distill common tutorial components by surveying existing work; then, we present an approach to identify, assemble, and evaluate AI models for component extraction; finally, we propose guidelines for designing user interfaces (UI) that support tutorial creation based on AI-generated components. In preliminary user studies, TutoAI achieved quality higher than or similar to that of a baseline model.

