ExpressEdit: Video Editing with Natural Language and Sketching (2403.17693v1)

Published 26 Mar 2024 in cs.HC and cs.AI

Abstract: Informational videos serve as a crucial source for explaining conceptual and procedural knowledge to novices and experts alike. When producing informational videos, editors edit videos by overlaying text/images or trimming footage to enhance the video quality and make it more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors who often struggle with expressing and implementing their editing ideas. To address this challenge, we first explored how multimodality, in the form of natural language (NL) and sketching (both natural modalities humans use for expression), can be utilized to support video editors in expressing video editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed the patterns of use of NL and sketching in describing edit intents. Based on the findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command and spatial references from sketching. The system implements the interpreted edits, which the user can then iterate on. An observational study (N=10) showed that ExpressEdit enhanced the ability of novice video editors to express and implement their edit ideas. The system allowed participants to perform edits more efficiently and generate more ideas by generating edits based on users' multimodal edit commands and supporting iterations on the editing commands. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.
