ExpressEdit: Video Editing with Natural Language and Sketching (2403.17693v1)
Abstract: Informational videos serve as a crucial source for explaining conceptual and procedural knowledge to novices and experts alike. When producing informational videos, editors overlay text or images and trim footage to enhance the video quality and make it more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors, who often struggle with expressing and implementing their editing ideas. To address this challenge, we first explored how multimodality, specifically natural language (NL) and sketching, which are natural modalities humans use for expression, can be utilized to support video editors in expressing video editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed the patterns of use of NL and sketching in describing edit intents. Based on the findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by an LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command and spatial references from sketching. The system implements the interpreted edits, which the user can then iterate on. An observational study (N=10) showed that ExpressEdit enhanced the ability of novice video editors to express and implement their edit ideas. The system allowed participants to perform edits more efficiently and generate more ideas by generating edits based on the user's multimodal edit commands and supporting iterations on the editing commands. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.