Rewriting the Script: Adapting Text Instructions for Voice Interaction (2306.09992v1)
Abstract: Voice assistants have sharply risen in popularity in recent years, but their use has been limited mostly to simple applications like music, hands-free search, or control of internet-of-things devices. What would it take for voice assistants to guide people through more complex tasks? In our work, we study the limitations of the dominant approach voice assistants take to complex task guidance: reading aloud written instructions. Using recipes as an example, we observe twelve participants cook at home with a state-of-the-art voice assistant. We learn that the current approach leads to nine challenges, including obscuring the bigger picture, overwhelming users with too much information, and failing to communicate affordances. Instructions delivered by a voice assistant are especially difficult because they cannot be skimmed as easily as written instructions. Alexa in particular did not surface crucial details to the user or answer questions well. We draw on our observations to propose eight ways in which voice assistants can "rewrite the script" (summarizing, signposting, splitting, elaborating, volunteering, reordering, redistributing, and visualizing) to transform written sources into forms that are readily communicated through spoken conversation. We conclude with a vision of how modern advancements in natural language processing can be leveraged for intelligent agents to guide users effectively through complex tasks.
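To make two of the eight strategies named in the abstract more concrete, the following is a minimal sketch of what "splitting" and "signposting" might look like when applied to a single written recipe step. It is not the authors' method: the function names `split_step` and `signpost`, the sentence-boundary heuristic, the "Part i of n" wording, and the sample step are all assumptions made for illustration.

```python
import re


def split_step(step: str) -> list[str]:
    """Split one written step into sentence-sized chunks.

    Assumption: sentence boundaries are approximated by '.', '!' or '?'
    followed by whitespace. Real recipe text may need a proper tokenizer.
    """
    parts = re.split(r"(?<=[.!?])\s+", step.strip())
    return [p for p in parts if p]


def signpost(chunks: list[str]) -> list[str]:
    """Prefix each chunk with its position so a listener knows how much is left."""
    total = len(chunks)
    return [f"Part {i} of {total}: {c}" for i, c in enumerate(chunks, start=1)]


if __name__ == "__main__":
    # Hypothetical recipe step, written the way a web recipe might phrase it.
    step = (
        "Heat oil in a large skillet. "
        "Add the salmon and cook until golden. "
        "Flip and cook a little longer."
    )
    for utterance in signpost(split_step(step)):
        print(utterance)
```

Each printed line is short enough to be spoken as its own turn, which is the point of splitting; the "Part i of n" prefix is one simple way to give the orientation that a listener cannot get by skimming.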