
RealitySummary: Exploring On-Demand Mixed Reality Text Summarization and Question Answering using Large Language Models (2405.18620v2)

Published 28 May 2024 in cs.HC, cs.AI, and cs.CL

Abstract: LLMs are gaining popularity as tools for reading and summarization aids. However, little is known about their potential benefits when integrated with mixed reality (MR) interfaces to support everyday reading assistants. We developed RealitySummary, an MR reading assistant that seamlessly integrates LLMs with always-on camera access, OCR-based text extraction, and augmented spatial and visual responses in MR interfaces. Developed iteratively, RealitySummary evolved across three versions, each shaped by user feedback and reflective analysis: 1) a preliminary user study to understand user perceptions (N=12), 2) an in-the-wild deployment to explore real-world usage (N=11), and 3) a diary study to capture insights from real-world work contexts (N=5). Our findings highlight the unique advantages of combining AI and MR, including an always-on implicit assistant, minimal context switching, and spatial affordances, demonstrating significant potential for future LLM-MR interfaces beyond traditional screen-based interactions.


Summary

  • The paper introduces RealitySummary, an MR system that combines OCR and GPT-4 for real-time text extraction and on-demand summarization.
  • The system's evaluations demonstrate high OCR accuracy (97.9%) and summarization correctness (96.77%), enhancing document comprehension and navigation.
  • User studies validate the MR approach with positive usability ratings (SUS score: 71) and practical applications across academic and everyday contexts.

On-Demand Mixed Reality Document Enhancement: The RealitySummary System

Introduction

The research paper "RealitySummary: On-Demand Mixed Reality Document Enhancement using LLMs" introduces RealitySummary, a mixed reality (MR) reading assistant designed to enhance printed or digital documents through on-demand text extraction, summarization, and augmentation. Unlike previous augmented reading tools that required pre-processed documents, RealitySummary leverages optical character recognition (OCR) and LLMs to provide real-time document enhancements. This paper presents generalizable techniques for diverse documents, explores system architectures, and evaluates their usability and applicability through user studies.

System Design and Implementation

RealitySummary integrates multiple technologies to extract, analyze, and annotate documents in real-time. The system uses OCR (Google Cloud OCR) to capture textual content from physical and digital media and employs GPT-4 for generating dynamic summaries and augmentations. The MR environment is created using Microsoft HoloLens 2 and Apple Vision Pro, showcasing the system's hardware independence. The design uses a blend of image tracking, spatial canvases, and speech input for intuitive user interactions.
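The extract-then-summarize flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ReadingAssistant` class, the `OcrFn` and `LlmFn` callable types, the `max_chars` cap, and the prompt wording are all assumptions, and the OCR and LLM steps are replaced by offline stand-ins rather than calls to Google Cloud OCR or GPT-4.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: a real system would wrap an OCR service and an
# LLM API here. Plain callables keep the sketch runnable offline.
OcrFn = Callable[[bytes], str]
LlmFn = Callable[[str], str]


@dataclass
class ReadingAssistant:
    ocr: OcrFn
    llm: LlmFn
    max_chars: int = 4000  # assumed cap on how much text is sent to the model

    def summarize(self, frame: bytes) -> str:
        """Extract text from a camera frame and ask the model for a summary."""
        text = self.ocr(frame).strip()
        if not text:
            return ""  # nothing readable in view, so skip the model call
        prompt = (
            "Summarize the following text in two sentences:\n\n"
            + text[: self.max_chars]
        )
        return self.llm(prompt)


# Stand-in implementations for demonstration only.
def fake_ocr(frame: bytes) -> str:
    # Pretend the "image" bytes are already the recognized text.
    return frame.decode("utf-8")


def fake_llm(prompt: str) -> str:
    # Pretend the model returns the first sentence of the supplied text.
    body = prompt.split("\n\n", 1)[1]
    return body.split(".")[0] + "."


assistant = ReadingAssistant(ocr=fake_ocr, llm=fake_llm)
print(assistant.summarize(b"Mixed reality overlays digital content. It runs on headsets."))
```

Passing the OCR and model steps in as callables keeps the pipeline independent of any particular vendor, which loosely mirrors the hardware independence the system is described as having.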

The system presents six types of document augmentations: summaries, comparison tables, timelines, keyword lists, summary highlighting, and information cards. These features aim to transform the reading experience by providing immediate and contextualized content insights without requiring pre-preparation of documents.
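One way to picture the six augmentation types is as an enumeration that selects an instruction for the model. The sketch below is hypothetical: the `Augmentation` enum, the `INSTRUCTIONS` table, and `build_prompt` are illustrative names, and the instruction strings are invented for the example rather than taken from the paper.

```python
from enum import Enum


class Augmentation(Enum):
    """The six augmentation types named in the text."""
    SUMMARY = "summary"
    COMPARISON_TABLE = "comparison table"
    TIMELINE = "timeline"
    KEYWORD_LIST = "keyword list"
    SUMMARY_HIGHLIGHTING = "summary highlighting"
    INFORMATION_CARD = "information card"


# Hypothetical instructions; the prompts actually used are not given here.
INSTRUCTIONS = {
    Augmentation.SUMMARY: "Write a short summary of the text.",
    Augmentation.COMPARISON_TABLE: "Compare the main items in the text as a table.",
    Augmentation.TIMELINE: "List the dated events in the text in chronological order.",
    Augmentation.KEYWORD_LIST: "List the key terms in the text.",
    Augmentation.SUMMARY_HIGHLIGHTING: "Quote the sentences that best summarize the text.",
    Augmentation.INFORMATION_CARD: "Give a brief explanation of the main entity in the text.",
}


def build_prompt(kind: Augmentation, text: str) -> str:
    """Combine the instruction for one augmentation type with the extracted text."""
    return f"{INSTRUCTIONS[kind]}\n\nText:\n{text}"


print(build_prompt(Augmentation.TIMELINE, "In 1991 the DigitalDesk was presented."))
```

Keeping the instructions in a single table means a new augmentation type only needs one enum member and one entry.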

Formative Design Study

To ensure the system's utility across various document types, the authors conducted an exploratory design study. Participants were asked to visualize potential document enhancements, resulting in five overarching categories:

  1. Summarize: Text-based and personalized summaries.
  2. Compare: Dynamic comparisons using tables or visual formats like mind maps.
  3. Augment: Enriching content with external data like images or maps.
  4. Extract: Persistent references via keyword lists or citation extracts.
  5. Navigate: Enhanced document navigation through progress indicators or collapsible headings.

These insights were critical in shaping the RealitySummary design, emphasizing the utility of mixed reality for spatial and tangible interaction with augmented content.

Technical Evaluation

The system's performance evaluation focuses on AR tracking reliability, OCR accuracy, and summarization relevance. The paper reports high OCR accuracy (97.9%) and reliable document tracking, particularly for documents containing visual elements. However, text-only documents underperformed in tracking (64% uptime) due to limited visual features. Summarization was generally precise, with a 96.77% correctness rate across evaluated documents.
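This summary does not state how the accuracy and uptime figures were computed. As an illustration only, the sketch below uses two common definitions: character-level accuracy derived from Levenshtein edit distance, and uptime as the fraction of frames in which the document was tracked. The function names and both metric definitions are assumptions, not the paper's method.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, recognized: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not recognized else 0.0
    return max(0.0, 1.0 - edit_distance(reference, recognized) / len(reference))


def tracking_uptime(tracked: Sequence[bool]) -> float:
    """Assumed metric: share of frames in which the document was tracked."""
    return sum(tracked) / len(tracked) if tracked else 0.0


print(char_accuracy("hello world", "hallo world"))
print(tracking_uptime([True] * 16 + [False] * 9))
```

Under these definitions, a run in which 16 of 25 frames are tracked gives an uptime of 0.64, the same proportion as the 64% reported for text-only documents.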

Usability Study

A usability study with twelve participants highlighted RealitySummary's positive reception. Participants found the system intuitive and beneficial for enhancing their comprehension and navigation of documents. They appreciated the combined use of features like timelines and keyword lists, which helped them build a structured understanding of content. The study reported a System Usability Scale (SUS) score of 71, indicating a favorable usability level for a prototype.
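For context on how a SUS score is obtained, the standard scoring procedure from Brooke's questionnaire can be sketched as below. Each participant answers ten items on a 1 to 5 scale. Odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a value from 0 to 100. The function name and the sample responses are made up for illustration and are not data from the study.

```python
from statistics import mean
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's ten SUS responses (each an integer from 1 to 5)."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten responses, each from 1 to 5")
    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:
            total += response - 1  # items 1, 3, 5, 7, 9 are positively worded
        else:
            total += 5 - response  # items 2, 4, 6, 8, 10 are negatively worded
    return total * 2.5


# Made-up responses for two participants, for illustration only.
participants = [
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [5, 1, 4, 2, 3, 3, 4, 2, 5, 1],
]
print(mean(sus_score(p) for p in participants))
```

A study-level figure such as the one reported is typically the mean of the per-participant scores.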

In-the-Wild Study

To assess real-world applicability, the authors conducted an in-the-wild study, deploying RealitySummary in diverse settings using Apple Vision Pro. The study revealed numerous everyday applications, ranging from reading academic papers and textbooks to practical uses like interpreting restaurant menus and product labels. The always-on feature was particularly praised for enabling seamless interactions. Nevertheless, participants expressed concerns about privacy, potential over-reliance on AI, and the comfort of MR headsets.

Implications and Future Research

RealitySummary represents a significant step toward practical MR reading assistants. The system's ability to provide contextual and real-time document enhancements addresses the limitations of pre-processed AR systems. However, the research identifies several areas for future exploration:

  • Robust AR Tracking: Exploring advanced image tracking techniques to improve performance in various lighting conditions and for text-only documents.
  • Multimodal Capabilities: Extending capabilities to interpret and summarize visual content, as well as integrating more sophisticated interactions through eye-tracking and broader gestural inputs.
  • Long-Term Usability Studies: Conducting prolonged usage studies to understand the real-world implications on users' reading habits and potential cognitive impacts.
  • Balancing Proactive and On-Demand Features: Further refining the balance between automatic summarization and user-driven inquiries to enhance user experience.

Conclusion

RealitySummary applies LLM-based summarization and OCR to mixed reality reading, delivering document enhancements on demand without pre-processed content. By handling real-time text extraction and summarization, it illustrates how MR environments could change the reading experience. Advances in MR hardware and AI models may further improve the accessibility and applicability of such systems, making intelligent reading support a more routine part of everyday activities.
