RealitySummary: Exploring On-Demand Mixed Reality Text Summarization and Question Answering using Large Language Models (2405.18620v2)

Published 28 May 2024 in cs.HC, cs.AI, and cs.CL

Abstract: LLMs are gaining popularity as tools for reading and summarization aids. However, little is known about their potential benefits when integrated with mixed reality (MR) interfaces to support everyday reading assistants. We developed RealitySummary, an MR reading assistant that seamlessly integrates LLMs with always-on camera access, OCR-based text extraction, and augmented spatial and visual responses in MR interfaces. Developed iteratively, RealitySummary evolved across three versions, each shaped by user feedback and reflective analysis: 1) a preliminary user study to understand user perceptions (N=12), 2) an in-the-wild deployment to explore real-world usage (N=11), and 3) a diary study to capture insights from real-world work contexts (N=5). Our findings highlight the unique advantages of combining AI and MR, including an always-on implicit assistant, minimal context switching, and spatial affordances, demonstrating significant potential for future LLM-MR interfaces beyond traditional screen-based interactions.

Summary

The paper introduces RealitySummary, an innovative MR system that combines OCR and GPT-4 for real-time text extraction and dynamic summarization.
The system's evaluations demonstrate high OCR accuracy (97.9%) and summarization correctness (96.77%), enhancing document comprehension and navigation.
User studies validate the MR approach with positive usability ratings (SUS score: 71) and practical applications across academic and everyday contexts.

On-Demand Mixed Reality Document Enhancement: The RealitySummary System

Introduction

The research paper "RealitySummary: On-Demand Mixed Reality Document Enhancement using LLMs" introduces RealitySummary, a mixed reality (MR) reading assistant designed to enhance printed or digital documents through on-demand text extraction, summarization, and augmentation. Unlike previous augmented reading tools that required pre-processed documents, RealitySummary leverages optical character recognition (OCR) and LLMs to provide real-time document enhancements. This paper presents generalizable techniques for diverse documents, explores system architectures, and evaluates their usability and applicability through user studies.

System Design and Implementation

RealitySummary integrates multiple technologies to extract, analyze, and annotate documents in real-time. The system uses OCR (Google Cloud OCR) to capture textual content from physical and digital media and employs GPT-4 for generating dynamic summaries and augmentations. The MR environment is created using Microsoft HoloLens 2 and Apple Vision Pro, showcasing the system's hardware independence. The design uses a blend of image tracking, spatial canvases, and speech input for intuitive user interactions.
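The capture, extract, and summarize flow described above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: `process_frame`, `Augmentation`, and the prompt wording are hypothetical names, and the OCR and LLM backends are passed in as plain callables so that no particular cloud API signature is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Augmentation:
    """One piece of generated content to render in the MR scene (hypothetical type)."""
    kind: str
    source_text: str
    content: str


def process_frame(
    frame: bytes,
    ocr: Callable[[bytes], str],
    llm: Callable[[str], str],
    kind: str = "summary",
) -> Optional[Augmentation]:
    """Run OCR on a camera frame, then ask the LLM for one augmentation.

    `ocr` and `llm` are stand-ins for whatever services are used
    (the summary mentions Google Cloud OCR and GPT-4); their real
    client calls are deliberately not reproduced here.
    """
    text = ocr(frame).strip()
    if not text:
        # Nothing readable in view: skip the LLM call entirely.
        return None
    # Assumed prompt wording, for illustration only.
    prompt = f"Produce a {kind} of the following document text:\n\n{text}"
    return Augmentation(kind=kind, source_text=text, content=llm(prompt))
```

Keeping the backends behind callables also makes the flow easy to exercise with stubs before any real OCR or LLM service is wired in.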

The system presents six types of document augmentations: summaries, comparison tables, timelines, keyword lists, summary highlighting, and information cards. These features aim to transform the reading experience by providing immediate and contextualized content insights without requiring pre-preparation of documents.
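One way to model the six augmentation types is an enumeration paired with a per-type instruction for the LLM. The sketch below is an assumption about how such a mapping could look; the instruction strings and the `build_prompt` helper are hypothetical and are not taken from the paper.

```python
from enum import Enum


class AugmentationType(Enum):
    """The six augmentation types listed in the summary."""
    SUMMARY = "summary"
    COMPARISON_TABLE = "comparison table"
    TIMELINE = "timeline"
    KEYWORD_LIST = "keyword list"
    SUMMARY_HIGHLIGHTING = "summary highlighting"
    INFORMATION_CARD = "information card"


# Hypothetical instructions; the actual prompts used by the system are not given.
INSTRUCTIONS = {
    AugmentationType.SUMMARY: "Summarize the text in a few sentences.",
    AugmentationType.COMPARISON_TABLE: "Compare the main entities in a table.",
    AugmentationType.TIMELINE: "List dated events in chronological order.",
    AugmentationType.KEYWORD_LIST: "List the most important keywords.",
    AugmentationType.SUMMARY_HIGHLIGHTING: "Quote the sentences that carry the main points.",
    AugmentationType.INFORMATION_CARD: "Write a short explanatory card for each key term.",
}


def build_prompt(kind: AugmentationType, text: str) -> str:
    """Combine the per-type instruction with the extracted document text."""
    return f"{INSTRUCTIONS[kind]}\n\nDocument text:\n{text}"
```

For example, `build_prompt(AugmentationType.TIMELINE, extracted_text)` yields a prompt that asks only for a chronological list, so each augmentation can be requested and rendered independently.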

Formative Design Study

To ensure the system's utility across various document types, the authors conducted an exploratory design study. Participants were asked to visualize potential document enhancements, resulting in five overarching categories:

Summarize: Text-based and personalized summaries.
Compare: Dynamic comparisons using tables or visual formats like mind maps.
Augment: Enriching content with external data like images or maps.
Extract: Persistent references via keyword lists or citation extracts.
Navigate: Enhanced document navigation through progress indicators or collapsible headings.

These insights were critical in shaping the RealitySummary design, emphasizing the utility of mixed reality for spatial and tangible interaction with augmented content.

Technical Evaluation

The system's performance evaluation focuses on AR tracking reliability, OCR accuracy, and summarization relevance. The paper reports high OCR accuracy (97.9%) and reliable document tracking, particularly for documents containing visual elements. However, text-only documents underperformed in tracking (64% uptime) due to limited visual features. Summarization was generally precise, with a 96.77% correctness rate across evaluated documents.
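The summary does not state how these percentages were computed. As a rough sketch, assuming OCR accuracy is measured at the character level (one minus the character error rate) and tracking uptime is the fraction of frames in which the document was tracked, the two metrics could be calculated as follows. The function names are illustrative.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """Character-level accuracy: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - levenshtein(reference, hypothesis) / len(reference))


def tracking_uptime(tracked_flags: Iterable[bool]) -> float:
    """Fraction of frames in which the document was successfully tracked."""
    flags = list(tracked_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

Under this definition, a text-only document tracked in 16 of 25 sampled frames would have an uptime of 0.64, matching the scale of the 64% figure reported above.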

Usability Study

A usability study with twelve participants found that RealitySummary was positively received. Participants found the system intuitive and beneficial for comprehending and navigating documents. They appreciated combining features such as timelines and keyword lists, which helped them build a structured understanding of the content. The study reported a System Usability Scale (SUS) score of 71, indicating a favorable usability level for a prototype.
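For context on the reported figure, the standard SUS questionnaire has ten items rated from 1 to 5. Odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is multiplied by 2.5 to give a score from 0 to 100. The sketch below implements that standard scoring; the per-participant responses from the study are not available, so the function names and any inputs are illustrative.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's ten SUS responses (each 1-5) on a 0-100 scale."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for item, rating in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5


def mean_sus(all_responses: Sequence[Sequence[int]]) -> float:
    """Average SUS score across participants."""
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)
```

A study-level score such as 71 is the mean of the individual scores across participants.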

In-the-Wild Study

To assess real-world applicability, the authors conducted an in-the-wild study, deploying RealitySummary in diverse settings using Apple Vision Pro. The study revealed numerous everyday applications, ranging from reading academic papers and textbooks to practical uses like interpreting restaurant menus and product labels. The always-on feature was particularly praised for enabling seamless interactions. Nevertheless, participants expressed concerns about privacy, potential over-reliance on AI, and the comfort of MR headsets.

Implications and Future Research

RealitySummary represents a significant step toward practical MR reading assistants. The system's ability to provide contextual and real-time document enhancements addresses the limitations of pre-processed AR systems. However, the research identifies several areas for future exploration:

Robust AR Tracking: Exploring advanced image tracking techniques to improve performance in various lighting conditions and for text-only documents.
Multimodal Capabilities: Extending capabilities to interpret and summarize visual content, as well as integrating more sophisticated interactions through eye-tracking and broader gestural inputs.
Long-Term Usability Studies: Conducting prolonged usage studies to understand the real-world implications on users' reading habits and potential cognitive impacts.
Balancing Proactive and On-Demand Features: Further refining the balance between automatic summarization and user-driven inquiries to enhance user experience.

Conclusion

RealitySummary combines OCR and LLM-based summarization to deliver on-demand document enhancements in mixed reality. By handling real-time text extraction and summarization, it illustrates the potential of MR environments to improve the reading experience. Future advances in MR hardware and AI models may further improve the accessibility and applicability of such systems, making intelligent reading support a more routine part of everyday activities.
