- The paper presents VisionClaw, an innovative system that fuses always-on egocentric multimodal sensing with autonomous task execution on smart glasses, achieving up to 37% task time reduction.
- The system employs a layered architecture combining real-time audio-video processing through Gemini Live with agentic control via OpenClaw for seamless digital interactions.
- Experimental studies indicate significant improvements in usability and reduced cognitive load, while also highlighting challenges in privacy and user trust.
VisionClaw: Always-On, Multimodal, Agentic AI via Smart Glasses
System Architecture and Design
VisionClaw (2604.03486) exemplifies the fusion of egocentric multimodal perception with large-scale agentic task execution, operationalized through Meta Ray-Ban smart glasses, Gemini Live for real-time multimodal interaction, and OpenClaw for back-end agentic workflows. The system architecture is structured in three layers: always-on sensing on the glasses, real-time multimodal perception and dialogue through Gemini Live, and autonomous task execution through OpenClaw.
This architecture allows for seamless, screenless delegation of open-ended real-world tasks (e.g., "add this product to Amazon," "draft an email from this document") directly from physical context, without the user needing to switch devices or explicitly describe their surroundings.
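To make the layered flow concrete, here is a minimal sketch of how the three layers might compose. The wrapper classes (GlassesStream, GeminiLiveSession, OpenClawAgent) and their methods are illustrative assumptions, not the paper's actual interfaces or code.

```python
# Illustrative sketch of the three-layer VisionClaw pipeline (assumed interfaces,
# not the authors' implementation). Layer roles follow the architecture above.

from dataclasses import dataclass

@dataclass
class PerceivedContext:
    """Grounded context extracted from the egocentric audio/video stream."""
    transcript: str          # the user's spoken request
    visual_summary: str      # description of what the camera currently sees

class GlassesStream:
    """Layer 1: always-on capture from the smart glasses (hypothetical wrapper)."""
    def next_utterance_with_frames(self):
        # A real system would yield synchronized audio and video;
        # a canned example is returned here for illustration.
        return "add this product to Amazon", ["<jpeg frame bytes>"]

class GeminiLiveSession:
    """Layer 2: real-time multimodal grounding and dialogue (hypothetical wrapper)."""
    def ground_request(self, utterance, frames) -> PerceivedContext:
        # A production system would stream audio/video to Gemini Live and parse
        # a structured response; the sketch fabricates one.
        return PerceivedContext(
            transcript=utterance,
            visual_summary="a boxed USB-C charger held in the user's hand",
        )

class OpenClawAgent:
    """Layer 3: back-end agentic execution of digital tasks (hypothetical wrapper)."""
    def execute(self, ctx: PerceivedContext) -> str:
        # The agent receives an already-grounded request, so the user never has
        # to describe their surroundings or switch to a phone.
        task = f"{ctx.transcript} (item seen: {ctx.visual_summary})"
        return f"executed: {task}"

def handle_next_interaction():
    glasses, gemini, agent = GlassesStream(), GeminiLiveSession(), OpenClawAgent()
    utterance, frames = glasses.next_utterance_with_frames()   # Layer 1: sense
    context = gemini.ground_request(utterance, frames)         # Layer 2: ground
    return agent.execute(context)                              # Layer 3: act

if __name__ == "__main__":
    print(handle_next_interaction())
```

The key design point captured here is that grounding happens before delegation: the agent layer only ever sees a request that already carries the relevant physical context.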
Experimental Evaluation: Laboratory Study
A within-subjects laboratory study (N=12) benchmarked VisionClaw (Always-On + Agent) against two baselines:
- Always-On Only: Smart glasses with Gemini Live for perception and QA, but no autonomous execution.
- Agent Only: Smartphone-based OpenClaw agent with no real-world perception, requiring users to describe context explicitly.
Tasks were grounded in everyday physical artifacts (receipts, papers, books, IoT devices) and included note taking, email composition, product lookup, and device control.
Figure 2: Four representative physical-digital tasks utilized in the user study, encompassing note taking, email drafting, product search/shop, and smart device control.
Results showed that VisionClaw achieved up to a 37% reduction in task completion time and a 46% reduction in perceived difficulty relative to the baselines. Notably, in email composition, VisionClaw's median completion time was 105.7s vs. 216.4s for Always-On Only and 131.1s for Agent Only. NASA-TLX analysis confirmed significant reductions in mental demand, temporal demand, and user frustration (p<0.05).
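For readers who want to see how such figures are typically derived, the sketch below computes median-based percent reductions and a paired Wilcoxon signed-rank test over per-participant completion times. The arrays are fabricated placeholders, and the Wilcoxon test is one reasonable choice for a within-subjects design, not necessarily the test the authors used.

```python
# Illustrative analysis of within-subjects completion times (fabricated numbers,
# not the study's data). Shows how a "% reduction" and a paired significance
# test like those reported above would be computed.

import numpy as np
from scipy.stats import wilcoxon

# Per-participant completion times (seconds) for one task, N=12, three conditions.
visionclaw = np.array([ 98, 110, 101, 120,  95, 107, 112,  99, 104, 118, 103, 109])
always_on  = np.array([210, 225, 198, 240, 205, 219, 230, 202, 215, 245, 208, 221])
agent_only = np.array([128, 140, 125, 150, 122, 135, 142, 127, 131, 148, 129, 137])

def pct_reduction(system, baseline):
    """Reduction in median completion time relative to a baseline condition."""
    return 100 * (1 - np.median(system) / np.median(baseline))

print(f"vs Always-On Only: {pct_reduction(visionclaw, always_on):.1f}% faster")
print(f"vs Agent Only:     {pct_reduction(visionclaw, agent_only):.1f}% faster")

# Paired, non-parametric comparison across participants, appropriate when
# per-participant timing data are not assumed to be normally distributed.
for name, baseline in [("Always-On Only", always_on), ("Agent Only", agent_only)]:
    stat, p = wilcoxon(visionclaw, baseline)
    print(f"Wilcoxon vs {name}: W={stat:.1f}, p={p:.4f}")
```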
Figure 3: Task completion times; VisionClaw consistently outperforms baselines, with strong statistical significance in text-heavy tasks.
Figure 4: NASA-TLX workload scores highlight significant reductions in cognitive load and frustration with VisionClaw.
Self-reported Likert ratings indicated improvements in perceived control and usefulness, although differences in trust and reliability were not significant. Some users remained hesitant to fully delegate high-stakes tasks without explicit verification.
Figure 5: Self-report assessments; VisionClaw is rated as more useful and controllable for integrated, real-world tasks.
Longitudinal Deployment Study: Emergent Patterns and Use Cases
To surface long-term and in-situ interaction dynamics, an autobiographical deployment study (N=4, 555 interactions, 25.8 hours) categorized usage across six archetypes: Communicate, Retrieve, Save, Recall, Shop, and Control.
Figure 6: Exemplars for each primary use case, demonstrating agentic task execution grounded in visual context.
Interaction logs show that users averaged 10.1 voice-initiated commands per day, with 39% of commands leveraging visual context from the camera. Sessions were distributed throughout daily activity, substantiating the system's utility for ambient, always-on interaction.
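As an illustration of how such usage statistics can be derived from deployment logs, the sketch below aggregates a hypothetical log format into the six archetypes, a per-day command rate, and the share of commands that used visual context. The log schema and entries are assumptions for illustration, not the study's data or tooling.

```python
# Illustrative aggregation of deployment logs into the six use-case archetypes
# and simple usage statistics (hypothetical log schema, fabricated entries).

from collections import Counter
from dataclasses import dataclass
from datetime import date

ARCHETYPES = {"Communicate", "Retrieve", "Save", "Recall", "Shop", "Control"}

@dataclass
class Interaction:
    day: date            # calendar day of the voice-initiated command
    archetype: str       # one of the six categories above
    used_camera: bool    # whether visual context from the camera was leveraged

def summarize(log: list[Interaction]) -> dict:
    """Compute headline usage statistics over a list of logged interactions."""
    assert all(i.archetype in ARCHETYPES for i in log)
    days = {i.day for i in log}
    return {
        "interactions": len(log),
        "commands_per_day": round(len(log) / max(len(days), 1), 1),
        "visual_context_share_pct": round(
            100 * sum(i.used_camera for i in log) / max(len(log), 1), 1),
        "by_archetype": dict(Counter(i.archetype for i in log)),
    }

# Tiny fabricated example.
log = [
    Interaction(date(2025, 5, 1), "Shop", True),
    Interaction(date(2025, 5, 1), "Save", True),
    Interaction(date(2025, 5, 2), "Retrieve", False),
    Interaction(date(2025, 5, 2), "Control", False),
]
print(summarize(log))
```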
Figure 7: Temporal visualization of task session distributions across use case categories in real-world deployment.
Notable emergent properties included:
- Multi-turn, Open-Ended Conversation: Users naturally chained queries and actions, blurring the segmentation between information request, memory retrieval, and autonomous task execution.
- Opportunistic Capture and Recall: Information was captured and recalled spontaneously, leveraging persistent egocentric context without deliberate device interaction.
- Screenless, Calm Interaction: Delegating tasks via voice rather than a phone reduced cognitive load, though trust and the need for explicit task confirmation remain active concerns.
- Interaction Evolution: As more personal data and agentic skills became available, users increasingly integrated agentic workflows into routine behavior.
Figure 8: Four recurring patterns observed: chained conversation, opportunistic memory actions, computation in the periphery, and increasing utility with richer personal data.
A comprehensive taxonomy of interaction scenarios, detailed in the appendix, illustrates broad domain transferability (e.g., academic, domestic, retail, and personal productivity).
Figure 9: Taxonomy of observed use case categories with illustrated application scenarios.
Discussion: Theoretical Implications and Future Directions
VisionClaw's integration of always-on egocentric perception with agentic task execution precipitates several theoretical and practical shifts, most notably around delegation and trust in autonomous execution, privacy of continuous sensing, and memory architectures for persistent egocentric context.
Conclusion
VisionClaw demonstrates the feasibility and advantages of tightly coupling persistent egocentric perception with general-purpose agentic task execution in wearable platforms. Empirical results show strong performance and user experience gains in screenless, situated contexts and broad real-world applicability across domains. The findings elucidate fundamental shifts in habitual HCI and agent design, while also foregrounding critical challenges in privacy, model integration, memory architectures, and scalable deployment. This work constitutes a significant step in understanding and enabling ubiquitous, agent-driven multimodal computing.