- The paper presents a novel on-device multimodal architecture that eliminates network dependency and reduces referential ambiguity using a gaze-locked clipping window.
- The paper employs a model-agnostic approach with ONNX Runtime on Magic Leap 2 to perform all perception and language processing locally, ensuring data privacy.
- The paper reports acceptable latency on benchmark datasets and lays the groundwork for future XR enhancements, despite usability scores below those of mature cloud systems.
On-Device Multimodal Vision-Language Interaction with ClickAIXR in XR
Motivation and Context
ClickAIXR addresses significant limitations in current XR applications that integrate vision-language models (VLMs), specifically network dependency, privacy leakage, referential ambiguity, and segmentation overhead. Existing systems such as GazePointAR and XaiR rely on cloud-based inference, exposing sensitive visual data and incurring latency and recurring costs. Moreover, gaze- or voice-based selection mechanisms can introduce ambiguity and inefficiency, particularly in scene-rich XR environments. ClickAIXR proposes an entirely on-device multimodal architecture that performs all perception, object selection, language processing, and speech I/O on XR hardware (Magic Leap 2), offering true data locality and segmentation-free, ambiguity-minimized user interaction (2604.04905).
System Architecture
ClickAIXR comprises a tightly integrated pipeline:
- On-device VLM: Uses ViT-GPT-2 (VisionEncoderDecoder) for local image captioning via ONNX Runtime. The architecture is model-agnostic, supporting substitution with other ONNX-based VLMs, including instruction-tuned or quantized variants that fit the device's memory budget.
- Object Selection: Introduces the Gaze-Locked Clipping Window (GCW), a user-controllable, gaze-aligned rectangular selection overlay. Unlike prior approaches leveraging instance segmentation (e.g., YOLOv8 in GazePointAR), GCW implements fast, pixel-accurate cropping, minimizing referential ambiguity.
- Multimodal Interaction: Input modalities include gaze, controller, speech, and text. Queries are formulated via speech or text, and all ASR/TTS is performed locally using Vosk and built-in TTS, maintaining privacy and predictable latency.
- Deployment: Implemented in the Magic Leap 2 SDK (C API), the system supports fully offline operation. It includes two primary interaction modes: (1) dwell auto-capture using a fixed-size GCW, and (2) explicit GCW selection via controller adjustment.
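The captioning step in this pipeline is a standard encoder-decoder loop: the selected ROI is embedded once by the vision encoder, and the GPT-2 decoder is then stepped greedily until an end-of-sequence token. A minimal sketch of that loop follows; the ONNX file names and input/output names in the comments are illustrative assumptions, since the paper only states that ViT-GPT-2 runs via ONNX Runtime.

```python
# Greedy autoregressive decoding over an abstract decoder step, of the kind
# an on-device ViT-GPT-2 captioner performs after encoding the cropped ROI.
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_new_tokens=32):
    """step_fn(token_ids) -> 1-D logits over the vocabulary for the next token."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(step_fn(tokens)))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Wiring to ONNX Runtime would look roughly like this (paths and I/O names
# are hypothetical):
#   import onnxruntime as ort
#   enc = ort.InferenceSession("vit_encoder.onnx")
#   dec = ort.InferenceSession("gpt2_decoder.onnx")
#   enc_out = enc.run(None, {"pixel_values": roi_pixels})[0]
#   step_fn = lambda ids: dec.run(None, {"input_ids": [ids],
#                                        "encoder_hidden_states": enc_out})[0][0, -1]

# Toy decoder for illustration: emits token 7, then EOS (id 2).
def toy_step(ids):
    logits = np.zeros(10)
    logits[7 if len(ids) == 1 else 2] = 1.0
    return logits

print(greedy_decode(toy_step, bos_id=0, eos_id=2))  # [0, 7, 2]
```

The same loop works unchanged for any substituted ONNX decoder, which is what makes the model-agnostic claim cheap to realize in practice.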
Evaluation and Empirical Results
Latency Benchmarking
Inference latency was measured on both the Book Covers and COCO datasets using the ViT-GPT-2 model. Excluding model load time, mean inference latency was 5.36 s (Books) and 5.48 s (COCO), with a token generation speed of 3.36 tokens/s. While absolute latency does not reach the 1 s threshold recommended for uninterrupted cognitive flow, it remains well below the 10 s interruption benchmark, providing acceptable response times for interactive XR use, particularly given the absence of network dependence and segmentation overhead. Compared to GazePointAR's multi-stage, cloud-based pipeline (7.51 s total, with 3.75 s for segmentation alone), ClickAIXR achieves streamlined processing via its segmentation-free GCW selection (2604.04905).
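The reported figures are internally consistent: at 3.36 tokens/s, a 5.36 s caption corresponds to roughly 18 generated tokens. A minimal per-query latency reduction of the kind such benchmarking implies can be sketched as follows (the numbers below are synthetic, not the paper's measurements):

```python
# Reduce per-query timings to the two metrics reported in the paper:
# mean end-to-end latency and aggregate token throughput.
def latency_stats(per_query_seconds, tokens_per_query):
    mean_s = sum(per_query_seconds) / len(per_query_seconds)
    tok_per_s = sum(tokens_per_query) / sum(per_query_seconds)
    return mean_s, tok_per_s

times = [5.1, 5.4, 5.6]   # synthetic per-caption latencies (s)
tokens = [17, 18, 19]     # synthetic caption lengths (tokens)
mean_s, tps = latency_stats(times, tokens)
print(f"mean latency {mean_s:.2f} s, {tps:.2f} tokens/s")
# → mean latency 5.37 s, 3.35 tokens/s
```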
User Study
A within-subjects experiment with 12 participants compared ClickAIXR to cloud-based Gemini 2.5 Flash and ChatGPT 5. Usability was assessed using the System Usability Scale (SUS) and task-specific reliability/error questionnaires.
- SUS Scores: ClickAIXR obtained a mean SUS of 60.0 (SD = 17.1), which, while lower than Gemini (81.9) and ChatGPT (76.7), closely tracked the typical “marginal/OK” band for early XR interfaces and was comparable to GazePointAR's reported SUS (62.1).
- Task-specific Reliability: Perceived reliability and error rates were lower for ClickAIXR, attributed to both interface novelty and the use of a captioning model not fine-tuned for instruction-following.
- Preference Ranking: Cloud-based systems were preferred (Gemini > ChatGPT > ClickAIXR), primarily due to their maturity and familiarity, rather than intrinsic limitations of ClickAIXR’s on-device architecture.
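For reference, the SUS means above follow the standard scoring scheme (Brooke, 1996): odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is scaled by 2.5 onto a 0–100 range. A minimal implementation:

```python
# Standard System Usability Scale scoring: ten 1-5 Likert responses -> 0-100.
def sus_score(responses):
    assert len(responses) == 10, "SUS has exactly ten items"
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

print(sus_score([3] * 10))  # a fully neutral respondent scores 50.0
```

On this scale, ClickAIXR's 60.0 sits just above the commonly cited "OK" midpoint, while Gemini's 81.9 falls in the "excellent" band.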
Analysis and Technical Implications
Explicit Object Selection and Ambiguity Mitigation
A central advance in ClickAIXR is the GCW, which eliminates the need for separate vision segmentation and enables precise, user-driven ROI cropping. This not only reduces pipeline latency but also directly addresses pronoun ambiguity present in gaze/voice-only paradigms. The system’s design ensures the model operates only on the explicitly selected ROI, increasing transparency and user trust, especially in cases where fine-grained object reference is critical (e.g., distinguishing a nose from a face within the same bounding area).
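The GCW's ROI step reduces, in essence, to a gaze-centred rectangle clamped to the frame, with the VLM seeing only those pixels. The sketch below illustrates that segmentation-free crop; controller-driven window resizing is omitted, and this is an illustration of the idea rather than the paper's implementation.

```python
# Gaze-locked clipping: crop a fixed-size window centred on the gaze point,
# clamped so the window always lies fully inside the frame.
import numpy as np

def gaze_clip(frame, gaze_xy, win_w, win_h):
    h, w = frame.shape[:2]
    gx, gy = gaze_xy
    x0 = min(max(gx - win_w // 2, 0), max(w - win_w, 0))
    y0 = min(max(gy - win_h // 2, 0), max(h - win_h, 0))
    return frame[y0:y0 + win_h, x0:x0 + win_w]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
roi = gaze_clip(frame, gaze_xy=(620, 10), win_w=128, win_h=128)
print(roi.shape)  # (128, 128, 3) — window stays inside the frame near edges
```

Because the crop is a pure array slice, it adds effectively zero latency compared to running an instance-segmentation model per query.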
Privacy, Latency, and Deployability
Running all inference locally confers several advantages:
- Privacy: No visual or spoken data leaves the device; no third-party cloud exposure.
- Latency: Predictable, network-independent processing, immune to bandwidth fluctuations.
- Sustainability: Removes recurring costs (subscriptions, high energy usage from cloud compute).
- Customizability: Supports local fine-tuning of models for specific scenarios, impossible with most proprietary APIs.
The primary trade-off is hardware-limited model complexity, which constrains the semantic depth and fluency achievable compared to cloud-scale LLMs/VLMs. However, this is mitigated by recent progress in mobile-capable VLM architectures, model quantization, and XR-specific optimization, as demonstrated by MobileVLM [7], TinyVLA [12], and benchmarking pipelines such as AIvaluateXR [15].
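The memory side of this trade-off is simple arithmetic: weight footprint scales linearly with bit-width, so 8-bit or 4-bit quantization shrinks a model roughly 4x or 8x versus float32. The parameter count below is illustrative, not a measured figure from the paper.

```python
# Weight-memory budget for a quantized model: params * bits / 8 bytes.
def weight_footprint_mb(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 1e6

# A hypothetical ~250M-parameter captioner at three precisions:
for bits in (32, 8, 4):
    print(f"{bits:>2}-bit: {weight_footprint_mb(250e6, bits):.0f} MB")
```

This is why quantized or distilled variants are the natural substitution path on memory-constrained XR headsets.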
Broader Impact and Future Directions
ClickAIXR demonstrates the technical feasibility and design trade-offs of deploying local, multimodal VLMs in XR settings. Its segmentation-free, gaze/controller-based selection is a scalable design for low-ambiguity, privacy-preserving interaction. As dedicated hardware accelerators and memory capacities proliferate in XR devices, the real-time, on-device deployment of more complex, instruction-tuned VLMs becomes increasingly viable.
Future advances will likely focus on:
- Inference acceleration via GPU or dedicated AI processors
- Model architecture compactness (quantized/distilled models)
- Instruction tuning and XR-specific dataset fine-tuning for enhanced interaction richness and reliability
- Expanded input modalities (gesture, scene-context integration) and hybrid privacy-preserving client-server approaches for scaling to larger models
Conclusion
ClickAIXR establishes a practical, deployable baseline for multimodal, object-grounded interaction in privacy-critical or connectivity-limited XR applications. Its fully on-device, segmentation-free pipeline reduces system-level latency and referential ambiguity, providing measurable system advantages for certain classes of interaction. While its usability currently lags mature cloud-based assistants, advances in XR hardware and model design are expected to close this gap, positioning ClickAIXR as a foundation for trustworthy, fine-tunable, and efficient multimodal AI in extended reality (2604.04905).