- The paper presents a novel on-device multimodal architecture that eliminates network dependency and reduces referential ambiguity using a gaze-locked clipping window.
- The paper employs a model-agnostic approach with ONNX Runtime on Magic Leap 2 to perform all perception and language processing locally, ensuring data privacy.
- The paper reports acceptable latency on benchmark datasets and lays the groundwork for future XR enhancements, despite usability scores below those of mature cloud systems.
On-Device Multimodal Vision-Language Interaction with ClickAIXR in XR
Motivation and Context
ClickAIXR addresses significant limitations in current XR applications that integrate vision-language models (VLMs), specifically network dependency, privacy leakage, referential ambiguity, and segmentation overhead. Existing systems such as GazePointAR and XaiR rely on cloud-based inference, exposing sensitive visual data and incurring latency and recurring costs. Moreover, gaze- or voice-based selection mechanisms can introduce ambiguity and inefficiency, particularly in scene-rich XR environments. ClickAIXR proposes an entirely on-device multimodal architecture that performs all perception, object selection, language processing, and speech I/O on XR hardware (Magic Leap 2), offering true data locality and segmentation-free, ambiguity-minimized user interaction (2604.04905).
System Architecture
ClickAIXR comprises a tightly integrated pipeline:
- On-device VLM: Uses ViT-GPT-2 (VisionEncoderDecoder) for local image captioning via ONNX Runtime. The architecture is model-agnostic, supporting substitution with other ONNX-based VLMs, including instruction-tuned or quantized variants that fit the device's memory budget.
- Object Selection: Introduces the Gaze-Locked Clipping Window (GCW), a user-controllable, gaze-aligned rectangular selection overlay. Unlike prior approaches leveraging instance segmentation (e.g., YOLOv8 in GazePointAR), GCW implements fast, pixel-accurate cropping, minimizing referential ambiguity.
- Multimodal Interaction: Input modalities include gaze, controller, speech, and text. Queries are formulated via speech or text, and all ASR/TTS is performed locally using Vosk and built-in TTS, maintaining privacy and predictable latency.
- Deployment: Implemented in the Magic Leap 2 SDK (C API), the system supports fully offline operation. It includes two primary interaction modes: (1) dwell auto-capture using a fixed-size GCW, and (2) explicit GCW selection via controller adjustment.
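The captioning step in this pipeline is a standard encoder-decoder loop: the selected ROI is embedded once by the vision encoder, and the GPT-2 decoder is then stepped greedily until an end-of-sequence token. A minimal sketch of that loop follows; the ONNX file names and input/output names in the comments are illustrative assumptions, since the paper only states that ViT-GPT-2 runs via ONNX Runtime.

```python
# Greedy autoregressive decoding over an abstract decoder step, of the kind
# an on-device ViT-GPT-2 captioner performs after encoding the cropped ROI.
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_new_tokens=32):
    """step_fn(token_ids) -> 1-D logits over the vocabulary for the next token."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(step_fn(tokens)))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Wiring to ONNX Runtime would look roughly like this (paths and I/O names
# are hypothetical):
#   import onnxruntime as ort
#   enc = ort.InferenceSession("vit_encoder.onnx")
#   dec = ort.InferenceSession("gpt2_decoder.onnx")
#   enc_out = enc.run(None, {"pixel_values": roi_pixels})[0]
#   step_fn = lambda ids: dec.run(None, {"input_ids": [ids],
#                                        "encoder_hidden_states": enc_out})[0][0, -1]

# Toy decoder for illustration: emits token 7, then EOS (id 2).
def toy_step(ids):
    logits = np.zeros(10)
    logits[7 if len(ids) == 1 else 2] = 1.0
    return logits

print(greedy_decode(toy_step, bos_id=0, eos_id=2))  # [0, 7, 2]
```

The same loop works unchanged for any substituted ONNX decoder, which is what makes the model-agnostic claim cheap to realize in practice.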
Evaluation and Empirical Results
Latency Benchmarking
Inference latency was measured on both the Book Covers and COCO datasets using the ViT-GPT-2 model. Excluding model load time, mean inference latency was 5.36 s (Books) and 5.48 s (COCO), with a token generation speed of 3.36 tokens/s. While absolute latency does not reach the 1 s threshold recommended for uninterrupted cognitive flow, it remains well below the 10 s interruption benchmark, providing acceptable response times for interactive XR use, particularly given the absence of network dependence and segmentation overhead. Compared to GazePointAR's multi-stage, cloud-based pipeline (7.51 s total, with 3.75 s for segmentation alone), ClickAIXR achieves streamlined processing via its segmentation-free GCW selection (2604.04905).
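The reported figures are internally consistent: at 3.36 tokens/s, a 5.36 s caption corresponds to roughly 18 generated tokens. A minimal per-query latency reduction of the kind such benchmarking implies can be sketched as follows (the numbers below are synthetic, not the paper's measurements):

```python
# Reduce per-query timings to the two metrics reported in the paper:
# mean end-to-end latency and aggregate token throughput.
def latency_stats(per_query_seconds, tokens_per_query):
    mean_s = sum(per_query_seconds) / len(per_query_seconds)
    tok_per_s = sum(tokens_per_query) / sum(per_query_seconds)
    return mean_s, tok_per_s

times = [5.1, 5.4, 5.6]   # synthetic per-caption latencies (s)
tokens = [17, 18, 19]     # synthetic caption lengths (tokens)
mean_s, tps = latency_stats(times, tokens)
print(f"mean latency {mean_s:.2f} s, {tps:.2f} tokens/s")
# → mean latency 5.37 s, 3.35 tokens/s
```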
User Study
A within-subjects experiment with 12 participants compared ClickAIXR to cloud-based Gemini 2.5 Flash and ChatGPT 5. Usability was assessed using the System Usability Scale (SUS) and task-specific reliability/error questionnaires.
- SUS Scores: ClickAIXR obtained a mean SUS of 60.0 (SD = 17.1), which, while lower than Gemini (81.9) and ChatGPT (76.7), closely tracked the typical “marginal/OK” band for early XR interfaces and was comparable to GazePointAR's reported SUS (62.1).
- Task-specific Reliability: Perceived reliability and error rates were lower for ClickAIXR, attributed to both interface novelty and the use of a captioning model not fine-tuned for instruction-following.
- Preference Ranking: Cloud-based systems were preferred (Gemini > ChatGPT > ClickAIXR), primarily due to their maturity and familiarity, rather than intrinsic limitations of ClickAIXR’s on-device architecture.
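For reference, the SUS means above follow the standard scoring scheme (Brooke, 1996): odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is scaled by 2.5 onto a 0–100 range. A minimal implementation:

```python
# Standard System Usability Scale scoring: ten 1-5 Likert responses -> 0-100.
def sus_score(responses):
    assert len(responses) == 10, "SUS has exactly ten items"
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

print(sus_score([3] * 10))  # a fully neutral respondent scores 50.0
```

On this scale, ClickAIXR's 60.0 sits just above the commonly cited "OK" midpoint, while Gemini's 81.9 falls in the "excellent" band.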
Analysis and Technical Implications
Explicit Object Selection and Ambiguity Mitigation
A central advance in ClickAIXR is the GCW, which eliminates the need for separate vision segmentation and enables precise, user-driven ROI cropping. This not only reduces pipeline latency but also directly addresses pronoun ambiguity present in gaze/voice-only paradigms. The system’s design ensures the model operates only on the explicitly selected ROI, increasing transparency and user trust, especially in cases where fine-grained object reference is critical (e.g., distinguishing a nose from a face within the same bounding area).
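The GCW's ROI step reduces, in essence, to a gaze-centred rectangle clamped to the frame, with the VLM seeing only those pixels. The sketch below illustrates that segmentation-free crop; controller-driven window resizing is omitted, and this is an illustration of the idea rather than the paper's implementation.

```python
# Gaze-locked clipping: crop a fixed-size window centred on the gaze point,
# clamped so the window always lies fully inside the frame.
import numpy as np

def gaze_clip(frame, gaze_xy, win_w, win_h):
    h, w = frame.shape[:2]
    gx, gy = gaze_xy
    x0 = min(max(gx - win_w // 2, 0), max(w - win_w, 0))
    y0 = min(max(gy - win_h // 2, 0), max(h - win_h, 0))
    return frame[y0:y0 + win_h, x0:x0 + win_w]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
roi = gaze_clip(frame, gaze_xy=(620, 10), win_w=128, win_h=128)
print(roi.shape)  # (128, 128, 3) — window stays inside the frame near edges
```

Because the crop is a pure array slice, it adds effectively zero latency compared to running an instance-segmentation model per query.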
Privacy, Latency, and Deployability
Running all inference locally confers several advantages:
- Privacy: No visual or spoken data leaves the device; no third-party cloud exposure.
- Latency: Predictable, network-independent processing, immune to bandwidth fluctuations.
- Sustainability: Removes recurring costs (subscriptions, high energy usage from cloud compute).
- Customizability: Supports local fine-tuning of models for specific scenarios, impossible with most proprietary APIs.
The primary trade-off is hardware-limited model complexity, which constrains the semantic depth and fluency achievable compared to cloud-scale LLMs/VLMs. However, this is mitigated by recent progress in mobile-capable VLM architectures, model quantization, and XR-specific optimization, as demonstrated by MobileVLM [7], TinyVLA [12], and benchmarking pipelines such as AIvaluateXR [15].
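The memory side of this trade-off is simple arithmetic: weight footprint scales linearly with bit-width, so 8-bit or 4-bit quantization shrinks a model roughly 4x or 8x versus float32. The parameter count below is illustrative, not a measured figure from the paper.

```python
# Weight-memory budget for a quantized model: params * bits / 8 bytes.
def weight_footprint_mb(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 1e6

# A hypothetical ~250M-parameter captioner at three precisions:
for bits in (32, 8, 4):
    print(f"{bits:>2}-bit: {weight_footprint_mb(250e6, bits):.0f} MB")
```

This is why quantized or distilled variants are the natural substitution path on memory-constrained XR headsets.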
Broader Impact and Future Directions
ClickAIXR demonstrates the technical feasibility and design trade-offs of deploying local, multimodal VLMs in XR settings. Its segmentation-free, gaze/controller-based selection is a scalable design for low-ambiguity, privacy-preserving interaction. As dedicated hardware accelerators and memory capacities proliferate in XR devices, the real-time, on-device deployment of more complex, instruction-tuned VLMs becomes increasingly viable.
Future advances will likely focus on:
- Inference acceleration via GPU or dedicated AI processors
- Model architecture compactness (quantized/distilled models)
- Instruction tuning and XR-specific dataset fine-tuning for enhanced interaction richness and reliability
- Expanded input modalities (gesture, scene-context integration) and hybrid privacy-preserving client-server approaches for scaling to larger models
Conclusion
ClickAIXR establishes a practical, deployable baseline for multimodal, object-grounded interaction in privacy-critical or connectivity-limited XR applications. Its fully on-device, segmentation-free pipeline reduces system-level latency and referential ambiguity, providing measurable system advantages for certain classes of interaction. While its usability currently lags mature cloud-based assistants, advances in XR hardware and model design are expected to close this gap, positioning ClickAIXR as a foundation for trustworthy, fine-tunable, and efficient multimodal AI in extended reality (2604.04905).