Real-time latency in fully wireless VLM earbud pipelines

Establish whether an end-to-end pipeline that captures contextually relevant imagery via dual earbud cameras, streams the data over low-bandwidth Bluetooth, performs on-device vision–language model inference, and synthesizes an audio response can meet real-time latency constraints for interactive queries.

Background

The target interaction loop involves wake-word invocation, image capture from binocular ear-level cameras, Bluetooth streaming to a host device, on-device vision–LLM processing, and text-to-speech output.

Meeting responsiveness expectations for conversational interfaces requires tight latency budgets, yet Bluetooth's limited bandwidth and the compute cost of on-device multimodal inference each introduce bottlenecks; the authors identify closing this end-to-end loop within interactive latency as an open systems challenge.
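
To make the tension concrete, here is a minimal back-of-the-envelope latency sketch in Python. Every figure in it is an illustrative assumption, not a measurement from the paper: the usable link throughput, the per-image size, and each stage timing are placeholders. Under these assumptions, shipping two ~100 KB images over a ~1 Mbps Bluetooth link alone consumes over a second, before VLM inference even begins, which is why the bandwidth constraint dominates the budget.

```python
# Hypothetical end-to-end latency budget for the earbud -> host -> audio loop.
# All numbers are illustrative assumptions, not measurements from the paper.

BT_THROUGHPUT_MBPS = 1.0   # assumed usable Bluetooth throughput (megabits/s)
IMAGE_BYTES = 100_000      # assumed compressed JPEG size per camera (~100 KB)
NUM_CAMERAS = 2            # binocular ear-level capture

def transfer_ms(num_bytes: int, throughput_mbps: float) -> float:
    """Milliseconds to move num_bytes over a link of throughput_mbps."""
    return (num_bytes * 8) / (throughput_mbps * 1e6) * 1e3

# Assumed per-stage timings for the interaction loop described above.
budget_ms = {
    "wake_word_detect": 200.0,   # on-earbud keyword spotting
    "image_capture":    100.0,   # sensor readout + JPEG encode
    "bt_transfer":      transfer_ms(IMAGE_BYTES * NUM_CAMERAS,
                                    BT_THROUGHPUT_MBPS),
    "vlm_inference":   1500.0,   # on-device VLM prefill + decode
    "tts_first_audio":  300.0,   # time to first synthesized audio
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:>16}: {ms:8.1f} ms")
print(f"{'total':>16}: {total:8.1f} ms")

TARGET_MS = 2000.0  # a common conversational-responsiveness target (~2 s)
if total <= TARGET_MS:
    print("within budget")
else:
    print(f"over budget by {total - TARGET_MS:.0f} ms")
```

With these placeholder values the sketch lands well over a 2 s target (the Bluetooth transfer alone is ~1.6 s), illustrating why image compression, link scheduling, and inference acceleration all have to improve together for the pipeline to feel interactive.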

References

Answering user queries (e.g., "Where are my keys?") requires capturing contextually relevant imagery, streaming it via low-bandwidth Bluetooth, performing multimodal inference using an on-device vision–LLM, and synthesizing an audio response. Meeting real-time latency constraints across this end-to-end pipeline remains an open systems challenge.

VueBuds: Visual Intelligence with Wireless Earbuds (2603.29095 - Kim et al., 31 Mar 2026), Introduction, RQ3 bullet