Real-time latency in fully wireless VLM earbud pipelines
Establish whether an end-to-end pipeline can meet real-time latency constraints for interactive queries when it must capture contextually relevant imagery via dual earbud cameras, stream the data over low-bandwidth Bluetooth, perform on-device vision–language model inference, and synthesize an audio response.
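As a rough illustration of how such a pipeline's latency could be budgeted, the sketch below sums per-stage delays for the four stages named above (capture, Bluetooth transfer, inference, speech synthesis) and compares the total against a target. It assumes the stages run strictly one after another with no overlap. Every number, including the frame size, link throughput, stage latencies, and the 2-second budget, is a hypothetical placeholder rather than a measurement from the paper. The names `Stage`, `transfer_time_s`, `end_to_end_latency`, and `meets_budget` are likewise illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One sequential step of the pipeline and its assumed latency in seconds."""
    name: str
    latency_s: float


def transfer_time_s(payload_bytes: int, throughput_bps: float) -> float:
    """Time to push a payload through a link with the given effective throughput (bits/s)."""
    return payload_bytes * 8 / throughput_bps


def end_to_end_latency(stages: List[Stage]) -> float:
    """Total latency, assuming stages execute back to back without overlap."""
    return sum(stage.latency_s for stage in stages)


def meets_budget(stages: List[Stage], budget_s: float) -> bool:
    """True if the summed stage latencies fit within the budget."""
    return end_to_end_latency(stages) <= budget_s


# Hypothetical placeholder values, NOT figures from the paper.
FRAME_BYTES = 20_000          # assumed compressed size of one camera frame
NUM_CAMERAS = 2               # one camera per earbud
LINK_BPS = 1_000_000          # assumed effective Bluetooth throughput (1 Mbit/s)
BUDGET_S = 2.0                # assumed interactive latency target

stages = [
    Stage("capture", 0.05),
    Stage("bluetooth_transfer", transfer_time_s(NUM_CAMERAS * FRAME_BYTES, LINK_BPS)),
    Stage("vlm_inference", 0.8),
    Stage("speech_synthesis", 0.2),
]

total = end_to_end_latency(stages)

if __name__ == "__main__":
    for stage in stages:
        print(f"{stage.name:20s} {stage.latency_s:.2f} s")
    print(f"{'total':20s} {total:.2f} s (budget {BUDGET_S:.2f} s)")
    print("within budget" if meets_budget(stages, BUDGET_S) else "over budget")
```

Under these placeholder values the image transfer alone accounts for a sizeable share of the total, which is one way to see why the low-bandwidth link matters to the question; real measurements would be needed to draw any conclusion.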
References
Answering user queries (e.g., "Where are my keys?") requires capturing contextually relevant imagery, streaming it via low-bandwidth Bluetooth, performing multimodal inference using an on-device vision–LLM, and synthesizing an audio response. Meeting real-time latency constraints across this end-to-end pipeline remains an open systems challenge.
— VueBuds: Visual Intelligence with Wireless Earbuds
(2603.29095 - Kim et al., 31 Mar 2026) in Introduction, RQ3 bullet