DESAMO: On-Device Audio LLM for Elderly Homes
- DESAMO is an on-device smart home system that integrates Audio LLMs to process raw audio for nuanced intent detection and emergency alerts.
- It employs a unified architecture with a 16-bit audio encoder and a 4-bit LLM, ensuring real-time and resource-efficient processing on edge devices.
- By bypassing traditional ASR pipelines, DESAMO reduces transcription errors and strengthens privacy for elderly users through fully on-device inference.
DESAMO is an on-device smart home system for elder-friendly environments built on embedded audio large language models (Audio LLMs) to enable natural, private, and robust multimodal interaction. Unlike conventional systems that rely on automatic speech recognition (ASR) pipelines for speech transcription or ASR-to-LLM cascades, DESAMO processes raw audio—including both speech and non-speech signals—directly, supporting nuanced intent detection and critical event recognition within a unified architecture (Choi et al., 26 Aug 2025).
1. System Architecture and Audio Processing Pipeline
DESAMO is built atop the Qwen2.5-Omni 3B model, employing an audio encoder derived from the Whisper large-v3 architecture to process raw waveforms. The encoder, denoted f_enc, transforms the input audio waveform x into a semantic embedding e = f_enc(x). These embeddings encapsulate pertinent acoustic and contextual information, supporting discrimination between speech and complex environmental sounds.
All model execution occurs on local edge hardware (specifically, an NVIDIA Jetson Orin Nano) with the models stored in the compact GGUF format. The architecture comprises a 16-bit quantized audio encoder and a 4-bit quantized LLM, which together ensure real-time, on-device inference within tightly bounded resource requirements. This configuration minimizes latency and avoids transmission of sensitive data, which is crucial for privacy preservation in elder-centered applications.
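The overall flow can be sketched as below. This is a minimal illustration rather than the released implementation: the class names (AudioEncoder, QuantizedAudioLLM), the FunctionCall container, and the embedding dimensionality are assumptions, and NumPy placeholders stand in for the quantized GGUF models that run on the Jetson.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FunctionCall:
    """Structured control object emitted by the Audio LLM."""
    name: str          # e.g. "Call" or "Alert"
    argument: str      # e.g. "daughter" or "fall"

class AudioEncoder:
    """Stand-in for the Whisper-large-v3-derived encoder (16-bit quantized).

    Maps a raw waveform to a semantic embedding; dimensions are illustrative.
    """
    def __init__(self, embed_dim: int = 1280):
        self.embed_dim = embed_dim

    def encode(self, waveform: np.ndarray) -> np.ndarray:
        # Placeholder: the real encoder extracts acoustic/semantic features.
        return np.zeros(self.embed_dim, dtype=np.float16)

class QuantizedAudioLLM:
    """Stand-in for the 4-bit quantized Qwen2.5-Omni 3B decoder."""
    def generate(self, embedding: np.ndarray, prompt: str) -> FunctionCall:
        # Placeholder: the real model conditions on the audio embedding plus
        # a task prompt and emits a structured function call.
        return FunctionCall(name="Call", argument="daughter")

def run_pipeline(waveform: np.ndarray, prompt: str) -> FunctionCall:
    """Raw audio -> embedding -> structured function call, all on-device."""
    embedding = AudioEncoder().encode(waveform)
    return QuantizedAudioLLM().generate(embedding, prompt)

if __name__ == "__main__":
    window = np.zeros(16000 * 3, dtype=np.float32)  # 3 s of 16 kHz audio
    print(run_pipeline(window, "Map this utterance to a device action."))
```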
2. Voice Intent Classification and Function Call Mapping
DESAMO directly maps user utterances to system actions by extending the function calling paradigm to the audio domain. When a trigger phrase is detected, the system captures a short audio window, encodes it as a semantic embedding, and conditions the Audio LLM with an appropriate prompt. The LLM’s output is a structured control object representing an actionable function call. For instance, a spoken command such as “call my daughter” is translated directly into a system call of the form Call('daughter'), bypassing the need for intermediate transcription and classical natural language parsing.
This direct semantic mapping reduces error propagation due to ASR inaccuracies—particularly critical when handling unclear or indirect speech patterns prevalent among elderly users. Furthermore, by generating structured outputs, DESAMO provides a deterministic interface for smart home device control and external service integration.
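A sketch of how such structured outputs might be consumed is shown below. The output syntax Call('daughter') comes from the system description, but the parsing regex, the handler registry, and the specific device bindings are illustrative assumptions.

```python
import re
from typing import Callable, Dict

# Illustrative handler registry; the actual device and service bindings are
# deployment-specific and not specified in the source.
HANDLERS: Dict[str, Callable[[str], None]] = {
    "Call":  lambda who: print(f"Dialing contact: {who}"),
    "Alert": lambda kind: print(f"Raising alert: {kind}"),
    "Light": lambda state: print(f"Setting lights: {state}"),
}

CALL_PATTERN = re.compile(r"^(?P<fn>\w+)\('(?P<arg>[^']*)'\)$")

def dispatch(llm_output: str) -> None:
    """Parse a structured output such as Call('daughter') and execute it.

    Assumes the Audio LLM emits exactly one function-call string; unknown
    functions are rejected so the control interface stays deterministic.
    """
    match = CALL_PATTERN.match(llm_output.strip())
    if match is None:
        raise ValueError(f"Unrecognized control output: {llm_output!r}")
    fn, arg = match.group("fn"), match.group("arg")
    handler = HANDLERS.get(fn)
    if handler is None:
        raise ValueError(f"No handler registered for function {fn!r}")
    handler(arg)

dispatch("Call('daughter')")   # -> Dialing contact: daughter
```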
3. Emergency and Environmental Event Detection
In addition to explicit voice commands, DESAMO performs continuous, real-time monitoring of ambient audio for emergency detection. Periodic short audio segments are processed with dedicated prompts to the Audio LLM to identify critical events, including falls and distress cries. The model produces structured alerts such as Alert('fall') or Alert('help'); these are directly linked to responsive actions, such as activating alarms or sending notifications to caregivers.
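A monitoring loop in this spirit might look like the following sketch. The polling interval, prompt wording, and the capture and notification stubs are assumptions, and run_pipeline refers to the placeholder pipeline sketched earlier rather than the actual model runtime.

```python
import time
import numpy as np

EMERGENCY_PROMPT = (
    "Listen to this ambient audio segment. If it contains a fall, a cry for "
    "help, or another critical event, answer with Alert('<event>'); "
    "otherwise answer None."
)

def capture_window(seconds: float = 4.0, sample_rate: int = 16000) -> np.ndarray:
    # Placeholder for microphone capture on the edge device.
    return np.zeros(int(seconds * sample_rate), dtype=np.float32)

def notify_caregiver(event: str) -> None:
    # Placeholder for an alarm trigger or caregiver notification channel.
    print(f"[ALERT] Detected event: {event}")

def monitor_forever(run_pipeline, poll_interval: float = 4.0) -> None:
    """Periodically pass short ambient audio windows to the Audio LLM."""
    while True:
        window = capture_window(seconds=poll_interval)
        result = run_pipeline(window, EMERGENCY_PROMPT)
        if result is not None and result.name == "Alert":
            notify_caregiver(result.argument)   # e.g. Alert('fall')
        time.sleep(poll_interval)
```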
The unified architecture for both user command and anomaly detection is informed by semantic and acoustic cues present in the raw audio, allowing the system to robustly distinguish between routine interactions and emergency signals—even in the presence of background noise or ambiguous acoustic environments.
4. Addressing Limitations of ASR-Based Pipelines
Conventional voice assistants employ ASR pipelines that transcribe speech before interpretation. This setup frequently suffers from two major challenges: propagation of transcription errors—aggravated by slurred or unclear speech common in elderly populations—and restriction to speech-based interaction, precluding detection of non-speech audio events.
DESAMO’s direct Audio LLM approach circumvents both limitations by jointly reasoning over the raw acoustic input. Speech and non-speech signals are both embedded and interpreted in a single, end-to-end pass, fundamentally expanding the range of detectable events and substantially mitigating cascading error from upstream ASR components.
5. Privacy and Adaptation for Elderly Users
All acoustic signal processing, embedding, and LLM inference are performed strictly on-device. No audio data are transmitted to external servers or cloud backends, ensuring maximal privacy—a foundational criterion for acceptance in eldercare scenarios where concerns regarding surveillance and data misuse are substantial.
The system is explicitly designed for natural, indirect, and even ambiguous user interactions. By directly interpreting intent from semantic embeddings (rather than depending on precise or repeated phrasings), DESAMO reduces the interaction burden and need for user re-training, thus supporting accessibility and adoption among populations with speech or cognitive impairments.
6. Hardware Deployment, Model Quantization, and Efficiency
The deployment targets resource-constrained edge devices, exemplified by the NVIDIA Jetson Orin Nano. The models are packaged in the compact GGUF format, with quantization applied such that the audio encoder operates at 16-bit precision and the LLM component is reduced to 4-bit precision. This hardware-efficient design ensures real-time operation with bounded energy and computational resource consumption. The strictly on-device pipeline also eliminates network-induced latency and failure modes.
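The source does not name the runtime used to execute the GGUF files; one plausible setup, sketched below, loads the 4-bit decoder with llama-cpp-python and offloads its layers to the Orin Nano's GPU. The model file name, context size, and prompt are illustrative, and injecting the 16-bit audio-encoder embeddings would rely on the runtime's multimodal support, which is omitted here.

```python
# Illustrative only: the file name and parameters are assumptions, and feeding
# the audio-encoder embeddings into the decoder is runtime-specific and not
# shown.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-omni-3b-q4_k_m.gguf",  # hypothetical 4-bit GGUF file
    n_ctx=2048,          # a short context suffices for single-turn commands
    n_gpu_layers=-1,     # offload all layers to the Orin Nano's GPU
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Respond with exactly one function call."},
        {"role": "user", "content": "<audio embedding placeholder> call my daughter"},
    ],
    max_tokens=32,
)
print(response["choices"][0]["message"]["content"])
```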
7. Summary and Impact
DESAMO exemplifies the integration of cutting-edge Audio LLM technologies into privacy-preserving, on-device smart home systems for elder-friendly environments. By bypassing the limitations of ASR-based workflows, directly embedding and interpreting both speech and non-speech audio, and unifying user intent and contextual abnormality detection in a low-resource package, DESAMO provides a robust platform for naturalistic and adaptive smart home interaction (Choi et al., 26 Aug 2025).