Speech Understanding on Tiny Devices with A Learning Cache (2311.18188v4)
Abstract: This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We leverage temporal locality in the speech inputs to a device and reuse recent SLU inferences accordingly. Our idea is simple: let the device match incoming inputs against cached results, and offload to the cloud for full inference only those inputs that match no cached result. Realizing this idea, however, is non-trivial: the device needs to compare acoustic features in a robust yet low-cost way. To this end, we present SpeechCache (or SC), a speech cache for tiny devices. It matches speech inputs at two levels of representation: first as sequences of clustered raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary tradeoffs between cost and efficiency. To boost accuracy even further, our cache learns to personalize: using the inputs that were mismatched and then offloaded, it continuously fine-tunes the device's feature extractors with the assistance of the cloud. We implement SC on an off-the-shelf STM32 microcontroller. The complete implementation has a small memory footprint of 2 MB. Evaluated on challenging speech benchmarks, our system resolves 45%-90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services. The benefit of SC is notable even in adversarial settings: noisy environments, a cold cache, or one device shared by several users.
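The match-then-offload flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how sequences are compared, so normalized edit distance stands in as an assumed similarity measure, and the names `TwoLevelCache`, `handle`, `offload`, and both thresholds are hypothetical. Inputs are assumed to be already converted to a sequence of cluster IDs (level 1) and a sequence of phonemes (level 2).

```python
from typing import Callable, List, Optional, Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


class TwoLevelCache:
    """Hypothetical cache keyed by sound-unit and phoneme sequences."""

    def __init__(self, unit_threshold: float = 0.2,
                 phoneme_threshold: float = 0.2) -> None:
        # Thresholds are illustrative assumptions, not values from the paper.
        self.thresholds = (unit_threshold, phoneme_threshold)
        self.entries: List[Tuple[Sequence, Sequence, str]] = []

    def _best(self, query: Sequence, level: int) -> Optional[Tuple[float, str]]:
        """Return (normalized distance, cached result) of the closest entry."""
        best = None
        for entry in self.entries:
            ref = entry[level]
            dist = edit_distance(query, ref) / max(len(query), len(ref), 1)
            if best is None or dist < best[0]:
                best = (dist, entry[2])
        return best

    def lookup(self, units: Sequence, phonemes: Sequence) -> Optional[Tuple[str, str]]:
        """Try level 1 (sound units) first, then level 2 (phonemes)."""
        for level, (query, name) in enumerate(((units, "units"),
                                               (phonemes, "phonemes"))):
            best = self._best(query, level)
            if best is not None and best[0] <= self.thresholds[level]:
                return best[1], name
        return None

    def insert(self, units: Sequence, phonemes: Sequence, result: str) -> None:
        self.entries.append((tuple(units), tuple(phonemes), result))


def handle(cache: TwoLevelCache, units: Sequence, phonemes: Sequence,
           offload: Callable[[Sequence, Sequence], str]) -> Tuple[str, str]:
    """Resolve an input from the cache, or offload it and cache the answer."""
    hit = cache.lookup(units, phonemes)
    if hit is not None:
        return hit
    result = offload(units, phonemes)  # stand-in for full cloud inference
    cache.insert(units, phonemes, result)
    return result, "cloud"
```

For simplicity the sketch takes both representations up front; on a constrained device the second-level representation would more plausibly be computed only after a first-level miss. The fine-tuning of feature extractors on offloaded inputs is not modeled here.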