FANS: Fusing ASR and NLU for on-device SLU (2111.00400v1)

Published 31 Oct 2021 in cs.CL, cs.SD, and eess.AS

Abstract: Spoken language understanding (SLU) systems translate voice input commands into semantics encoded as an intent and pairs of slot tags and values. Most current SLU systems deploy a cascade of two neural models: the first maps the input audio to a transcript (ASR), and the second predicts the intent and slots from the transcript (NLU). In this paper, we introduce FANS, a new end-to-end SLU model that fuses an ASR audio encoder with a multi-task NLU decoder to infer the intent, slot tags, and slot values directly from a given input audio, obviating the need for transcription. FANS consists of a shared audio encoder and three decoders, two of which are seq-to-seq decoders that predict non-null slot tags and slot values in parallel and in an auto-regressive manner. FANS' neural encoder and decoder architectures are flexible, which allows us to leverage different combinations of LSTM, self-attention, and attenders. Our experiments show that, compared to state-of-the-art end-to-end SLU models, FANS reduces ICER and IRER relatively by 30% and 7%, respectively, when tested on an in-house SLU dataset, and by 0.86% and 2% absolute when tested on a public SLU dataset.
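
For orientation, the following is a minimal PyTorch sketch of the kind of shared-encoder, three-decoder arrangement the abstract describes: one audio encoder feeding an intent classifier and two auto-regressive decoders for slot tags and slot values. All module names, dimensions, attention choices, and the pooling used for the intent head are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a FANS-style model: a shared audio encoder feeding an
# intent head and two auto-regressive decoders (slot tags and slot values).
# Hyperparameters and module choices are illustrative, not from the paper.
import torch
import torch.nn as nn


class SharedAudioEncoder(nn.Module):
    """Encodes audio features (e.g., log-mel frames) into a hidden sequence."""
    def __init__(self, feat_dim=80, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        out, _ = self.lstm(feats)             # (B, T, 2 * hidden)
        return out


class AttentiveDecoder(nn.Module):
    """Auto-regressive seq-to-seq decoder attending over encoder outputs
    (one instance each for slot tags and slot values)."""
    def __init__(self, vocab, enc_dim=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.enc_proj = nn.Linear(enc_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, enc_out, prev_tokens):  # prev_tokens: (B, U)
        enc = self.enc_proj(enc_out)          # (B, T, hidden)
        emb = self.embed(prev_tokens)         # (B, U, hidden)
        ctx, _ = self.attn(emb, enc, enc)     # attend to the audio encoding
        dec, _ = self.rnn(torch.cat([emb, ctx], dim=-1))
        return self.out(dec)                  # (B, U, vocab)


class FANSSketch(nn.Module):
    """Shared encoder + intent head + parallel slot-tag / slot-value decoders."""
    def __init__(self, n_intents=64, tag_vocab=128, value_vocab=4096):
        super().__init__()
        self.encoder = SharedAudioEncoder()
        self.intent_head = nn.Linear(512, n_intents)
        self.tag_decoder = AttentiveDecoder(tag_vocab)
        self.value_decoder = AttentiveDecoder(value_vocab)

    def forward(self, feats, tag_inputs, value_inputs):
        enc = self.encoder(feats)
        intent_logits = self.intent_head(enc.mean(dim=1))     # pooled utterance intent
        tag_logits = self.tag_decoder(enc, tag_inputs)        # non-null slot tags
        value_logits = self.value_decoder(enc, value_inputs)  # corresponding slot values
        return intent_logits, tag_logits, value_logits
```

In this sketch the two slot decoders run independently over the same encoder output, mirroring the paper's description of predicting slot tags and slot values in parallel; at inference time each would decode auto-regressively from a start token rather than from teacher-forced inputs.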

Authors (4)
  1. Martin Radfar (17 papers)
  2. Athanasios Mouchtaris (31 papers)
  3. Siegfried Kunzmann (13 papers)
  4. Ariya Rastrow (55 papers)
Citations (12)
