Retrieval Augmented End-to-End Spoken Dialog Models (2402.01828v1)

Published 2 Feb 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: We recently developed SLM, a joint speech and language model that fuses a pretrained foundational speech model with a large language model (LLM) while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal. Task-oriented dialogs often contain domain-specific entities, e.g., restaurant, hotel, train station, and city names, which are difficult to recognize yet critical for downstream applications. Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval-augmented SLM (ReSLM) that overcomes this weakness. We first train a speech retriever to retrieve text entities mentioned in the audio. The retrieved entities are then added as text inputs to the underlying SLM to bias model predictions. We evaluated ReSLM on the speech-aware MultiWOZ task (DSTC-11 challenge) and found that this retrieval augmentation boosts model performance, improving joint goal accuracy (38.6% vs. 32.7%), slot error rate (20.6% vs. 24.8%), and ASR word error rate (5.5% vs. 6.7%). While demonstrated on dialog state tracking, our approach is broadly applicable to other speech tasks requiring contextual information or domain-specific entities, such as contextual ASR with biasing capability.

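To make the described pipeline concrete, here is a minimal sketch of the retrieve-then-bias flow from the abstract: a dual-encoder retriever ranks candidate entity strings against an utterance embedding, and the top hits are appended as text context for the downstream speech-and-language model. The encoder outputs, entity list, and function names below are hypothetical placeholders for illustration, not the authors' released models or data.

```python
# Hypothetical sketch of a ReSLM-style retrieve-then-bias flow (not the authors' code).
from typing import List
import numpy as np


def retrieve_entities(
    audio_embedding: np.ndarray,      # (d,) utterance embedding from a speech encoder
    entity_texts: List[str],          # candidate domain entities (hotels, stations, ...)
    entity_embeddings: np.ndarray,    # (n, d) embeddings from a text encoder
    top_k: int = 5,
) -> List[str]:
    """Dual-encoder retrieval: rank entities by cosine similarity to the audio embedding."""
    a = audio_embedding / (np.linalg.norm(audio_embedding) + 1e-8)
    e = entity_embeddings / (np.linalg.norm(entity_embeddings, axis=1, keepdims=True) + 1e-8)
    scores = e @ a
    top = np.argsort(-scores)[:top_k]
    return [entity_texts[i] for i in top]


def build_biased_prompt(instruction: str, retrieved: List[str]) -> str:
    """Append retrieved entities as extra text input to bias the SLM's predictions."""
    return f"{instruction}\nRelevant entities: {'; '.join(retrieved)}"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    entities = ["Cambridge Station", "Hotel du Vin", "Pizza Hut City Centre"]
    # Stand-ins for real speech/text encoder outputs.
    audio_emb = rng.normal(size=256)
    entity_embs = rng.normal(size=(len(entities), 256))

    hits = retrieve_entities(audio_emb, entities, entity_embs, top_k=2)
    prompt = build_biased_prompt("Track the dialog state for this user turn.", hits)
    print(prompt)  # this text, alongside the audio features, would be fed to the SLM
```

In the paper's setting the retrieved strings bias dialog state tracking; the same pattern could supply biasing context for contextual ASR, as noted in the abstract.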
Authors (7)
  1. Mingqiu Wang (20 papers)
  2. Izhak Shafran (30 papers)
  3. Hagen Soltau (19 papers)
  4. Wei Han (202 papers)
  5. Yuan Cao (201 papers)
  6. Dian Yu (78 papers)
  7. Laurent El Shafey (15 papers)
Citations (9)