
SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering (2401.13463v3)

Published 24 Jan 2024 in cs.CL, cs.IR, cs.SD, and eess.AS

Abstract: Spoken Question Answering (SQA) requires machines to answer a user's question by finding the answer span within a given spoken passage. SQA has previously been achieved without ASR, to avoid recognition errors and Out-of-Vocabulary (OOV) problems. However, the real-world problem of open-domain SQA (openSQA), in which the machine must additionally retrieve passages that possibly contain the answer from a spoken archive, has not previously been considered. This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. SpeechDPR learns a sentence-level semantic representation by distilling knowledge from a cascading model of unsupervised ASR (UASR) and a text dense retriever (TDR); no manually transcribed speech data is needed. Initial experiments showed performance comparable to that of the cascaded UASR-TDR model, and significantly better performance when the UASR was poor, verifying that this approach is more robust to speech recognition errors.
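The two ingredients the abstract describes — distilling the cascaded teacher's passage embeddings into a speech-side student encoder without transcripts, then ranking passages by inner product against the question embedding — can be sketched in a toy form as below. This is a minimal illustration only: the linear "speech encoder", the per-passage feature vectors, and the stand-in teacher are all hypothetical simplifications, not the authors' model or released code.

```python
import numpy as np

DIM = 8  # toy embedding dimension

def teacher_embed(passage_id: int) -> np.ndarray:
    """Stand-in for the cascaded teacher (UASR transcript -> TDR embedding).
    Here it is simply a fixed unit-norm random vector per passage."""
    r = np.random.default_rng(1000 + passage_id)
    v = r.normal(size=DIM)
    return v / np.linalg.norm(v)

class SpeechStudent:
    """Toy linear 'speech encoder' distilled to mimic teacher embeddings."""
    def __init__(self, in_dim: int, out_dim: int):
        self.W = np.zeros((out_dim, in_dim))

    def embed(self, speech_feats: np.ndarray) -> np.ndarray:
        return self.W @ speech_feats

    def distill_step(self, speech_feats, target, lr=0.1) -> float:
        # One gradient step on the L2 distillation loss ||W x - t||^2;
        # no transcript of the speech is ever used, only the teacher vector.
        err = self.embed(speech_feats) - target
        self.W -= lr * 2.0 * np.outer(err, speech_feats)
        return float(err @ err)

def retrieve(question_vec: np.ndarray, passage_vecs: np.ndarray) -> int:
    # Dense retrieval: rank passages by inner product with the question.
    return int(np.argmax(passage_vecs @ question_vec))

# Three "spoken passages", each represented by a unit-norm feature vector.
rng = np.random.default_rng(42)
feats = {}
for pid in range(3):
    v = rng.normal(size=DIM)
    feats[pid] = v / np.linalg.norm(v)

# Distill: the student learns to reproduce the teacher's passage embeddings.
student = SpeechStudent(DIM, DIM)
for _ in range(300):
    for pid, x in feats.items():
        loss = student.distill_step(x, teacher_embed(pid))

# Index the spoken passages with the student, then retrieve. In this toy
# setup the question embedding is just the teacher vector of the relevant
# passage; a real system would use a separate question encoder.
passage_vecs = np.stack([student.embed(feats[pid]) for pid in range(3)])
best = retrieve(teacher_embed(2), passage_vecs)
```

After distillation, `best` identifies the passage whose student embedding best matches the question, which is the sense in which the end-to-end retriever can replace the UASR-plus-TDR cascade at inference time.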

Authors (8)
  1. Chyi-Jiunn Lin (6 papers)
  2. Guan-Ting Lin (21 papers)
  3. Yung-Sung Chuang (37 papers)
  4. Wei-Lun Wu (2 papers)
  5. Shang-Wen Li (55 papers)
  6. Abdelrahman Mohamed (59 papers)
  7. Hung-yi Lee (325 papers)
  8. Lin-Shan Lee (42 papers)
Citations (3)