Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation (2310.14278v2)

Published 22 Oct 2023 in cs.SD, cs.CL, and eess.AS

Abstract: Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversation-level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on the Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
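
The abstract names two additions to the Conformer baseline: a cross-modal context extractor with a modal-level mask over historical speech and text embeddings, and conditional latent variational modules for conversation-level attributes. The following PyTorch sketch is a hypothetical illustration of how such components could be wired together; the module names, dimensions, concatenation-based fusion, and mean-pooled latent inference are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of (1) a cross-modal context extractor with a modal-level mask
# and (2) a conditional latent variational module, as described in the abstract.
# All names and sizes are illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn


class CrossModalContextExtractor(nn.Module):
    """Fuses historical speech and text embeddings, with a modal-level mask."""

    def __init__(self, d_model: int = 256, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Learned placeholder embeddings that stand in for a masked-out modality.
        self.speech_mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.text_mask_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, speech_ctx, text_ctx, drop_text=False, drop_speech=False):
        # speech_ctx, text_ctx: (batch, time, d_model) from pre-trained encoders.
        if drop_text:
            text_ctx = self.text_mask_token.expand_as(text_ctx)
        if drop_speech:
            speech_ctx = self.speech_mask_token.expand_as(speech_ctx)
        fused = torch.cat([speech_ctx, text_ctx], dim=1)  # concat along time
        return self.encoder(fused)  # (batch, time_speech + time_text, d_model)


class ConditionalLatentModule(nn.Module):
    """VAE-style module encoding conversation-level attributes into a latent vector."""

    def __init__(self, d_model: int = 256, d_latent: int = 64):
        super().__init__()
        self.to_stats = nn.Linear(d_model, 2 * d_latent)
        self.from_latent = nn.Linear(d_latent, d_model)

    def forward(self, context):
        # context: (batch, time, d_model); pool over time before inferring the latent.
        pooled = context.mean(dim=1)
        mu, logvar = self.to_stats(pooled).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return self.from_latent(z), kl


if __name__ == "__main__":
    extractor = CrossModalContextExtractor()
    latent = ConditionalLatentModule()
    speech_ctx = torch.randn(2, 50, 256)  # embeddings of previous turns' audio
    text_ctx = torch.randn(2, 20, 256)    # embeddings of previous turns' transcripts
    # Dropping the text modality during training is one plausible way to teach the
    # model to fall back on acoustic history, so that recognition errors in earlier
    # hypotheses are not propagated at inference time.
    fused = extractor(speech_ctx, text_ctx, drop_text=True)
    conv_vec, kl = latent(fused)
    print(fused.shape, conv_vec.shape, kl.item())
```

In this reading, the conversation-level latent vector and the fused cross-modal context would both be fed to the Conformer decoder as additional conditioning; the exact injection points are not specified in the abstract and are left out of the sketch.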

Authors (6)
  1. Kun Wei (23 papers)
  2. Bei Li (51 papers)
  3. Hang Lv (15 papers)
  4. Quan Lu (13 papers)
  5. Ning Jiang (177 papers)
  6. Lei Xie (337 papers)
Citations (3)
