
Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model (2310.14946v1)

Published 23 Oct 2023 in cs.MM, cs.SD, and eess.AS

Abstract: We present a novel approach to multilingual audio-visual speech recognition that trains a single model on a multilingual dataset. Motivated by the human cognitive system, in which people intuitively distinguish different languages without conscious effort or guidance, we propose a model that identifies the language of the input speech by capturing the inherent similarities and differences between languages. To do so, we apply a prompt fine-tuning technique to a large pre-trained audio-visual representation model so that the network can recognize both the language class and the speech in the corresponding language. Our work contributes to developing robust and efficient multilingual audio-visual speech recognition systems, reducing the need for language-specific models.

Authors (3)
  1. Joanna Hong (13 papers)
  2. Se Jin Park (15 papers)
  3. Yong Man Ro (91 papers)
Citations (3)
