
Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset (2201.02419v2)

Published 7 Jan 2022 in cs.CL, cs.SD, and eess.AS

Abstract: Automatic speech recognition (ASR) for low-resource languages improves linguistic minorities' access to the technological advantages provided by AI. In this paper, we address the problem of data scarcity for Hong Kong Cantonese by creating a new Cantonese dataset. Our dataset, the Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size, and availability. We further conduct experiments with the Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and on our proposed MDCC, and the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.

Authors (12)
  1. Tiezheng Yu (29 papers)
  2. Rita Frieske (11 papers)
  3. Peng Xu (357 papers)
  4. Samuel Cahyawijaya (75 papers)
  5. Cheuk Tung Shadow Yiu (2 papers)
  6. Holy Lovenia (30 papers)
  7. Wenliang Dai (24 papers)
  8. Elham J. Barezi (13 papers)
  9. Qifeng Chen (187 papers)
  10. Xiaojuan Ma (74 papers)
  11. Bertram E. Shi (28 papers)
  12. Pascale Fung (151 papers)
Citations (7)
