RoDia: A New Dataset for Romanian Dialect Identification from Speech (2309.03378v3)

Published 6 Sep 2023 in cs.CL, cs.SD, and eess.AS

Abstract: We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
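The abstract reports both a macro and a micro F1 score. As a minimal sketch of how the two differ, the snippet below computes both for single-label multi-class predictions in plain Python. The function name `f1_scores` and the toy labels "A", "B", "C" are illustrative assumptions; they are not taken from the paper, its dataset, or its code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels (not RoDia data): class "A" dominates and is always
# predicted correctly, while the rare classes "B" and "C" are confused.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "C", "B"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

In this toy case the macro score is lower than the micro score because the two rare classes are never predicted correctly. For single-label classification, micro F1 coincides with accuracy.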
