
Wav2Gloss: Generating Interlinear Glossed Text from Speech (2403.13169v2)

Published 19 Mar 2024 in cs.CL

Abstract: Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free translations to a majority language. We propose Wav2Gloss: a task in which these four annotation components are extracted automatically from speech, and introduce the first dataset to this end, Fieldwork: a corpus of speech with all these annotations, derived from the work of field linguists, covering 37 languages, with standard formatting, and train/dev/test splits. We provide various baselines to lay the groundwork for future research on IGT generation from speech, such as end-to-end versus cascaded, monolingual versus multilingual, and single-task versus multi-task approaches.


Summary

  • The paper introduces Wav2Gloss, a task for automatically extracting interlinear glossed text from speech, together with the novel 37-language Fieldwork dataset.
  • It compares end-to-end models with cascaded ASR-then-gloss pipelines, highlighting the benefits of direct sequence-to-sequence approaches.
  • Findings show that end-to-end systems reduce error propagation and exploit pre-trained lexical knowledge to improve IGT annotation quality.

Generating Interlinear Glossed Text from Speech with Wav2Gloss

Introduction to Wav2Gloss

The documentation of endangered languages is critical to preserving cultural and linguistic diversity. Interlinear Glossed Text (IGT), a fundamental tool in linguistic research, makes unfamiliar languages analyzable by pairing each word or morpheme with a gloss and a free translation. The proposed Wav2Gloss task seeks to automate the extraction of IGT annotations directly from speech recordings, a process that has traditionally demanded intensive manual labor. This post examines the task, the challenges it addresses, the methods employed, and the implications of the associated paper.
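To make the four-tier structure concrete, here is a minimal sketch of a single IGT record in Python. The Turkish example ("evlerde", "in the houses") and the field names are illustrative choices, not the paper's actual schema.

```python
# A single IGT record with the four annotation tiers described above.
# The Turkish example and the dict field names are illustrative assumptions.
igt = {
    "transcription": "evlerde",      # (1) transcription of the speech
    "segmentation": "ev-ler-de",     # (2) morphological segmentation
    "gloss": "house-PL-LOC",         # (3) morpheme-by-morpheme gloss
    "translation": "in the houses",  # (4) free translation
}

# Print in the conventional interlinear layout, one tier per line.
for tier in ("transcription", "segmentation", "gloss", "translation"):
    print(f"{tier:>13}: {igt[tier]}")
```

Note that the segmentation and gloss tiers are aligned morpheme by morpheme: each hyphen-separated segment in one corresponds to exactly one segment in the other.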

The Fieldwork Dataset

At the core of this research is the Fieldwork dataset, designed specifically for the Wav2Gloss task. Covering 37 languages, Fieldwork is the first dataset to pair speech recordings with complete IGT annotations: transcriptions, morphological segmentation, glosses, and free translations. Building it required careful selection, standardization, and partitioning into train/dev/test splits, underscoring both the complexity of linguistic field data and the attention to detail that dataset construction demands.
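As a sketch of what such standardization implies, the check below verifies that a corpus record carries audio plus all four annotation tiers before it enters a train/dev/test split. The record layout is an assumption for illustration, not Fieldwork's released format.

```python
# Tiers every Wav2Gloss training example needs, per the task definition.
TIERS = ("transcription", "segmentation", "gloss", "translation")

def is_complete(record: dict) -> bool:
    """Return True if a record has audio and a non-empty value for each tier.

    The flat-dict layout is a hypothetical stand-in for the actual format.
    """
    if not record.get("audio"):
        return False
    return all(record.get(tier) for tier in TIERS)
```

A filter like this would typically run once per language before partitioning, so that every split contains only fully annotated utterances.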

Methodological Overview

The paper compares two primary approaches to the Wav2Gloss task: end-to-end models and cascaded systems. The end-to-end approach adapts pre-trained speech models (WavLM, XLS-R, and OWSM) for sequence-to-sequence prediction and maps speech directly to IGT annotations. The cascaded system first transcribes speech into text with an ASR model, then applies a text-to-gloss model to the transcription. Comparing the two isolates the effects of model choice and training strategy on the quality of IGT generation.
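The two system shapes can be contrasted in a few lines. The model callables below are placeholders (assumptions for illustration), not the actual WavLM, XLS-R, or OWSM interfaces.

```python
# End-to-end: one sequence-to-sequence model maps speech directly to IGT.
def end_to_end(audio, speech2igt):
    # speech2igt is a placeholder for a seq2seq model over audio features.
    return speech2igt(audio)  # e.g. {"transcription": ..., "gloss": ...}

# Cascaded: ASR first, then text-to-gloss. Any ASR mistake reaches stage 2
# unchanged, which is the error-propagation risk discussed in the paper.
def cascaded(audio, asr, text2gloss):
    transcription = asr(audio)            # stage 1: speech -> text
    annotations = text2gloss(transcription)  # stage 2: text -> gloss tiers
    return {"transcription": transcription, **annotations}
```

Both functions return the same dictionary shape, which is what makes a head-to-head evaluation of the two architectures straightforward.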

Analytical Insights

The comparison between end-to-end and cascaded systems reveals nuanced performance differences across the IGT annotation subtasks. Notably, end-to-end systems perform best on translation and glossing, where pre-trained decoders contribute lexical knowledge. The cascaded approach, despite its anticipated advantage in leveraging text-based annotation models, suffers from error propagation, an issue far less pronounced in end-to-end systems. The analysis also exposes the limits of multi-task learning in this context: modeling the diverse annotation tasks simultaneously appears to cause interference between them.

Future Directions and Theoretical Implications

The Wav2Gloss task opens new directions for research in language documentation and computational linguistics. The paper's findings underscore the need for machine learning models that can handle the complexities inherent in producing linguistic annotation from speech. Future work may explore more sophisticated model architectures, novel pre-training strategies, or multimodal approaches that combine speech and text inputs. Theoretically, this research advances our understanding of multilingual model adaptation and of transferring knowledge to languages with limited resources.

Conclusion

Wav2Gloss represents a pioneering step toward automating the generation of IGT from speech, with significant implications for documenting and preserving endangered languages. Through the Fieldwork dataset and its baseline models, the research lays the groundwork for future work in this domain and challenges the computational linguistics community to devise solutions to the intricate problem of annotating linguistic field data. As the effort progresses, it promises to make language documentation substantially more efficient, supporting the broader goals of linguistic diversity and cultural heritage preservation.
