Wav2Gloss: Generating Interlinear Glossed Text from Speech (2403.13169v2)
Abstract: Thousands of the world's languages are in danger of extinction, a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free translations into a majority language. We propose Wav2Gloss: a task in which these four annotation components are extracted automatically from speech, and introduce the first dataset to this end, Fieldwork: a corpus of speech with all these annotations, derived from the work of field linguists, covering 37 languages, with standard formatting and train/dev/test splits. We provide baselines that compare end-to-end with cascaded, monolingual with multilingual, and single-task with multi-task approaches, laying the groundwork for future research on IGT generation from speech.
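To make the four IGT components concrete, the sketch below models a single annotated utterance as a small Python record. The class name `IGTRecord`, its field names, the hyphen-based alignment check, and the Spanish example sentence are illustrative assumptions; they are not the format of the Fieldwork corpus or the authors' code.

```python
from dataclasses import dataclass


@dataclass
class IGTRecord:
    """One utterance with the four IGT tiers named in the abstract (hypothetical layout)."""
    transcription: str   # (1) surface transcription
    segmentation: str    # (2) morphemes, hyphen-separated within each word
    gloss: str           # (3) one gloss label per morpheme
    translation: str     # (4) free translation into a majority language

    def aligned(self) -> bool:
        """Return True if every word has as many gloss labels as morphemes."""
        seg_words = self.segmentation.split()
        gloss_words = self.gloss.split()
        if len(seg_words) != len(gloss_words):
            return False
        return all(
            len(s.split("-")) == len(g.split("-"))
            for s, g in zip(seg_words, gloss_words)
        )


# Illustrative Spanish example, chosen for readability; it is not taken
# from the Fieldwork corpus.
record = IGTRecord(
    transcription="los gatos duermen",
    segmentation="los gato-s duerm-en",
    gloss="DET.PL cat-PL sleep-3PL",
    translation="The cats sleep.",
)
print(record.aligned())
```

In a Wav2Gloss setting, the audio would be the input and the four string fields the prediction targets; a check like `aligned()` is one simple way to catch segmentation and gloss tiers that have drifted out of step.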