MSNER: A Multilingual Speech Dataset for Named Entity Recognition (2405.11519v1)
Abstract: While extensively explored in text-based tasks, Named Entity Recognition (NER) remains largely neglected in spoken language understanding. Existing resources are limited to a single, English-only dataset. This paper addresses this gap by introducing MSNER, a freely available, multilingual speech corpus annotated with named entities. It provides annotations for the VoxPopuli dataset in four languages (Dutch, French, German, and Spanish). We also release an efficient annotation tool that leverages automatic pre-annotations for faster manual refinement. The resulting corpus comprises 590 and 15 hours of silver-annotated speech for training and validation, respectively, alongside a 17-hour, manually annotated evaluation set. We further provide an analysis comparing silver and gold annotations. Finally, we present baseline NER models to stimulate further research on this newly available dataset.