CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models (2404.02408v1)

Published 3 Apr 2024 in cs.CL

Abstract: Effectively using NLP tools in under-resourced languages requires a thorough understanding of the language itself, familiarity with the latest models and training methodologies, and technical expertise to deploy these models. These requirements can pose a significant obstacle for language community members and linguists who wish to use NLP tools. This paper introduces the CMU Linguistic Annotation Backend (CMULAB), an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages, even with limited training data. We describe the tools and APIs that are currently available and how developers can easily add new models and functionality to the framework. Code is available at https://github.com/neulab/cmulab, along with a live demo at https://cmulab.dev.
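To make the client-server workflow concrete, the sketch below shows how a client might submit an audio file to a CMULAB-style annotation backend over HTTP and retrieve predictions for human review. This is a minimal illustration under assumed names: the endpoint path, parameter names, and response shape are hypothetical placeholders for this example, not CMULAB's documented API (see the GitHub repository above for the actual interface).

import requests

BASE_URL = "https://cmulab.dev"  # live demo instance mentioned in the paper

def request_annotations(audio_path, lang):
    # Send an audio file for automatic annotation.
    # NOTE: "/annotator/segment" and the "lang" field are hypothetical
    # placeholders, not CMULAB's documented endpoint or parameter names.
    with open(audio_path, "rb") as f:
        resp = requests.post(
            BASE_URL + "/annotator/segment",
            files={"file": f},
            data={"lang": lang},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # A linguist would inspect these predictions, correct errors, and
    # submit the corrections back to the server, providing new training
    # data for the human-in-the-loop fine-tuning loop described above.
    annotations = request_annotations("utterance.wav", lang="epo")
    print(annotations)

In the intended human-in-the-loop workflow, the corrected annotations returned by the user become fine-tuning data, so each round of review incrementally improves the deployed model for that language.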
