
Novi jezički modeli za srpski jezik (New Language Models for the Serbian Language) (2402.14379v2)

Published 22 Feb 2024 in cs.CL

Abstract: The paper briefly presents the development history of transformer-based LLMs for the Serbian language. It also presents several new models for text generation and vectorization, trained on the resources of the Society for Language Resources and Technologies. Ten selected vectorization models for Serbian, including two new ones, are compared on four natural language processing tasks. The paper analyzes which models perform best on each selected task, how model size and training-set size affect performance on those tasks, and what the optimal setup is for training the best LLMs for the Serbian language.
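
The vectorization comparison the abstract describes boils down to producing sentence embeddings with each candidate model and scoring them on downstream tasks. Below is a minimal sketch of that embedding step, assuming the Hugging Face `transformers` API; the model id `classla/bcms-bertic` (BERTić, an encoder covering Bosnian, Croatian, Montenegrin and Serbian) is an assumption for illustration, not necessarily one of the paper's ten evaluated models.

```python
# Sketch: embed Serbian sentences with a pretrained transformer encoder
# and compare them by cosine similarity. Assumes `transformers` and `torch`
# are installed; the model id below is an assumption, swap in any Serbian-
# capable encoder.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "classla/bcms-bertic"  # assumed model id for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sentences):
    """Mean-pool the last hidden states into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)        # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Serbian is written in both Latin and Cyrillic; a good vectorizer should
# place the same sentence in both scripts close together.
vecs = embed(["Ovo je rečenica na srpskom jeziku.",
              "Ово је реченица на српском језику."])
similarity = torch.nn.functional.cosine_similarity(vecs[0], vecs[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```

In a comparison like the paper's, the same embedding routine would be run for every candidate model and the resulting vectors fed to each task's evaluation, so that only the vectorizer varies between runs.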
