Novi jezički modeli za srpski jezik [New Language Models for the Serbian Language] (2402.14379v2)
Abstract: The paper will briefly present the development history of transformer-based LLMs for the Serbian language. Several new models for text generation and vectorization, trained on the resources of the Society for Language Resources and Technologies, will also be presented. Ten selected vectorization models for Serbian, including two new ones, will be compared on four natural language processing tasks. The paper will analyze which models are best for each selected task, how their size and the size of their training sets affect performance on those tasks, and what the optimal setting is for training the best LLMs for the Serbian language.
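The abstract describes comparing vectorization (sentence-embedding) models on downstream tasks. As a minimal illustration of that kind of comparison, and not the paper's actual evaluation code, the sketch below scores one Serbian sentence pair for semantic similarity under two candidate embedding models using the sentence-transformers library; the model IDs are examples (BERTić appears in the reference list below, the second is a generic multilingual baseline) and are not necessarily among the ten models the paper evaluates.

```python
# Hedged sketch: probing candidate Serbian vectorization models on a toy
# semantic-similarity pair. Model IDs are illustrative placeholders only.
from sentence_transformers import SentenceTransformer, util

MODEL_IDS = [
    "classla/bcms-bertic",  # BERTić (cited below); loaded with default mean pooling
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",  # multilingual baseline
]

# A toy Serbian sentence pair that a good model should rate as similar.
sent_a = "Novi jezički modeli za srpski jezik postižu dobre rezultate."
sent_b = "Najnoviji modeli za srpski jezik daju odlične rezultate."

for model_id in MODEL_IDS:
    model = SentenceTransformer(model_id)
    emb_a, emb_b = model.encode([sent_a, sent_b], convert_to_tensor=True)
    score = util.cos_sim(emb_a, emb_b).item()  # cosine similarity in [-1, 1]
    print(f"{model_id}: cosine similarity = {score:.3f}")
```

A real comparison would aggregate such scores over a labeled benchmark rather than a single pair, which is how task-level rankings like the paper's are typically produced.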
- ‘‘The Leipzig Corpora Collection: Monolingual corpora of standard size’’ In Proceedings of Corpus Linguistics 2007, 2007
- Miloš Bogdanović, Jelena Kocić and Leonid Stoimenov ‘‘SRBerta-A Transformer Language Model for Serbian Cyrillic Legal Texts’’ In Information 15.2, 2024 DOI: 10.3390/info15020074
- ‘‘ELECTRA: Pre-training text encoders as discriminators rather than generators’’ In arXiv preprint arXiv:2003.10555, 2020
- ‘‘Unsupervised cross-lingual representation learning at scale’’ In arXiv preprint arXiv:1911.02116, 2019
- Andrija Cvejić ‘‘Prepoznavanje imenovanih entiteta u srpskom jeziku pomoću transformer arhitekture’’ [Named-entity recognition in Serbian using the transformer architecture] In Zbornik radova Fakulteta tehničkih nauka u Novom Sadu 37.02, 2022, pp. 310–315
- ‘‘Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian’’ In 2023 31st Telecommunications Forum (TELFOR), 2023, pp. 1–4 IEEE
- ‘‘BERT: Pre-training of deep bidirectional transformers for language understanding’’ In arXiv preprint arXiv:1810.04805, 2018
- ‘‘MACEDONIZER-The Macedonian Transformer Language Model’’ In International Conference on ICT Innovations, 2022, pp. 51–62 Springer
- ‘‘DeBERTa: Decoding-enhanced BERT with disentangled attention’’ In International Conference on Learning Representations, 2020
- ‘‘Mistral 7B’’ In arXiv preprint arXiv:2310.06825, 2023
- ‘‘Language Report Serbian’’ In European Language Equality: A Strategic Agenda for Digital Language Equality Cham: Springer International Publishing, 2023, pp. 203–206 DOI: 10.1007/978-3-031-28819-7_32
- Yann LeCun, Yoshua Bengio and Geoffrey Hinton ‘‘Deep learning’’ In Nature 521.7553 Nature Publishing Group UK London, 2015, pp. 436–444 DOI: 10.1038/nature14539
- ‘‘BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension’’ In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880
- ‘‘Textbooks are all you need ii: phi-1.5 technical report’’ In arXiv preprint arXiv:2309.05463, 2023
- ‘‘RoBERTa: A robustly optimized BERT pretraining approach’’ In arXiv preprint arXiv:1907.11692, 2019
- ‘‘{bs,hr,sr}WaC – web corpora of Bosnian, Croatian and Serbian’’ In Proceedings of the 9th Web as Corpus Workshop (WaC-9), 2014, pp. 29–35
- ‘‘BERTić–The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian’’ In arXiv preprint arXiv:2104.09243, 2021
- ‘‘Improving language understanding by generative pre-training’’ OpenAI, 2018
- ‘‘Language models are unsupervised multitask learners’’ OpenAI, 2019
- ‘‘Exploring the limits of transfer learning with a unified text-to-text transformer’’ In The Journal of Machine Learning Research 21.1 JMLR.org, 2020, pp. 5485–5551
- Pranav Rajpurkar, Robin Jia and Percy Liang ‘‘Know what you don’t know: Unanswerable questions for SQuAD’’ In arXiv preprint arXiv:1806.03822, 2018
- ‘‘Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks’’ In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992
- Mihailo Škorić ‘‘Композитне псеудограматике засноване на паралелним jезичким моделима српског jезика’’ [Composite pseudo-grammars based on parallel language models of the Serbian language] Doctoral dissertation, 2023
- Mihailo Škorić, Miloš Utvić and Ranka Stanković ‘‘Transformer-Based Composite Language Models for Text Evaluation and Classification’’ In Mathematics 11.22 MDPI, 2023, pp. 4660
- ‘‘Machine learning and deep neural network-based lemmatization and morphosyntactic tagging for serbian’’ In Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 3954–3962
- Pedro Javier Ortiz Suárez, Benoît Sagot and Laurent Romary ‘‘Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures’’ In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), 2019 Leibniz-Institut für Deutsche Sprache
- ‘‘Alpaca: A strong, replicable instruction-following model’’ Stanford Center for Research on Foundation Models, 2023 URL: https://crfm.stanford.edu/2023/03/13/alpaca.html
- ‘‘Serbian NER&Beyond: The archaic and the modern intertwined’’ In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021, pp. 1252–1260
- ‘‘Attention is all you need’’ In Advances in neural information processing systems 30, 2017
- ‘‘Tour du monde through the dictionaries’’ In Actes du 27ème Colloque International sur le Lexique et la Grammaire, 2008, pp. 249–256
- Philipp Wasserscheidt ‘‘Serbian Web Corpus PDRS 1.0’’ Slovenian language resource repository CLARIN.SI, 2023 URL: http://hdl.handle.net/11356/1752
- ‘‘CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data’’ In Proceedings of the 12th Language Resources and Evaluation Conference Marseille, France: European Language Resources Association, 2020, pp. 4003–4012 URL: https://www.aclweb.org/anthology/2020.lrec-1.494