Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian (2404.08617v1)
Abstract: In this paper, we focus on generating a synthetic question answering (QA) dataset using an adapted Translate-Align-Retrieve method. Using this method, we created the largest Serbian QA dataset to date, with more than 87K samples, which we name SQuAD-sr. To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset. We investigate the dataset's quality and use it to fine-tune several pre-trained QA models. The best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score on the benchmark XQuAD dataset, which we translated into Serbian for the purpose of evaluation. The results show that our model exceeds zero-shot baselines but fails to surpass human performance. We note the advantage of using a monolingual pre-trained model over a multilingual one, as well as the performance increase gained by using the Latin script over Cyrillic. Through additional analysis, we show that questions about numeric values or dates are more likely to be answered correctly than other types of questions. Finally, we conclude that SQuAD-sr is of sufficient quality for fine-tuning a Serbian QA model in the absence of a manually crafted and annotated dataset.
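The Exact Match and F1 figures above follow the standard SQuAD-style evaluation. As a rough illustration (not the authors' exact script; the official SQuAD normalization also strips English articles, which would not apply to Serbian, and ASCII punctuation handling is a simplification here), the two per-question metrics can be sketched as:

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))


def f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level EM and F1 are then averages of these per-question values over the evaluation set (taking the maximum over gold answers when several are available).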
- Attention is all you need. CoRR, abs/1706.03762, 2017.
- OpenAI. GPT-4 technical report, 2023.
- BERTić - the transformer language model for Bosnian, Croatian, Montenegrin and Serbian. CoRR, abs/2104.09243, 2021.
- Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116, 2019.
- Automatic Spanish translation of the SQuAD dataset for multilingual question answering. CoRR, abs/1912.05200, 2019.
- SQuAD: 100,000+ questions for machine comprehension of text, 2016.
- On the cross-lingual transferability of monolingual representations. CoRR, abs/1910.11856, 2019.
- NewsQA: A machine comprehension dataset, 2017.
- Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019.
- SubjQA: A dataset for subjectivity and review comprehension. CoRR, abs/2004.14283, 2020.
- SberQuAD - Russian reading comprehension dataset: Description and analysis. CoRR, abs/1912.09723, 2019.
- FQuAD: French question answering dataset. CoRR, abs/2002.06071, 2020.
- Know what you don’t know: Unanswerable questions for SQuAD, 2018.
- MLQA: evaluating cross-lingual extractive question answering. CoRR, abs/1910.07475, 2019.
- Neural Arabic question answering. CoRR, abs/1906.05394, 2019.
- Semi-supervised training data generation for multilingual question answering. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA).
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
- RoBERTa: A robustly optimized BERT pretraining approach, 2019.
- Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009.
- No language left behind: Scaling human-centered machine translation, 2022.
- Georges Labrèche. Cyrtranslit, March 2023. A Python package for bi-directional transliteration between Cyrillic and Latin scripts. Supports Bulgarian, Montenegrin, Macedonian, Mongolian, Russian, Serbian, Tajik, and Ukrainian.
- Efficient word alignment with Markov chain Monte Carlo. The Prague Bulletin of Mathematical Linguistics, 106, October 2016.
- A simple, fast, and effective reparameterization of IBM model 2. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages 644–648. The Association for Computational Linguistics, 2013.
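The word-alignment references above (eflomal and fast_align) correspond to the "Align" step of the Translate-Align-Retrieve method: after machine-translating a SQuAD context, word alignments are used to project the English answer span onto the translated Serbian text. A minimal sketch of that span projection, assuming alignments are given as (source index, target index) token pairs such as those parsed from the Pharaoh-format output of fast_align (the exact projection heuristics in the paper may differ):

```python
def project_span(alignment, src_start, src_end):
    """Map an inclusive source-token span onto the target sentence.

    `alignment` is a list of (src_idx, tgt_idx) pairs. Returns the
    minimal inclusive target span covering every target token aligned
    to the source span, or None if no source token in it is aligned.
    """
    targets = [t for s, t in alignment if src_start <= s <= src_end]
    if not targets:
        return None
    return min(targets), max(targets)


# Hypothetical alignment for a 4-token sentence pair with one word-order swap:
alignment = [(0, 0), (1, 2), (2, 1), (3, 3)]
print(project_span(alignment, 1, 2))  # (1, 2)
print(project_span(alignment, 3, 3))  # (3, 3)
```

Projected character offsets would then be recovered from the target tokenization, which is why alignment noise is a known source of label errors in synthetic QA datasets built this way.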
- Aleksa Cvetanović
- Predrag Tadić