- The paper presents Turna, a Turkish encoder-decoder model that employs the UL2 framework to strengthen both understanding and generation for Turkish, a low-resource language.
- It utilizes a 43-billion-token Turkish corpus spanning diverse domains to achieve superior performance in tasks like paraphrasing, summarization, NER, and NLI.
- The model outperforms multilingual baselines in both understanding and generation, and matches or exceeds the monolingual BERTurk on several understanding tasks, setting a reference point for inclusive NLP research in Turkish.
Turna: A Turkish Encoder-Decoder LLM for Enhanced Understanding and Generation
The paper introduces Turna, a large-scale encoder-decoder model pretrained for Turkish and a significant advancement for Turkish natural language processing. Turna addresses both natural language understanding (NLU) and generation (NLG) for this low-resource language, leveraging a corpus curated specifically for the purpose. In contrast to the many models that focus predominantly on English or other well-resourced languages, Turna responds to a growing need to reduce the disparity in NLP performance across languages.
Methodology and Data
Turna builds on the encoder-decoder architecture and adopts the Unifying Language Learning (UL2) framework. Pretraining uses the Mixture-of-Denoisers (MoD) objective, which combines several denoising tasks: regular and extreme span corruption together with sequential (prefix language modeling) denoising. This mixture gives the model the versatility to handle tasks that require understanding as well as generation.
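To make the MoD objective concrete, here is a minimal, illustrative sketch of T5-style span corruption, the mechanism behind the regular and extreme denoisers (they differ mainly in span length and corruption rate). The function name, sentinel format, and hyperparameters below are placeholders rather than Turna's actual configuration.

```python
import random

def corrupt_spans(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Illustrative T5-style span corruption (placeholder hyperparameters).

    Random token spans are replaced with sentinel markers in the encoder
    input; the decoder target lists each sentinel followed by the dropped
    tokens, which is the pattern the regular/extreme denoisers train on.
    """
    rng = random.Random(seed)
    n_spans = max(1, round(len(tokens) * corruption_rate / mean_span_len))
    starts = sorted(rng.sample(range(len(tokens) - mean_span_len), n_spans))

    inputs, targets = [], []
    i = sentinel = 0
    for start in starts:
        if start < i:                      # skip spans that would overlap
            continue
        inputs.extend(tokens[i:start])     # keep unmasked prefix
        inputs.append(f"<extra_id_{sentinel}>")
        targets.append(f"<extra_id_{sentinel}>")
        targets.extend(tokens[start:start + mean_span_len])
        sentinel += 1
        i = start + mean_span_len
    inputs.extend(tokens[i:])
    return inputs, targets

tokens = "Türkçe doğal dil işleme için büyük bir ön eğitim derlemi kullanıldı".split()
inp, tgt = corrupt_spans(tokens)
print("encoder input :", " ".join(inp))
print("decoder target:", " ".join(tgt))
```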
The authors assembled a diverse Turkish corpus of roughly 43 billion tokens comprising web content, scientific literature, books, creative writing, and parliamentary records. Its breadth across textual domains helps the model adapt to the different linguistic styles and registers found in Turkish.
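As an illustration of how such a heterogeneous corpus might be combined for pretraining, the sketch below interleaves documents from several domain sources according to sampling weights. The file names and weights are hypothetical and are not taken from the paper.

```python
import random
from itertools import islice

# Hypothetical domain shards and weights; the real corpus components differ.
SOURCES = {
    "web": "web_crawl.txt",
    "science": "scientific_abstracts.txt",
    "books": "books.txt",
    "parliament": "parliament_records.txt",
}
WEIGHTS = {"web": 0.6, "science": 0.15, "books": 0.15, "parliament": 0.1}

def stream_lines(path):
    """Yield one document (line) per step, looping so small sources are reused."""
    while True:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line

def mixed_corpus(seed=0):
    """Sample documents from each domain in proportion to its weight."""
    rng = random.Random(seed)
    streams = {name: stream_lines(path) for name, path in SOURCES.items()}
    names = list(WEIGHTS)
    probs = [WEIGHTS[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(streams[name])

# Preview a few sampled documents (requires the placeholder files to exist).
for domain, doc in islice(mixed_corpus(), 5):
    print(domain, doc[:60])
```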
Evaluation and Results
Turna was benchmarked against multilingual models such as mT5 and mBART, as well as the monolingual encoder-only model BERTurk. It was evaluated on a comprehensive suite of tasks: paraphrasing, summarization, title generation, named entity recognition (NER), part-of-speech (POS) tagging, semantic textual similarity (STS), natural language inference (NLI), and sentiment analysis.
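For readers who want to experiment with the released checkpoint, a minimal sketch using the Hugging Face `transformers` library is shown below. The repository name `boun-tabi-LMG/TURNA`, the prompt, and the generation settings are assumptions to be checked against the official model card, and downstream tasks generally require fine-tuning the pretrained checkpoint first.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The model identifier is an assumption; consult the released model card for
# the exact repository name and any task-specific prompt prefixes.
model_name = "boun-tabi-LMG/TURNA"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A hypothetical input sentence; without task fine-tuning the raw checkpoint
# only reflects its denoising pretraining objective.
text = "İstanbul Türkiye'nin en kalabalık şehridir."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```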
The model outperformed the multilingual baselines on paraphrasing and summarization while achieving competitive results on Turkish NLU tasks relative to BERTurk. Notably, Turna's encoder-only variant surpassed BERTurk on NLU tasks such as NER and NLI. Taken together, the evaluation covers a broad range of applications and demonstrates the model's reach in both understanding and generation.
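The encoder-only usage can be approximated with standard tooling. The hypothetical sketch below detaches the encoder from a sequence-to-sequence checkpoint, mean-pools its hidden states, and adds a linear classification head for a three-way NLI task; it illustrates the general recipe rather than the authors' exact fine-tuning setup, and the checkpoint name and pooling choice are assumptions.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class EncoderClassifier(nn.Module):
    """Generic NLU head: mean-pool encoder hidden states, then classify."""

    def __init__(self, model_name, num_labels):
        super().__init__()
        seq2seq = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.encoder = seq2seq.get_encoder()           # keep only the encoder stack
        self.head = nn.Linear(seq2seq.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state                 # (batch, seq, d_model)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # masked mean pooling
        return self.head(pooled)

# Hypothetical usage for a 3-way NLI task; the checkpoint name is an assumption.
tokenizer = AutoTokenizer.from_pretrained("boun-tabi-LMG/TURNA")
clf = EncoderClassifier("boun-tabi-LMG/TURNA", num_labels=3)
batch = tokenizer(["Kedi uyuyor.", "Kedi koşuyor."],
                  ["Hayvan dinleniyor.", "Hayvan dinleniyor."],
                  padding=True, return_tensors="pt")
logits = clf(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # torch.Size([2, 3])
```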
Implications and Future Directions
Turna's success signals a promising path for developing NLP resources for low-resource languages. It also points to the effectiveness of unified frameworks like UL2 for building models that serve more than a single facet of language processing. Moreover, its training on a heterogeneous combination of datasets underscores the value of diverse pretraining corpora, especially for languages with far less available data than English.
The methodology suggests several directions for future research, including further pretraining with larger token budgets, investigating other languages or language pairs, and expanding to additional downstream tasks. The release of Turna and its resources catalyzes further work and sets a reference point for future models aimed at less-resourced languages.
In an NLP landscape where linguistic inequity remains a challenge, Turna represents a step toward more inclusive and balanced models. The researchers' open-sourcing of the model and related datasets is a substantial contribution to the community, enabling further development and study in Turkish NLP.