- The paper presents Turna, a Turkish encoder-decoder model that employs the UL2 framework to strengthen both understanding and generation for Turkish, a low-resource language.
- It utilizes a 43-billion-token Turkish corpus spanning diverse domains to achieve superior performance in tasks like paraphrasing, summarization, NER, and NLI.
- The model outperforms multilingual baselines in both understanding and generation, and matches or exceeds the monolingual BERTurk on several understanding tasks, setting a reference point for inclusive NLP research in Turkish.
Turna: A Turkish Encoder-Decoder LLM for Enhanced Understanding and Generation
The paper introduces Turna, a large-scale encoder-decoder model pretrained for Turkish and a significant advancement for Turkish natural language processing. Turna addresses both natural language understanding (NLU) and generation (NLG) for this low-resource language, leveraging a corpus curated specifically for the purpose. In contrast to the many models that focus predominantly on English or other well-resourced languages, Turna responds to a growing need to reduce the disparity in NLP performance across languages.
Methodology and Data
Turna builds on the encoder-decoder architecture and adopts the Unifying Language Learning (UL2) framework. Pretraining uses the Mixture-of-Denoisers (MoD) objective, which combines several denoising tasks: regular and extreme span corruption together with sequential (prefix language modeling) denoising. This mixture gives the model the versatility to handle tasks that require understanding as well as generation.
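To make the MoD objective concrete, here is a minimal, illustrative sketch of T5-style span corruption, the mechanism behind the regular and extreme denoisers (they differ mainly in span length and corruption rate). The function name, sentinel format, and hyperparameters below are placeholders rather than Turna's actual configuration.

```python
import random

def corrupt_spans(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Illustrative T5-style span corruption (placeholder hyperparameters).

    Random token spans are replaced with sentinel markers in the encoder
    input; the decoder target lists each sentinel followed by the dropped
    tokens, which is the pattern the regular/extreme denoisers train on.
    """
    rng = random.Random(seed)
    n_spans = max(1, round(len(tokens) * corruption_rate / mean_span_len))
    starts = sorted(rng.sample(range(len(tokens) - mean_span_len), n_spans))

    inputs, targets = [], []
    i = sentinel = 0
    for start in starts:
        if start < i:                      # skip spans that would overlap
            continue
        inputs.extend(tokens[i:start])     # keep unmasked prefix
        inputs.append(f"<extra_id_{sentinel}>")
        targets.append(f"<extra_id_{sentinel}>")
        targets.extend(tokens[start:start + mean_span_len])
        sentinel += 1
        i = start + mean_span_len
    inputs.extend(tokens[i:])
    return inputs, targets

tokens = "Türkçe doğal dil işleme için büyük bir ön eğitim derlemi kullanıldı".split()
inp, tgt = corrupt_spans(tokens)
print("encoder input :", " ".join(inp))
print("decoder target:", " ".join(tgt))
```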
The authors assembled a diverse Turkish corpus of roughly 43 billion tokens comprising web content, scientific literature, books, creative writing, and parliamentary records. Its breadth across textual domains helps the model adapt to the different linguistic styles and registers found in Turkish.
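As an illustration of how such a heterogeneous corpus might be combined for pretraining, the sketch below interleaves documents from several domain sources according to sampling weights. The file names and weights are hypothetical and are not taken from the paper.

```python
import random
from itertools import islice

# Hypothetical domain shards and weights; the real corpus components differ.
SOURCES = {
    "web": "web_crawl.txt",
    "science": "scientific_abstracts.txt",
    "books": "books.txt",
    "parliament": "parliament_records.txt",
}
WEIGHTS = {"web": 0.6, "science": 0.15, "books": 0.15, "parliament": 0.1}

def stream_lines(path):
    """Yield one document (line) per step, looping so small sources are reused."""
    while True:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line

def mixed_corpus(seed=0):
    """Sample documents from each domain in proportion to its weight."""
    rng = random.Random(seed)
    streams = {name: stream_lines(path) for name, path in SOURCES.items()}
    names = list(WEIGHTS)
    probs = [WEIGHTS[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(streams[name])

# Preview a few sampled documents (requires the placeholder files to exist).
for domain, doc in islice(mixed_corpus(), 5):
    print(domain, doc[:60])
```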
Evaluation and Results
Turna was benchmarked against multilingual models such as mT5 and mBART, as well as the monolingual encoder-only model BERTurk. It was evaluated on a comprehensive suite of tasks: paraphrasing, summarization, title generation, named entity recognition (NER), part-of-speech (POS) tagging, semantic textual similarity (STS), natural language inference (NLI), and sentiment analysis.
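For readers who want to experiment with the released checkpoint, a minimal sketch using the Hugging Face `transformers` library is shown below. The repository name `boun-tabi-LMG/TURNA`, the prompt, and the generation settings are assumptions to be checked against the official model card, and downstream tasks generally require fine-tuning the pretrained checkpoint first.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The model identifier is an assumption; consult the released model card for
# the exact repository name and any task-specific prompt prefixes.
model_name = "boun-tabi-LMG/TURNA"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A hypothetical input sentence; without task fine-tuning the raw checkpoint
# only reflects its denoising pretraining objective.
text = "İstanbul Türkiye'nin en kalabalık şehridir."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```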
The model outperformed the multilingual baselines on paraphrasing and summarization while achieving competitive results on Turkish NLU tasks relative to BERTurk. Notably, Turna's encoder-only variant surpassed BERTurk on NLU tasks such as NER and NLI. Taken together, the evaluation covers a broad range of applications and demonstrates the model's reach in both understanding and generation.
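The encoder-only usage can be approximated with standard tooling. The hypothetical sketch below detaches the encoder from a sequence-to-sequence checkpoint, mean-pools its hidden states, and adds a linear classification head for a three-way NLI task; it illustrates the general recipe rather than the authors' exact fine-tuning setup, and the checkpoint name and pooling choice are assumptions.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class EncoderClassifier(nn.Module):
    """Generic NLU head: mean-pool encoder hidden states, then classify."""

    def __init__(self, model_name, num_labels):
        super().__init__()
        seq2seq = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.encoder = seq2seq.get_encoder()           # keep only the encoder stack
        self.head = nn.Linear(seq2seq.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state                 # (batch, seq, d_model)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # masked mean pooling
        return self.head(pooled)

# Hypothetical usage for a 3-way NLI task; the checkpoint name is an assumption.
tokenizer = AutoTokenizer.from_pretrained("boun-tabi-LMG/TURNA")
clf = EncoderClassifier("boun-tabi-LMG/TURNA", num_labels=3)
batch = tokenizer(["Kedi uyuyor.", "Kedi koşuyor."],
                  ["Hayvan dinleniyor.", "Hayvan dinleniyor."],
                  padding=True, return_tensors="pt")
logits = clf(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # torch.Size([2, 3])
```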
Implications and Future Directions
Turna's success signals a promising path for developing NLP resources for low-resource languages. It also points to the effectiveness of unified frameworks like UL2 for building models that serve more than a single facet of language processing. Moreover, its training on a heterogeneous combination of datasets underscores the value of diverse pretraining corpora, especially for languages with far less available data than English.
The methodology suggests several directions for future research, including further pretraining with larger token budgets, investigating other languages or language pairs, and expanding to additional downstream tasks. The release of Turna and its resources catalyzes further work and sets a reference point for future models aimed at less-resourced languages.
In an NLP landscape where linguistic inequity remains a challenge, Turna represents a step toward more inclusive and balanced models. The researchers' open-sourcing of the model and related datasets is a substantial contribution to the community, enabling further development and study in Turkish NLP.