- The paper introduces the first named entity recognition dataset (HisTR) and the first Universal Dependencies treebank (OTA-BOUN) for historical Turkish, foundational resources for natural language processing of the language.
- Experiments show that fine-tuned language-specific models like BERTurk outperform multilingual models on historical datasets, and combining modern and historical corpora improves parsing accuracy.
- Significant challenges remain in domain adaptation and in handling linguistic variation across historical periods, pointing to the need for larger and more representative datasets.
An Overview of "Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models"
This paper addresses a critical gap in computational linguistics by introducing foundational resources and models for natural language processing (NLP) of historical Turkish. Specifically, the authors develop datasets, annotate corpora, and build models to tackle the challenges of NLP tasks on historical varieties of Turkish, a subject that has been relatively underexplored. The research introduces several significant resources and model architectures tailored to parsing and understanding historical Turkish texts, focusing on named entity recognition (NER), dependency parsing, and part-of-speech (POS) tagging.
Key Contributions
- Datasets and Corpora: The paper presents the first named entity recognition dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for historical Turkish. Both resources are essential for training and evaluating NLP models in tasks that require a deep understanding of the language's historical structures and usage.
- NER Dataset - HisTR: This dataset consists of 812 manually annotated sentences containing 651 PERSON and 1,010 LOCATION entities. It covers texts from the 17th through the 20th centuries, capturing the linguistic richness and variation of historical Turkish across these periods.
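To make the PERSON/LOCATION annotation scheme concrete, the sketch below recovers entity spans from BIO-style token tags. The example sentence and its labels are invented for illustration; the paper does not specify HisTR's exact file format, which may differ.

```python
# Minimal BIO-span extraction sketch. The sentence and tags below are
# invented for illustration; HisTR's actual format may differ.
def extract_entities(tokens, tags):
    """Collect (text, label) spans from BIO tags."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

# Hypothetical transliterated sentence with PERSON and LOCATION spans.
tokens = ["Sultan", "Mehmed", "İstanbul'a", "geldi", "."]
tags = ["B-PERSON", "I-PERSON", "B-LOCATION", "O", "O"]
print(extract_entities(tokens, tags))
# → [('Sultan Mehmed', 'PERSON'), ("İstanbul'a", 'LOCATION')]
```

Span-level extraction like this is what underlies the entity-level F1 scores typically reported for NER benchmarks.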
- Dependency Treebank - OTA-BOUN: OTA-BOUN includes syntactic annotations that facilitate dependency parsing and POS tagging. This treebank is crucial in illuminating the structural intricacies of historical Turkish, which has evolved significantly from its Ottoman roots to the form used today.
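Universal Dependencies treebanks are distributed in the standard ten-column CoNLL-U format, so tooling built on OTA-BOUN would consume records like the fragment sketched below. The sentence here is invented; only the column layout follows the UD specification.

```python
# Parse a minimal CoNLL-U fragment into (id, form, upos, head, deprel).
# The two-token sentence is invented for illustration; the ten-column
# layout is the standard Universal Dependencies CoNLL-U format.
SAMPLE = """\
1\tpadişah\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tgeldi\t_\tVERB\t_\t_\t0\troot\t_\t_
"""

def parse_conllu(block):
    rows = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        rows.append((int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return rows

print(parse_conllu(SAMPLE))
# → [(1, 'padişah', 'NOUN', 2, 'nsubj'), (2, 'geldi', 'VERB', 0, 'root')]
```

A HEAD of 0 marks the sentence root; every other token points to the index of its syntactic head, which is exactly the structure a dependency parser is trained to predict.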
- Ottoman Text Corpus (OTC): The OTC is a clean, transliterated corpus spanning the 15th to 20th centuries, providing a foundational text resource for various linguistic purposes, including model training and language modeling.
- Modeling Approaches: Transformer-based models, including BERTurk and mBERT, are fine-tuned on these new datasets. The experiments show promising results, underscoring the effectiveness of adapting pre-trained models to historical language tasks.
Experimental Insights
The paper's experimental evaluations show that BERTurk, pre-trained on modern Turkish and then fine-tuned on historical datasets like HisTR and OTA-BOUN, outperforms generic multilingual models such as mBERT. This suggests that language-specific models, pre-trained on closely related linguistic data, retain contextual knowledge that transfers to historical texts. Furthermore, the dependency parsing results show that combining modern and historical corpora during training can significantly improve parsing accuracy, as demonstrated by notable gains in labeled attachment scores.
Challenges and Implications
The challenges faced in this research include domain adaptation and significant variation in language use across historical periods, which hinder model generalization and robustness. The authors acknowledge that, despite the positive results, the models struggle with out-of-domain data, an issue particularly prominent on the *Ruznamçe* test set. This underscores the need for more specialized datasets and further development of domain adaptation techniques.
Future Directions
The research paves the way for much-needed advances in the computational analysis of historical Turkish. The authors outline as critical future directions the expansion of these datasets and resources, along with the development of more sophisticated models covering broader spans of the language's historical evolution. Additionally, balancing the Ottoman Text Corpus's coverage across periods could enable more nuanced models that align more closely with linguistic shifts throughout history.
Overall, this paper establishes crucial benchmarks and foundational resources while inviting further exploration and innovation in the NLP of low-resource historical languages, contributing substantially to the field of digital humanities and providing a window into the linguistic past of Turkish.