One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks (2409.13920v1)

Published 20 Sep 2024 in cs.CL and cs.LG

Abstract: Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained LLM, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained LLMs can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.

Summary

  • The paper demonstrates that the unified ByT5-Sanskrit model substantially outperforms prior baselines in key NLP tasks like word segmentation (up by 8.8 points), dependency parsing, and OCR post-correction.
  • It employs a novel pretrain-fine-tune strategy on a 6.5 billion token Sanskrit corpus with multitask integration, optimizing context utilization through distinct prefix tokens for each task.
  • The model's success in Sanskrit suggests broader applicability to other morphologically rich languages, potentially advancing computational linguistics across diverse language families.

Insights into ByT5-Sanskrit: A Unified Model for Sanskrit NLP Tasks

The paper "One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks" addresses the significant challenges associated with processing morphologically rich languages (MRLs) and proposes a solution specifically tailored for Sanskrit using a byte-level pretrained LLM, ByT5-Sanskrit. This model achieves substantial improvements in several downstream tasks, indicating a robust methodology applicable to other MRLs as well.

Key Achievements

ByT5-Sanskrit demonstrates marked advancements in multiple NLP tasks for Sanskrit, including:

  1. Sanskrit Word Segmentation (SWS): The model outperforms prior data-driven baselines such as rcNN-SS and matches the performance of the lexicon-driven TransLIST model. On the Hackathon SWS benchmark, for instance, ByT5-Sanskrit surpasses the previous best score by 8.8 points.
  2. Vedic Sanskrit Dependency Parsing: The model shows improved Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS) by 2.18 and 2.60 points, respectively, compared to existing best-performing models.
  3. OCR Post-correction: ByT5-Sanskrit sets a new standard in OCR post-correction, with a Character Error Rate (CER) 0.29 points lower and a Word Error Rate (WER) 3.16 points lower than the best prior results.
  4. Multitask Dataset Integration: The researchers introduce a novel dataset based on the Digital Corpus of Sanskrit (DCS) for joint training of word segmentation, lemmatization, and morphosyntactic tagging. The model fine-tuned on this dataset notably improves overall task performance; an illustrative joint-annotation record is sketched just after this list.
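To make the joint setup concrete, the following is an illustrative record for a short sentence, showing the three kinds of targets a multitask model would have to produce. This example is not taken from the paper; the concrete DCS field names and morphological tag inventory are assumptions for the sketch.

```python
# Illustrative joint-annotation record for one sentence; the concrete DCS
# field names and morphological tag set are assumptions for this sketch.
record = {
    # Surface text with sandhi applied, as it appears in the corpus.
    "input": "rāmo vanaṃ gacchati",
    # Word segmentation target: sandhi resolved into separate word forms.
    "segmentation": "rāmaḥ vanam gacchati",
    # Lemmatization target: one lemma per segmented word.
    "lemmas": ["rāma", "vana", "gam"],
    # Morphosyntactic tagging target: one tag bundle per segmented word.
    "morph_tags": ["Nom.Sg.m", "Acc.Sg.n", "3.Sg.Pres"],
}
```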

Technical Approach

ByT5-Sanskrit builds on the ByT5 architecture, and the paper follows a pretrain-fine-tune paradigm. The model is pretrained on a large corpus of Sanskrit text in IAST transliteration drawn from sources such as the Sangraha dataset, the GRETIL collection, and the Digital Sanskrit Buddhist Canon, amounting to approximately 6.5 billion tokens.
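To make the byte-level design concrete, here is a minimal sketch (not from the paper) of how IAST-transliterated Sanskrit is encoded by a ByT5-style tokenizer: every UTF-8 byte becomes one token, so diacritic-heavy characters such as "ṣ" simply expand into several byte tokens instead of falling outside a fixed subword vocabulary. The public google/byt5-small checkpoint is used purely for illustration; the ByT5-Sanskrit checkpoint itself is not assumed here.

```python
# Minimal illustration of byte-level tokenization for IAST Sanskrit.
# Uses the public google/byt5-small tokenizer; the actual ByT5-Sanskrit
# checkpoint name is not assumed here.
from transformers import AutoTokenizer

text = "dharmakṣetre kurukṣetre"  # IAST transliteration with diacritics

# Plain UTF-8 view: "ṣ" (U+1E63) occupies three bytes, "e" one byte.
raw_bytes = text.encode("utf-8")
print(len(text), "characters ->", len(raw_bytes), "bytes")

# ByT5 tokenizers operate directly on these bytes (plus a small offset
# reserved for special tokens), so rare diacritics and sandhi-merged
# word forms never become out-of-vocabulary.
tok = AutoTokenizer.from_pretrained("google/byt5-small")
ids = tok(text).input_ids
print(len(ids), "byte-level tokens (incl. end-of-sequence)")
```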

A central contribution is the joint task formulation, in which SWS, lemmatization, and morphosyntactic tagging are combined into a single multitask setup. The input sequence for each task is distinguished by a prefix token (e.g., "S", "L", and "M"). Beyond being efficient, this setup lets the model exploit wider context, as demonstrated by improved performance in pseudo-paragraph-level evaluation compared to sentence-level evaluation; a minimal inference sketch follows.
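The snippet below sketches how such prefix-based multitask inference could look with a seq2seq byte-level model. The prefix letters follow the paper's description, but the checkpoint name, output formatting, and generation settings are assumptions for illustration, not the released model's exact interface.

```python
# Hypothetical sketch of prefix-based multitask inference with a
# fine-tuned ByT5-style model. "your-org/byt5-sanskrit-multitask" is a
# placeholder, not the actual released checkpoint name.
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL_NAME = "your-org/byt5-sanskrit-multitask"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

sentence = "rāmo vanaṃ gacchati"

# One prefix token per task, prepended to the same input sentence:
#   S -> word segmentation, L -> lemmatization, M -> morphosyntactic tagging
for prefix, task in [("S", "segmentation"), ("L", "lemmatization"), ("M", "morph tagging")]:
    inputs = tok(f"{prefix} {sentence}", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
    print(task, "->", tok.batch_decode(out, skip_special_tokens=True)[0])
```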

Evaluation Metrics

  1. Sanskrit Word Segmentation (SWS): The primary metric is sentence-level perfect match (PM). ByT5-Sanskrit outperformed prior state-of-the-art models across datasets, notably reaching 94.29 PM on the Hackathon dataset against the previous best of 85.47.
  2. Dependency Parsing: Evaluated using UAS and LAS, ByT5-Sanskrit showed substantial improvement with scores of 86.54 (UAS) and 81.54 (LAS) in the "None" setting with no additional linguistic information, highlighting its inherent robustness.
  3. OCR Post-correction: ByT5-Sanskrit achieved a CER of 2.69 and a WER of 20.03, significantly outperforming the ByT5-small baseline (a minimal sketch of the PM, CER, and WER metrics follows this list).
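For reference, the following is a minimal, self-contained sketch of the three reported metric families under their standard edit-distance definitions: sentence-level perfect match (PM), character error rate (CER), and word error rate (WER). The paper's own evaluation scripts may apply additional normalization, so this is an illustration rather than the exact scoring code.

```python
# Minimal reference implementations of PM, CER, and WER under the usual
# edit-distance definitions; the paper's own scoring scripts may differ
# in normalization details.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def perfect_match(refs, hyps):
    """Fraction of sentences predicted exactly (sentence-level PM)."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

def cer(refs, hyps):
    """Character-level edit distance divided by total reference length."""
    return sum(edit_distance(r, h) for r, h in zip(refs, hyps)) / sum(len(r) for r in refs)

def wer(refs, hyps):
    """Word-level edit distance divided by total reference word count."""
    return sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps)) / sum(
        len(r.split()) for r in refs
    )

# Toy example: one substituted character in one word.
refs = ["rāmaḥ vanam gacchati"]
hyps = ["rāmaḥ vanaṃ gacchati"]
print(perfect_match(refs, hyps), round(cer(refs, hyps), 3), round(wer(refs, hyps), 3))
```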

Broader Implications and Future Directions

ByT5-Sanskrit exemplifies a scalable and adaptable model for MRLs. The comparative performance on additional languages such as Turkish, Romanian, and Bulgarian underscores the model’s broader applicability beyond Sanskrit. Specifically, it surpasses previous baselines in terms of lemmatization accuracy for two of these languages and in terms of LAS for all three languages.

Limitations and Further Work

Despite the robust performance, ByT5-Sanskrit has limitations in handling homonymy, a prevalent issue in lemmatization where a single surface form maps to several distinct lemmata. The researchers propose addressing this in future work by tagging lemmata with numeric affixes. Moreover, the bias of the test data towards Vedic texts calls for further evaluation to ensure broader generalizability.

The pretrain-fine-tune paradigm coupled with byte-level models offers a promising direction for MRLs, potentially marking a pivotal evolution in NLP practices for these languages. Future developments might explore integrating more sophisticated disambiguation techniques and enlarging training datasets to encapsulate a wider linguistic variety, further consolidating the model’s efficacy.

In conclusion, ByT5-Sanskrit presents a robust, efficient method for processing Sanskrit, contributing valuable insights and methods that hold potential applicability for a wide range of MRLs, thus promoting further advancements in computational linguistics.
