- The paper introduces NSina as the largest and most diverse Sinhala news corpus compiled from over 500,000 articles gathered across ten prominent news websites.
- It details three NLP tasks—news media identification, category prediction, and headline generation—to benchmark the performance of language models on Sinhala text.
- Results indicate robust classification capabilities with transformer models while exposing significant challenges in generating coherent Sinhala headlines, highlighting areas for future research.
NSina: Unveiling a Large News Corpus for the Sinhala Language
Introduction
The field of NLP has seen a significant shift with the advent of LLMs, which have pushed the boundaries of what is achievable in text processing and generation. Despite their success, LLMs' performance remains closely tied to the availability and quality of pre-training resources, a constraint that is most acute for low-resource languages such as Sinhala. Sinhala, spoken by over 17 million people in Sri Lanka, has limited resources for NLP development, making it difficult to leverage the full potential of LLMs for the language. To address this gap, this paper introduces "NSina," a comprehensive news corpus for Sinhala, together with three NLP tasks aimed at easing and advancing the application of LLMs to Sinhala.
Dataset Construction
The NSina corpus was assembled from ten popular Sri Lankan news websites, balanced across sources to capture a wide range of perspectives. The final corpus contains 506,932 news articles, cleaned to exclude content with fewer than ten Sinhala words. This process resulted in a dataset significantly larger and more diverse than its predecessors, with over 1 million tokens, including more than 100,000 unique tokens. This makes NSina not only the largest but also the most up-to-date corpus available for the Sinhala language, offering a rich resource for NLP research and applications.
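The cleaning step described above (discarding articles with fewer than ten Sinhala words) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the use of the Sinhala Unicode block (U+0D80–U+0DFF) to detect Sinhala words and the function names are assumptions.

```python
import re

# A token counts as a Sinhala word if it contains at least one character
# from the Sinhala Unicode block (U+0D80-U+0DFF). This heuristic is an
# assumption for illustration, not the paper's exact implementation.
SINHALA_CHAR = re.compile(r"[\u0D80-\u0DFF]")

def count_sinhala_words(text: str) -> int:
    """Count whitespace-separated tokens that contain Sinhala characters."""
    return sum(1 for word in text.split() if SINHALA_CHAR.search(word))

def keep_article(text: str, min_words: int = 10) -> bool:
    """Keep only articles with at least `min_words` Sinhala words."""
    return count_sinhala_words(text) >= min_words
```

A filter like this drops navigation fragments, ads, and mostly-English pages while retaining genuine Sinhala news text.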
Benchmarked NLP Tasks
To demonstrate NSina's utility and provide a baseline for future Sinhala NLP development, the paper outlines three distinct NLP tasks: news media identification, news category prediction, and news headline generation. Each task is designed to benchmark the LLMs' performance and provide insights into their capabilities and limitations when handling Sinhala text.
- News Media Identification: This text classification task focuses on identifying the source of a news article based on its content. Given that each source has a unique style, this task not only evaluates models' understanding of textual styles but also serves as a tool for studying political biases within Sri Lankan news media.
- News Category Prediction: Another classification task that organizes news content into predefined categories such as local, international, sports, and business news. The task tests models' ability to understand and categorize content accurately.
- News Headline Generation: This text generation task requires models to produce headlines for given news articles. It evaluates the models' capabilities in generating coherent and contextually relevant Sinhala text, providing insights into the challenges of Sinhala language generation.
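Generated headlines are commonly scored with overlap metrics such as ROUGE. The snippet below is a minimal ROUGE-1 F1 sketch for illustration only; it is not necessarily the evaluation setup used in the paper.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a generated headline."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each word is matched at most as many times as it
    # appears in both the reference and the candidate.
    overlap = sum(min(ref_counts[w], cand_counts[w]) for w in ref_counts)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Whitespace tokenization is a simplification here; a production evaluation would tokenize Sinhala text more carefully.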
Evaluation and Insights
The evaluation across these tasks revealed that while transformer models such as XLM-R and SinBERT demonstrate strong text classification capabilities, achieving high F1 scores, performance on the generation task lags well behind. Notably, even the dedicated text generation transformers, mBART and mT5, struggled to produce good headlines, highlighting a crucial area for improvement in Sinhala language modelling. This discrepancy underscores the need for continued research and development of Sinhala-specific models and benchmarks.
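The F1 scores reported for the classification tasks can be macro-averaged over classes; the following is a minimal sketch of that computation (illustrative, not the authors' evaluation code), equivalent in spirit to scikit-learn's `f1_score(..., average="macro")`.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over all labels seen in y_true or y_pred."""
    tp = defaultdict(int)  # true positives per label
    fp = defaultdict(int)  # false positives per label
    fn = defaultdict(int)  # false negatives per label
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1s.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging weights every category equally, which matters for news data where classes such as sports and business are often imbalanced.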
Conclusion and Future Directions
NSina significantly contributes to Sinhala NLP by offering the largest and most comprehensive news corpus to date. Alongside the benchmark tasks, NSina sets the stage for future advancements in Sinhala language processing. The paper underscores the potential of NSina to foster improved model training and benchmarking, although it also highlights the challenges that lie ahead, especially in language generation tasks.
As the field moves forward, the research community is encouraged to build upon NSina, leveraging its resources to develop robust models tailored for Sinhala. Additionally, establishing a GLUE-like benchmark for Sinhala, as the paper suggests, would be a pivotal step towards standardizing model evaluation and promoting further innovation in Sinhala NLP.