
Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages (2103.00854v3)

Published 1 Mar 2021 in cs.CL

Abstract: While there has been significant progress towards developing NLU resources for Indic languages, syntactic evaluation has been relatively less explored. Unlike English, Indic languages have rich morphosyntax, grammatical genders, free linear word-order, and highly inflectional morphology. In this paper, we introduce Vyākarana: a benchmark of Colorless Green sentences in Indic languages for syntactic evaluation of multilingual LLMs. The benchmark comprises four syntax-related tasks: PoS Tagging, Syntax Tree-depth Prediction, Grammatical Case Marking, and Subject-Verb Agreement. We use the datasets from the evaluation tasks to probe five multilingual LLMs of varying architectures for syntax in Indic languages. Due to its prevalence, we also include a code-switching setting in our experiments. Our results show that the token-level and sentence-level representations from the Indic LLMs (IndicBERT and MuRIL) do not capture the syntax in Indic languages as efficiently as the other highly multilingual LLMs. Further, our layer-wise probing experiments reveal that while mBERT, DistilmBERT, and XLM-R localize the syntax in middle layers, the Indic LLMs do not show such syntactic localization.

Vyākarana: A Benchmark for Syntactic Evaluation in Indic Languages

The paper “Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages” addresses a gap in syntactic evaluation resources for Indic languages within natural language understanding (NLU). Its focus is a syntactic evaluation benchmark tailored to Indic languages, built from a collection of Colorless Green sentences. Because Indic languages exhibit complex linguistic properties such as rich morphosyntax, grammatical genders, and relatively free word order, they present challenges that are not well captured by multilingual LLMs developed primarily around English and other high-resource languages.

Key Contributions

  1. Vyākarana Benchmark: The primary contribution of the paper is the introduction of the Vyākarana benchmark, specifically targeting syntactic evaluation through four linguistically rich, syntax-related tasks:
    • Part of Speech (PoS) Tagging
    • Syntax Tree-Depth Prediction
    • Grammatical Case Marking
    • Subject-Verb Agreement
  2. Colorless Green Sentences: The benchmark leverages “Colorless Green” sentences, which are syntactically valid but semantically nonsensical (in the spirit of Chomsky's “Colorless green ideas sleep furiously”), ensuring that model evaluations focus purely on syntactic comprehension rather than semantic cues.
  3. Multilingual Context: The evaluation includes both monolingual (Indic languages) and code-switched datasets (Indic languages mixed with English). This recognizes the prevalence of code-switching in South Asian linguistic communities, adding another layer of complexity and realism to the evaluations.
  4. Model Evaluation: Five multilingual LLMs were probed: the Indic-focused IndicBERT and MuRIL, alongside the more broadly multilingual mBERT, DistilmBERT, and XLM-R. The evaluation focused on how well their representations capture syntactic structure in Indic languages (a minimal probing sketch follows this list).
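
To make the probing setup concrete, below is a minimal sketch of how one of these tasks (Syntax Tree-depth Prediction) could be framed as a layer-wise probe over a frozen multilingual encoder. It assumes the Hugging Face transformers and scikit-learn libraries; the sentences, tree-depth labels, and layer choice are illustrative placeholders, not the paper's actual data or protocol.

```python
# Illustrative layer-wise probing sketch (not the paper's exact setup).
# Assumes: torch, transformers, and scikit-learn are installed.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

MODEL = "bert-base-multilingual-cased"  # mBERT, one of the five probed models
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

# Placeholder "Colorless Green"-style sentences with made-up tree depths.
sentences = ["<colorless-green sentence 1>", "<colorless-green sentence 2>",
             "<colorless-green sentence 3>", "<colorless-green sentence 4>"]
tree_depths = [3, 4, 3, 5]  # hypothetical gold syntax-tree depths

def sentence_embedding(text: str, layer: int) -> torch.Tensor:
    """Mean-pool the hidden states of one encoder layer into a sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.hidden_states[layer]  # shape: (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)   # shape: (hidden_dim,)

# Freeze the encoder and fit a simple linear probe on a single layer's output.
layer = 7  # an arbitrary middle layer, chosen only for illustration
X = torch.stack([sentence_embedding(s, layer) for s in sentences]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, tree_depths)
print("Probe accuracy (training data):", probe.score(X, tree_depths))
```

Repeating such a fit at every layer of each model yields accuracy-per-layer curves, which is the kind of layer-wise evidence behind the localization findings discussed below.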

Findings

The paper reveals several key insights about the performance of existing models on syntactic tasks in Indic languages:

  • Syntactic Localization Deficiency: Unlike mBERT, XLM-R, and DistilmBERT, which concentrate syntactic information in their middle layers, the Indic-specific models (IndicBERT and MuRIL) show no clear layer-wise localization of syntax (see the illustrative snippet after this list).
  • Performance Limitations: Despite being trained on Indic texts, IndicBERT and MuRIL underperformed the more highly multilingual models, pointing to a need for architectures or training regimens that better capture Indic syntactic properties.
  • Code-switch Impact: All models struggled in the code-switched setting, underscoring the need for further research and training on code-switched corpora to make models robust to this common linguistic phenomenon.
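
As a purely illustrative follow-up to the probing sketch above, the snippet below shows one simple way “localization in middle layers” can be quantified once per-layer probe accuracies are available; the numbers are made up and do not come from the paper.

```python
# Illustrative only: made-up per-layer probe accuracies, not results from the paper.
import numpy as np

# Hypothetical accuracy of a syntax probe at each layer of a 12-layer encoder.
layer_accuracy = np.array([0.41, 0.48, 0.55, 0.63, 0.70, 0.74,
                           0.76, 0.75, 0.71, 0.66, 0.60, 0.54])

peak_layer = int(np.argmax(layer_accuracy)) + 1  # 1-indexed layer number
spread = float(layer_accuracy.max() - layer_accuracy.min())

print(f"Probe accuracy peaks at layer {peak_layer} ({layer_accuracy.max():.2f})")
print(f"Peak-to-trough spread across layers: {spread:.2f}")

# A pronounced mid-layer peak matches the behaviour reported for mBERT,
# DistilmBERT, and XLM-R; a flat curve (small spread, no clear peak) matches
# the lack of syntactic localization reported for IndicBERT and MuRIL.
```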

Implications and Future Directions

This research contributes to a greater understanding of how current LLMs perform on syntactically complex Indic languages. It underscores the necessity for more specialized approaches in training models that can handle the unique syntactic and morphosyntactic challenges presented by these languages.

Future work should explore alternative training methodologies or architectural modifications that can enhance the capture of Indic-language-specific syntactic nuances. Furthermore, expanding this benchmark to cover more languages could provide a richer dataset for refining Indic NLP tools.

Overall, this paper lays a foundational groundwork for more syntactically focused evaluation and model development in multilingual and code-switched language contexts, crucial for advancing natural language processing technologies in linguistically diverse regions like South Asia.

Authors (6)
  1. Rajaswa Patil (8 papers)
  2. Jasleen Dhillon (1 paper)
  3. Siddhant Mahurkar (2 papers)
  4. Saumitra Kulkarni (4 papers)
  5. Manav Malhotra (1 paper)
  6. Veeky Baths (14 papers)
Citations (1)