
Vedic Sanskrit Dependency Parsing

Updated 13 February 2026
  • Vedic Sanskrit Dependency Parsing is defined as the computational technique for recovering syntactic head–dependent relations in morphologically rich, free-word-order texts.
  • Neural models such as graph-based biaffine parsers and byte-level sequence-to-sequence Transformers achieve state-of-the-art UAS/LAS on annotated Vedic datasets.
  • Error analyses reveal persistent challenges with morphological ambiguity and non-projectivity, suggesting improvements via joint learning and data augmentation.

Vedic Sanskrit Dependency Parsing refers to the computational process of recovering syntactic head–dependent relations in Vedic Sanskrit texts, a low-resource, morphologically rich, and free-word-order subset of Sanskrit. Dependency parsing in this domain is exceptionally challenging due to complex inflectional morphology, frequent non-projective structures, and limited annotated resources. Recent advances leverage byte-level sequence-to-sequence modeling and neural graph-based systems to address these challenges, achieving new state-of-the-art results and uncovering architectural best practices and persistent limitations.

1. Linguistic and Computational Context

Vedic Sanskrit, as represented in canonical texts such as the Ṛgveda, manifests high inflectional variation and extensive use of sandhi, compounding, and free constituent ordering. Traditional dependency annotation schemes for Sanskrit, such as those in the Digital Corpus of Sanskrit (DCS), adopt gold standards for parts of speech, morphosyntax, and dependency labels tailored for these structures. Early computational approaches relied on linguistic heuristics or Paninian grammar formalisms. In contrast, contemporary methods are predominantly data-driven, exploiting advances in deep learning for morphologically rich languages (Krishna et al., 2020).

2. Data Resources and Annotation Schemes

The critical resource for Vedic Sanskrit dependency parsing is the DCS Vedic subsection, which provides 24,807 sentences annotated with gold POS, morphological features, and dependency relations. The standard experimental split is 90% training, 5% development, and 5% test, with Rigvedic sentences exclusively reserved for training to prevent genre leakage (Nehrdich et al., 2024). Annotation follows dependency grammar conventions, with linearized head-dependent encoding for each token.
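A minimal sketch of this split protocol in Python (the `is_rigvedic` flag and function name are illustrative assumptions, not the actual DCS schema):

```python
import random

def split_dcs_vedic(sentences, seed=0):
    """Illustrative 90/5/5 split that reserves all Rigvedic sentences
    for training, mirroring the protocol described above."""
    rng = random.Random(seed)
    rigvedic = [s for s in sentences if s["is_rigvedic"]]
    rest = [s for s in sentences if not s["is_rigvedic"]]
    rng.shuffle(rest)
    # Dev and test each take 5% of the full corpus, drawn from
    # non-Rigvedic material only, to prevent genre leakage.
    n_eval = len(sentences) // 20
    dev, test = rest[:n_eval], rest[n_eval:2 * n_eval]
    train = rigvedic + rest[2 * n_eval:]
    return train, dev, test
```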

Earlier works employed the Sanskrit Treebank Corpus (STBC), which includes 1,500 prose sentences in canonical anvaya order, along with verse–prose aligned test sets from classical poetry. These corpora focus on classical Sanskrit but highlight core challenges analogous to Vedic texts, such as free word order and morphological ambiguity (Krishna et al., 2020).

3. Model Architectures and Parsing Algorithms

Contemporary Vedic Sanskrit dependency parsing employs two major neural paradigms:

  • Graph-Based Parsers with Biaffine Scoring (BiAFF, DCST): These models represent parsing as global tree inference via biaffine classifiers atop deep BiLSTM encoders. Input embeddings concatenate word forms and morphological tags. DCST extends BiAFF with self-training: auxiliary encoders learn from silver-standard parses transformed into sequence tagging tasks, enhancing robustness (Krishna et al., 2020).
  • Byte-Level Sequence-to-Sequence Transformers (ByT5-Sanskrit): ByT5-Sanskrit uses an encoder–decoder Transformer architecture that treats input and output as byte sequences in IAST transliteration. Dependency parsing is defined as a sequence-generation task, with a special task prefix and output format:

\[ 1\#h_1\#r_1 \,|\, 2\#h_2\#r_2 \,|\, \cdots \,|\, n\#h_n\#r_n \]

where $h_i$ is the head index (0 = ROOT) and $r_i$ the dependency relation for token $i$. The model, pretrained on 6.5B IAST characters, is fine-tuned for dependency parsing by optimizing the maximum-likelihood cross-entropy over the serialized tree output. During training, random concatenation of up to four sentences lengthens the effective context window and exposes the model to broader structures (Nehrdich et al., 2024).
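The serialization scheme above can be sketched in a few lines of Python; the `#` field delimiter and `|` token separator follow the formula, while the function names are illustrative:

```python
def encode_tree(heads, rels):
    """Serialize a dependency tree in the i#head#relation format
    (head 0 = ROOT), one field group per token, joined by '|'."""
    return "|".join(f"{i}#{h}#{r}"
                    for i, (h, r) in enumerate(zip(heads, rels), start=1))

def decode_tree(output):
    """Parse a serialized model output back into a head-index list
    and a relation-label list."""
    heads, rels = [], []
    for field in output.split("|"):
        _, h, r = field.split("#")
        heads.append(int(h))
        rels.append(r)
    return heads, rels
```

A round trip on a two-token sentence: `encode_tree([2, 0], ["nsubj", "root"])` yields `"1#2#nsubj|2#0#root"`, and `decode_tree` recovers the original lists.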

4. Evaluation Metrics and Empirical Results

Dependency parsing performance is measured using Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS):

\[ \text{UAS} = \frac{\#\{i : \hat h_i = h_i\}}{\#\text{tokens}} \times 100\% \]

\[ \text{LAS} = \frac{\#\{i : (\hat h_i = h_i) \land (\hat r_i = r_i)\}}{\#\text{tokens}} \times 100\% \]
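Both metrics reduce to token-level counts; a minimal Python implementation of the two formulas:

```python
def attachment_scores(gold_heads, gold_rels, pred_heads, pred_rels):
    """Compute (UAS, LAS) in percent. UAS counts tokens whose predicted
    head matches gold; LAS additionally requires the relation label."""
    n = len(gold_heads)
    uas_hits = sum(gh == ph for gh, ph in zip(gold_heads, pred_heads))
    las_hits = sum(gh == ph and gr == pr
                   for gh, gr, ph, pr in zip(gold_heads, gold_rels,
                                             pred_heads, pred_rels))
    return 100.0 * uas_hits / n, 100.0 * las_hits / n
```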

On the DCS Vedic dataset, ByT5-Sanskrit achieves the following results compared to a biaffine baseline:

Gold features  Parser         UAS    LAS
None           Biaffine       77.68  70.67
None           ByT5-Sanskrit  86.54  81.54
All            Biaffine       86.86  81.98
All            ByT5-Sanskrit  89.04  84.58
  • "None": only surface forms provided. "All": gold POS, morphosyntax, and punctuation added as input.
  • ByT5-Sanskrit surpasses the best prior graph-based parsers by +8.86 UAS/+10.87 LAS (“None”) and +2.18 UAS/+2.60 LAS (“All”) (Nehrdich et al., 2024).

Transition-based and graph-based models (YAP, L2S, BiAFF, DCST) trained on classical Sanskrit achieve up to 81.08 UAS / 74.37 LAS on prose, but performance collapses on metrical (verse) word-order inputs (e.g., DCST: 40.02/35.70). All models struggle with out-of-domain verse due to non-projectivity and free word order—mirroring Vedic text challenges (Krishna et al., 2020).
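The non-projectivity that degrades these models can be checked directly on a head list; a minimal sketch, assuming 1-indexed tokens with head 0 marking the root:

```python
def is_projective(heads):
    """True if no two dependency arcs cross. A tree is non-projective
    when some arc's span contains exactly one endpoint of another arc."""
    arcs = [tuple(sorted((dep, head)))
            for dep, head in enumerate(heads, start=1) if head != 0]
    for i, (a1, b1) in enumerate(arcs):
        for a2, b2 in arcs[i + 1:]:
            # Arcs (a1,b1) and (a2,b2), each normalized left-to-right,
            # cross iff one starts strictly inside the other and ends
            # strictly outside it.
            if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
                return False
    return True
```

For example, `[2, 0, 2]` (tokens 1 and 3 both attach to token 2) is projective, while `[0, 4, 1, 1]` contains the crossing arcs (1,3) and (2,4) and is not.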

5. Morphological Richness and Model Adaptation

Vedic Sanskrit presents extensive morphological ambiguity (e.g., ablative vs. genitive singular endings, polysemous lemmas, and rare participial forms). ByT5-Sanskrit’s byte-level approach, which operates directly on UTF-8 bytes of IAST text, avoids subword or morphological vocabulary bottlenecks and generalizes across sandhi-induced variation and novel inflectional forms. Manual error analysis confirms that performance dips primarily on rare or morphologically ambiguous constructions; ablation of gold features (removal of POS, morphosyntax) reduces LAS by only ~2.5 points, indicating that the model’s learned representations recover substantial morphosyntactic structure from surface form alone (Nehrdich et al., 2024).
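As a minimal illustration of the byte-level view (raw UTF-8 bytes of IAST text; the actual ByT5 vocabulary additionally reserves a few special-token ids, which this sketch ignores):

```python
# IAST text is fed to the model as its raw UTF-8 byte sequence, so no
# subword vocabulary is needed and diacritics are handled uniformly.
text = "agnim īḷe"  # opening words of the Ṛgveda in IAST
byte_ids = list(text.encode("utf-8"))

# Plain ASCII letters occupy one byte each, while 'ī' (U+012B) takes
# two bytes and 'ḷ' (U+1E37) takes three: 9 characters -> 12 bytes.
assert len(text) == 9 and len(byte_ids) == 12
```

Because every possible byte is already in the vocabulary, unseen sandhi variants or rare inflectional forms never produce out-of-vocabulary tokens.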

In legacy neural architectures, character-n-gram and morph-tag embeddings provide signals to disambiguate forms, but their robustness is more limited when confronted with free word order and sandhi (Krishna et al., 2020).

6. Error Analysis, Limitations, and Directions for Improvement

Manual analysis of model outputs reveals that errors most often arise from (i) subtle morphological ambiguities, (ii) rare or highly irregular participles, and (iii) annotation inconsistencies in gold data. ByT5-Sanskrit sometimes predicts dependencies that “correct” errors present in the DCS annotation, suggesting a role for model outputs in corpus refinement (Nehrdich et al., 2024).

Persistent limitations include unresolved homonymy (7.5% of DCS lemmas are polysemous), poor handling of rare/irregular morphological forms, and challenges with extremely long sentences or non-projective structures—particularly pronounced in free-order poetic or Vedic texts. All neural systems studied experience a sharp drop in UAS/LAS on verse-ordered test sets, losing more than half their accuracy compared to prose (Krishna et al., 2020).

Proposed improvements include:

  • Joint learning of lemma ID and syntactic head assignment (to address homonymy).
  • Expanding multitask approaches to simultaneously perform morphological tagging and dependency parsing.
  • Data augmentation with synthesized verse-order treebanks and cross-lingual transfer.
  • Integrating Paninian or karaka-based constraints into the decoding objective to narrow the search space under free ordering (Krishna et al., 2020, Nehrdich et al., 2024).

7. Implications for Sanskrit NLP and Broader Applications

ByT5-Sanskrit establishes a new state of the art for dependency parsing in Vedic Sanskrit, achieving robust performance without custom graph layers, feature engineering, or explicit tokenization. Its byte-level, end-to-end modeling strategy is deployable as a single large multitask model not only for parsing, but also for segmentation, lemmatization, and morphosyntactic tagging, with documented use as a preprocessing component in machine translation pipelines and information retrieval systems.

These advances suggest that byte-level pretrained models can generalize effectively to other morphologically rich, low-resource languages and motivate investigations of cross-lingual dependency modeling and tree-to-tree transduction in historical linguistics and philology (Nehrdich et al., 2024).


References:

"One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks" (Nehrdich et al., 2024)
"Neural Approaches for Data Driven Dependency Parsing in Sanskrit" (Krishna et al., 2020)
