L3Cube-MahaSTS: A Marathi Sentence Similarity Dataset and Models

Published 29 Aug 2025 in cs.CL and cs.LG | (2508.21569v1)

Abstract: We present MahaSTS, a human-annotated Sentence Textual Similarity (STS) dataset for Marathi, along with MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The MahaSTS dataset consists of 16,860 Marathi sentence pairs labeled with continuous similarity scores in the range of 0-5. To ensure balanced supervision, the dataset is uniformly distributed across six score-based buckets spanning the full 0-5 range, thus reducing label bias and enhancing model stability. We fine-tune the MahaSBERT model on this dataset and benchmark its performance against other alternatives like MahaBERT, MuRIL, IndicBERT, and IndicSBERT. Our experiments demonstrate that MahaSTS enables effective training for sentence similarity tasks in Marathi, highlighting the impact of human-curated annotations, targeted fine-tuning, and structured supervision in low-resource settings. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP


Summary

The paper presents a balanced, human-annotated Marathi STS dataset with 16,860 sentence pairs labeled across uniformly distributed similarity buckets.
It introduces MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model with MEAN pooling that reaches Pearson 0.9600 and Spearman 0.9523 on the MahaSTS test set, outperforming the monolingual and multilingual baselines.
The findings emphasize that targeted, monolingual training and structured supervision improve semantic similarity tasks in low-resource languages like Marathi.

L3Cube-MahaSTS: A Human-Annotated Marathi Sentence Similarity Dataset and Models

Introduction

The paper presents L3Cube-MahaSTS, a large-scale, human-annotated Semantic Textual Similarity (STS) dataset for Marathi, alongside MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The work addresses the paucity of high-quality, labeled resources for Marathi, a low-resource Indic language, and demonstrates the efficacy of targeted fine-tuning and structured supervision for sentence similarity tasks. The dataset and models are made publicly available, facilitating further research and practical deployment in Marathi NLP.

Dataset Curation and Annotation

MahaSTS comprises 16,860 sentence pairs, each labeled with a continuous similarity score in the range 0–5. The annotation protocol employs six uniformly distributed buckets, each containing 2,810 pairs, to mitigate label bias and promote stable regression learning. The buckets are defined as follows:

Bucket 0: No semantic similarity
Bucket 1: Minimal similarity (0.1–1.0)
Bucket 2: Partial thematic overlap (1.1–2.0)
Bucket 3: Moderate similarity (2.1–3.0)
Bucket 4: High similarity (3.1–4.0)
Bucket 5: Near or full equivalence (4.1–5.0)
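The bucket boundaries above can be expressed as a small helper. This is a minimal sketch, not the authors' code: the function name `score_to_bucket` is hypothetical. It assumes that bucket 0 holds only the exact score 0 and that any score strictly between 0 and 0.1 falls into bucket 1, a case the listed ranges do not cover.

```python
import math

TOTAL_PAIRS = 16_860
NUM_BUCKETS = 6
# Uniform distribution: 16,860 / 6 = 2,810 pairs per bucket.
PAIRS_PER_BUCKET = TOTAL_PAIRS // NUM_BUCKETS


def score_to_bucket(score: float) -> int:
    """Map a continuous similarity score in [0, 5] to a bucket index 0-5.

    Assumption: bucket 0 is the exact score 0, and bucket k (k >= 1)
    covers the half-open interval (k - 1, k].
    """
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score must lie in [0, 5], got {score}")
    return 0 if score == 0 else math.ceil(score)
```

For example, `score_to_bucket(3.1)` and `score_to_bucket(4.0)` both fall in bucket 4, matching the "High similarity (3.1–4.0)" range.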

Sentence pairs were sourced from the L3Cube-MahaCorpus (1M sentences), filtered for quality, and paired using cosine similarity of MahaSBERT-STS embeddings. Human annotators refined the labels, discarding incomplete or nonsensical pairs, resulting in a balanced and high-quality dataset.
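The summary does not spell out the exact pairing procedure, so the following is only an illustrative sketch of scoring candidate pairs by cosine similarity of their sentence embeddings. The function names and the toy 2-dimensional vectors are hypothetical; in the described pipeline the embeddings would come from MahaSBERT-STS.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidate_pairs(
    embeddings: List[Sequence[float]],
) -> List[Tuple[int, int, float]]:
    """Score every unordered pair of sentences, most similar first."""
    scored = [
        (i, j, cosine_similarity(embeddings[i], embeddings[j]))
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Toy stand-ins for sentence embeddings (hypothetical values).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ranked = rank_candidate_pairs(toy_embeddings)
```

Scoring all pairs is quadratic in the number of sentences, so a corpus of one million sentences would need an approximate nearest-neighbour index or sampling instead of this exhaustive loop.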

Figure 1: Examples of sentence pairs with labels in the range 0–5 from the L3Cube-MahaSTS dataset.

The dataset is split into train (85%), test (10%), and validation (5%) sets, with each split preserving the uniform bucket distribution. This design is intended to keep evaluation representative of every similarity level and to reduce the risk of overfitting to any one of them.
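A bucket-preserving split of this kind can be sketched as follows. The function `stratified_split` and its rounding rule are assumptions, since the summary gives only the percentages: 2,810 pairs per bucket do not divide evenly into 85/5 shares, so this sketch floors the test and validation counts and gives the remainder to train.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Tuple, TypeVar

T = TypeVar("T")


def stratified_split(
    items: List[T],
    bucket_of: Callable[[T], Hashable],
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Split items into (train, test, validation) at roughly 85/10/5 per bucket."""
    by_bucket = defaultdict(list)
    for item in items:
        by_bucket[bucket_of(item)].append(item)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test, val = [], [], []
    for bucket in sorted(by_bucket):
        group = by_bucket[bucket]
        rng.shuffle(group)
        n_test = len(group) * 10 // 100  # 10%, floored
        n_val = len(group) * 5 // 100    # 5%, floored
        test.extend(group[:n_test])
        val.extend(group[n_test:n_test + n_val])
        train.extend(group[n_test + n_val:])  # remainder, about 85%
    return train, test, val
```

Under this rounding rule each bucket contributes 2,389 train, 281 test, and 140 validation pairs; the paper's exact split sizes may differ.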

Model Architectures and Training

The primary model, MahaSBERT-STS-v2, is a Sentence-BERT variant for Marathi, initialized from MahaSBERT (trained on IndicXNLI) and fine-tuned on MahaSTS. Training uses CosineSimilarityLoss, the AdamW optimizer, a learning rate of $1 \times 10^{-5}$, a batch size of 8, and MEAN pooling over token embeddings, and runs for 2 epochs.
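To make the training objective concrete, here is a dependency-free sketch of what a cosine-similarity regression loss computes: the mean squared error between the cosine similarity of the two sentence embeddings and the gold label. Dividing the 0–5 scores by 5 to bring them into [0, 1] is an assumption, as the summary does not state how labels were scaled. The function names and the `TRAINING_CONFIG` dictionary are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

# Hyperparameters as reported in the text above.
TRAINING_CONFIG = {
    "loss": "CosineSimilarityLoss",
    "optimizer": "AdamW",
    "learning_rate": 1e-5,
    "batch_size": 8,
    "pooling": "MEAN",
    "epochs": 2,
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(
    embedding_pairs: List[Tuple[Sequence[float], Sequence[float]]],
    gold_scores: List[float],
    max_score: float = 5.0,  # assumption: labels are rescaled from [0, 5] to [0, 1]
) -> float:
    """Mean squared error between predicted cosine similarity and scaled label."""
    squared_errors = [
        (cosine(u, v) - score / max_score) ** 2
        for (u, v), score in zip(embedding_pairs, gold_scores)
    ]
    return sum(squared_errors) / len(squared_errors)
```

A pair of identical embeddings labeled 5 and a pair of orthogonal embeddings labeled 0 both contribute zero loss, which is the behaviour the regression objective rewards.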

Baseline models include:

MahaBERT: Marathi BERT, fine-tuned on monolingual corpora.
MuRIL: Multilingual BERT for 17 Indian languages.
IndicBERT: Multilingual ALBERT for 12 Indic languages.
IndicSBERT: Sentence-BERT for 10 Indic languages, fine-tuned on NLI.

Pooling strategies (CLS, MEAN, MAX) are systematically evaluated, with MEAN pooling yielding the highest correlation with human judgments.
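The three pooling strategies differ only in how per-token embeddings are collapsed into one sentence vector. The sketch below uses plain Python lists and toy numbers; the function names are hypothetical, and it assumes the first token is the CLS token and that an attention mask marks padding positions with 0.

```python
from typing import List, Sequence


def cls_pooling(token_embeddings: List[Sequence[float]]) -> List[float]:
    """Use the embedding of the first ([CLS]) token as the sentence vector."""
    return list(token_embeddings[0])


def mean_pooling(
    token_embeddings: List[Sequence[float]], attention_mask: Sequence[int]
) -> List[float]:
    """Average the embeddings of all non-padding tokens, per dimension."""
    kept = [t for t, m in zip(token_embeddings, attention_mask) if m]
    return [sum(column) / len(kept) for column in zip(*kept)]


def max_pooling(
    token_embeddings: List[Sequence[float]], attention_mask: Sequence[int]
) -> List[float]:
    """Take the per-dimension maximum over all non-padding tokens."""
    kept = [t for t, m in zip(token_embeddings, attention_mask) if m]
    return [max(column) for column in zip(*kept)]


# Three tokens with 2-dimensional embeddings; the last one is padding.
tokens = [[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]]
mask = [1, 1, 0]
```

With these toy values, CLS pooling gives `[1.0, 4.0]`, MEAN pooling gives `[2.0, 2.0]`, and MAX pooling gives `[3.0, 4.0]`; the padding token is ignored by the latter two.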

Experimental Results

On the MahaSTS test set, MahaSBERT-STS-v2 achieves a Pearson correlation of 0.9600 and a Spearman correlation of 0.9523, outperforming all baselines. The results support the claim that task-specific fine-tuning of a monolingual sentence encoder outperforms multilingual or generic approaches for sentence similarity in Marathi, although the margin over the strongest baseline, the multilingual IndicSBERT, is under 0.01 on both metrics.

Key findings:

MahaSBERT-STS-v2 (MEAN pooling): Pearson 0.9600, Spearman 0.9523
MahaBERT: Pearson 0.9483, Spearman 0.9386
MuRIL: Pearson 0.9361, Spearman 0.9267
IndicSBERT: Pearson 0.9515, Spearman 0.9441
IndicBERT: Pearson 0.7311, Spearman 0.7004

Pooling strategy analysis reveals MEAN pooling as optimal, with CLS and MAX pooling trailing slightly in performance.
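For readers unfamiliar with the two metrics reported in this section: Pearson measures linear agreement between predicted and gold scores, while Spearman applies the same formula to their ranks and so rewards any monotonic relationship. The following is a standard-library sketch of both, not the evaluation script used in the paper; the toy scores are made up for illustration.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy gold and predicted scores: monotonic but not linear.
gold = [0.0, 1.0, 2.0, 3.0]
predicted = [0.0, 1.0, 4.0, 9.0]
```

On the toy data the ranking is perfectly preserved, so Spearman is 1.0, while Pearson drops below 1 because the relationship is not linear.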

Discussion

The results demonstrate that human-annotated, balanced datasets are critical for robust semantic similarity modeling in low-resource languages. The uniform bucket distribution in MahaSTS reduces label bias, enabling stable regression and generalization. The superiority of MahaSBERT-STS-v2 over multilingual models corroborates prior findings in other languages, emphasizing the value of monolingual, domain-specific pretraining and fine-tuning.

The dataset's design and annotation protocol facilitate nuanced semantic modeling, capturing a spectrum from complete dissimilarity to near equivalence. However, the model's generalization to longer or more complex sentences remains limited, a common issue in SBERT architectures for Indic languages. Addressing this requires future datasets with greater syntactic and semantic diversity.

Practical and Theoretical Implications

Practically, MahaSTS and MahaSBERT-STS-v2 enable high-fidelity semantic similarity estimation for Marathi, supporting downstream tasks such as retrieval-augmented generation (RAG), information retrieval (IR), question answering (QA), paraphrase detection, and clustering. The public release of the dataset and models lowers the barrier for Marathi NLP research and deployment.

Theoretically, the work reinforces the importance of structured supervision and balanced annotation in regression-based NLP tasks. It also highlights the limitations of cross-lingual transfer and machine translation for semantic tasks in culturally rich, low-resource languages.

Future Directions

Potential avenues for future research include:

Expansion of MahaSTS to cover longer, more complex sentences and diverse domains.
Exploration of contrastive learning and advanced pooling strategies for improved sentence representations.
Cross-lingual transfer experiments leveraging MahaSTS for other Indic languages.
Integration of MahaSBERT-STS-v2 into large-scale retrieval and generation pipelines for Marathi.

Conclusion

L3Cube-MahaSTS establishes a new benchmark for semantic textual similarity in Marathi, providing a balanced, human-annotated dataset and a high-performing, fine-tuned SBERT model. The work demonstrates that targeted, monolingual fine-tuning on carefully curated data yields superior results in low-resource settings. The dataset and models are poised to accelerate research and practical applications in Marathi NLP, with implications for other Indic languages and low-resource domains.
