JUST-NLP 2025 Legal MT Shared Task
- JUST-NLP 2025 Legal MT Shared Task is a benchmark framework for sentence-level translation of legal texts from English to Hindi using a 60,000-pair dataset.
- Domain adaptation via fine-tuning a pre-trained OPUS-MT model achieved 46.03 BLEU, outperforming models trained from scratch and enhancing fluency in legal language.
- The task offers practical benefits by reducing post-editing efforts and supporting scalable multilingual legal workflows in jurisdictions like India.
Legal Machine Translation (L-MT) addresses the critical challenge of language barriers in legal communication within multilingual jurisdictions such as India, where most legal and judicial documentation is in English. The JUST-NLP 2025 Legal MT Shared Task provided a standard evaluation framework for English→Hindi sentence-level translation of legal texts, with systems benchmarked on their ability to accurately and fluently translate domain-specific materials, thereby improving access to justice and supporting scalable multilingual legal workflows (Barman et al., 21 Dec 2025).
1. Shared Task Definition and Dataset
The JUST-NLP 2025 Legal MT Shared Task focused on the sentence-level translation of legal texts from English to Hindi. Participants were provided a parallel corpus comprising 60,000 pairs of English–Hindi legal sentences, partitioned as follows: 50,000 for training, 5,000 for validation (withheld references), and 5,000 for testing (references released after submission). The source material spanned statutes, judgments, and notifications, representing typical legal clause length variability and long-range dependencies. Preprocessing pipelines included standard tokenization with SentencePiece (vocabulary size 32,000 for from-scratch systems), truncation/padding to a maximum of 128 tokens for fine-tuned systems, and text normalization per HuggingFace/Marian defaults (e.g., lowercasing, punctuation normalization).
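A minimal sketch of this preprocessing pipeline is shown below, assuming the corpus is exposed as English/Hindi string fields (`en`/`hi`) and that the pre-trained checkpoint is Helsinki-NLP/opus-mt-en-hi; the exact checkpoint name and field names are not stated in the task description and are assumed here for illustration.

```python
from transformers import AutoTokenizer

MAX_LEN = 128  # sequence cap reported for the fine-tuned systems

# Assumed checkpoint; the task description names OPUS-MT but not a specific model card.
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")

def preprocess(batch):
    """Tokenize English sources and Hindi references, truncating/padding to 128 tokens."""
    model_inputs = tokenizer(
        batch["en"], max_length=MAX_LEN, truncation=True, padding="max_length"
    )
    labels = tokenizer(
        text_target=batch["hi"], max_length=MAX_LEN, truncation=True, padding="max_length"
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# The from-scratch system instead trains its own unigram SentencePiece model
# (vocabulary 32,000) on the legal corpus, e.g.:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(input="train.en-hi.txt", model_prefix="legal_spm",
#                                  vocab_size=32000, model_type="unigram")
```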
2. Model Architectures and Training Regimes
2.1 Pre-trained Baseline and Fine-tuning
The baseline employed an OPUS-MT Marian-style Transformer, with 6 encoder and 6 decoder layers (model dimension 512, feed-forward dimension 2048, 8 attention heads per block, sinusoidal positional embeddings). For fine-tuning, this model was further adapted to the legal domain using only the provided legal corpus, with the following configuration identified as optimal: AdamW optimizer (learning rate 2×10⁻⁵, weight decay 0.01), batch size 32, a maximum of 3–5 epochs with early stopping, and a strict sequence cap of 128 tokens to balance context preservation against memory constraints. No additional adapters were introduced; domain adaptation relied solely on continued training with in-domain pairs.
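A minimal fine-tuning sketch under these hyperparameters, using the Hugging Face Seq2SeqTrainer (whose default optimizer is AdamW). The checkpoint name, the early-stopping patience, and the `train_ds`/`val_ds` variables (tokenized training and validation splits, e.g. produced by the preprocessing sketch in Section 1) are assumptions for illustration, not details reported by the task.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "Helsinki-NLP/opus-mt-en-hi"          # assumed OPUS-MT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-en-hi-legal",
    learning_rate=2e-5,                 # low LR for continued, domain-adaptive training
    weight_decay=0.01,                  # AdamW weight decay
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,                 # upper bound; early stopping typically halts at 3-5 epochs
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,             # assumed: tokenized 50k training split
    eval_dataset=val_ds,                # assumed: tokenized 5k validation split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],  # illustrative patience
)
trainer.train()
```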
2.2 Transformer Trained from Scratch
A custom Transformer was also trained entirely from scratch on the same legal data, using a SentencePiece unigram vocabulary (size 32,000), 4 encoder and 4 decoder layers (model dimension 128, inner-FF 512), 8 attention heads, dropout 0.1, and a maximum sequence length of 256. The optimization followed the standard recipe (Adam with β₁=0.9, β₂=0.98, warmup and inverse square-root learning rate decay, batch size 32). Regularization included label smoothing (0.1), with training continued for approximately 50,000 steps until validation convergence.
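A compact PyTorch sketch of this configuration follows. The hyperparameters mirror those reported above; the learned positional embeddings, warmup length, and training-loop details are simplifications assumed here, since the text does not specify them.

```python
import math
import torch
import torch.nn as nn

VOCAB, D_MODEL, FF, HEADS, LAYERS, DROPOUT, MAX_LEN = 32_000, 128, 512, 8, 4, 0.1, 256

class LegalTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)   # learned positions (assumed; type not stated)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=HEADS,
            num_encoder_layers=LAYERS, num_decoder_layers=LAYERS,
            dim_feedforward=FF, dropout=DROPOUT, batch_first=True,
        )
        self.generator = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src_ids, tgt_ids):
        # src_ids: (batch, src_len), tgt_ids: (batch, tgt_len) of SentencePiece ids
        src = self.embed(src_ids) * math.sqrt(D_MODEL) \
            + self.pos(torch.arange(src_ids.size(1), device=src_ids.device))
        tgt = self.embed(tgt_ids) * math.sqrt(D_MODEL) \
            + self.pos(torch.arange(tgt_ids.size(1), device=tgt_ids.device))
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1)).to(src_ids.device)
        return self.generator(self.transformer(src, tgt, tgt_mask=causal))

model = LegalTransformer()
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing 0.1
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

def inv_sqrt_schedule(step, warmup=4000):
    # Warmup followed by inverse square-root decay ("Attention Is All You Need" schedule).
    step = max(step, 1)
    return D_MODEL ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inv_sqrt_schedule)
# Training then proceeds with batch size 32 for roughly 50,000 steps,
# calling optimizer.step() and scheduler.step() after each update.
```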
3. Evaluation Metrics and Methodology
Performance was assessed using a comprehensive suite of machine translation metrics (a brief scoring sketch follows the list):
- SacreBLEU: Geometric mean of n-gram precisions with brevity penalty: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where $p_n$ is the clipped n-gram precision, $w_n = 1/N$ (typically $N = 4$), and $\mathrm{BP}$ is the brevity penalty.
- chrF++: Character n-gram F-score (word unigrams included): $\mathrm{chrF}_{\beta} = (1+\beta^2)\,\frac{\mathrm{chrP} \cdot \mathrm{chrR}}{\beta^2 \cdot \mathrm{chrP} + \mathrm{chrR}}$, usually $\beta = 2$; precision and recall are computed over character 1–6-grams and word unigrams.
- TER (Translation Edit Rate): Minimum number of edits (insertions, deletions, substitutions, shifts) needed to match the reference, normalized by reference length and reported as a percentage; lower is better.
- ROUGE: Includes ROUGE-N (n-gram recall) and ROUGE-L (longest common subsequence F1).
- BERTScore: Token-level cosine similarity between BERT-embedded reference and hypothesis, aggregated into F1.
- METEOR: Word-level matching (unigram, stemming, synonymy), F-score penalized for fragmentation.
- COMET: Neural metric trained by regression on human quality judgments, scoring source/hypothesis/reference triplets.
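A minimal scoring sketch for these metrics, assuming plain-text lists of sources, hypotheses, and references; it relies on the sacrebleu and Hugging Face evaluate libraries rather than any task-provided scoring script, and the example sentences are purely illustrative. BERTScore and COMET additionally download pretrained neural models.

```python
import sacrebleu
import evaluate

srcs = ["Section 5 of the Act shall apply."]      # English sources (illustrative)
hyps = ["अधिनियम की धारा 5 लागू होगी।"]              # system outputs
refs = ["अधिनियम की धारा 5 लागू होगी।"]              # references (one per hypothesis)

bleu = sacrebleu.corpus_bleu(hyps, [refs])                   # SacreBLEU
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)     # chrF++ (adds word n-grams)
ter  = sacrebleu.corpus_ter(hyps, [refs])                    # TER

# Note: the default ROUGE/METEOR tokenizers are English-oriented;
# Devanagari output may need custom tokenization for reliable scores.
rouge = evaluate.load("rouge").compute(predictions=hyps, references=refs)
meteor = evaluate.load("meteor").compute(predictions=hyps, references=refs)
bertscore = evaluate.load("bertscore").compute(predictions=hyps, references=refs, lang="hi")
# COMET also needs the source side and downloads a large model:
#   comet = evaluate.load("comet").compute(sources=srcs, predictions=hyps, references=refs)

print(f"BLEU {bleu.score:.2f}  chrF++ {chrf.score:.2f}  TER {ter.score:.2f}")
```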
4. Comparative Results
Quantitative results demonstrated a substantial advantage for domain-adaptive fine-tuning over baseline and from-scratch approaches.
| Model | BLEU ↑ | chrF++ ↑ | TER ↓ | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ | BERTScore F1 ↑ | METEOR ↑ | COMET ↑ |
|---|---|---|---|---|---|---|---|---|---|
| OPUS-MT (baseline) | 9.39 | 27.66 | 83.40 | 36.30 | 13.38 | 32.93 | 76.91 | 30.25 | 50.80 |
| Transformer (from scratch) | 37.77 | 60.88 | 59.72 | 35.98 | 13.62 | 35.69 | 88.37 | 65.58 | 64.29 |
| OPUS-MT (fine-tuned) | 46.03 | 70.59 | 42.08 | 72.42 | 52.63 | 69.05 | 91.19 | 71.84 | 73.72 |
The fine-tuned OPUS-MT achieved a SacreBLEU of 46.03, outpacing the custom from-scratch model by roughly 8 BLEU points and the baseline by nearly 37 BLEU points. Similar trends were observed across chrF++, TER, ROUGE, BERTScore, METEOR, and COMET, emphasizing both fluency and adequacy gains due to domain adaptation.
5. Analysis of Domain Adaptation vs. In-Domain Training
The superior performance of the fine-tuned OPUS-MT model is attributable to pre-trained bilingual priors developed on millions of general-domain pairs, which expedite adaptation to legal terminology and style even with only 50k domain-specific pairs. Fine-tuning these priors yields rapid gains in terminology consistency and reduces errors typical in L-MT, especially in handling specialized vocabulary and long-range clause dependencies. No catastrophic forgetting was observed over several adaptation epochs.
Training entirely from scratch, even with a model sized to the domain corpus, resulted in limited generalization, lower fluency, and higher TER. The shallow architecture mitigated overfitting risk at this data scale, but coverage and lexical richness were diminished relative to the transfer-based approach. This suggests that 50,000 in-domain pairs are insufficient to support deeper or larger models trained only on in-domain data.
6. Practical Implications and Recommendations
The strong automatic scores of the domain-adapted L-MT system (46 BLEU, 42 TER) imply low post-editing effort, supporting integration into professional legal workflows and enhancing translation throughput for courts and regulatory bodies. The approach scales to other scheduled Indian languages, enabling deployment for additional multilingual contexts. Recommendations include:
- Prefer strong multilingual or bilingual baselines with domain adaptation over training deep models from scratch on moderate in-domain data.
- Cap sequence length judiciously (128 tokens) to avoid truncating legal clauses while managing GPU memory demands.
- Adopt domain adaptation best practices (AdamW, low learning rate, early stopping on validation).
- Future research should investigate parameter-efficient adaptations such as LoRA and QLoRA (see the illustrative sketch after this list), back-translation augmentation, and human-in-the-loop validation processes.
- Task extension to more languages, document-level context, and expert adequacy evaluations is encouraged.
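As a purely illustrative sketch of the parameter-efficient direction mentioned above (not a configuration used by any reported system), LoRA adapters could be attached to the same OPUS-MT checkpoint via the peft library; the checkpoint name and the Marian-style target module names (q_proj, v_proj) are assumptions.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-hi")  # assumed checkpoint

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,      # illustrative adapter sizes
    target_modules=["q_proj", "v_proj"],         # Marian attention projections (assumed targets)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()               # typically well under 1% of all weights
# The wrapped model can then be trained with the same Seq2SeqTrainer setup as in Section 2.1.
```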
7. Lessons Learned and Future Directions
Results from the JUST-NLP 2025 Legal MT Shared Task reinforce the effectiveness of domain-adaptive pre-training for specialized L-MT settings. The findings illustrate that scalable, high-quality legal translation is feasible with limited in-domain parallel data when leveraging transfer from strong general-purpose MT models. A plausible implication is that further advances may be obtained by integrating parameter-efficient fine-tuning, more sophisticated data augmentation, and context modeling for longer legal spans. The shared task framework provides a foundation for systematic progress in L-MT and broader access to legal information and justice in multilingual societies (Barman et al., 21 Dec 2025).