PARAMANU-AYN: Pretrain from scratch or Continual Pretraining of LLMs for Legal Domain Adaptation? (2403.13681v2)
Abstract: In this paper, we present Paramanu-Ayn, a collection of legal LLMs trained exclusively on Indian legal case documents. This 97-million-parameter auto-regressive (AR) decoder-only model was pretrained from scratch with a context size of 8192 on a single GPU in just 185 hours, achieving a model FLOPs utilization (MFU) of 41.35%. We also developed a BPE tokenizer specialized for the legal domain. We evaluated our model using perplexity and two zero-shot tasks: case judgment prediction with explanation and abstractive case summarization. On the case judgment prediction with explanation task, Paramanu-Ayn outperformed Llama-2 7B and Gemini-Pro in test accuracy by nearly 2 percentage points, despite being 72 times smaller. In zero-shot abstractive summarization, it surpassed decoder-only LLMs generating fixed-length summaries (5000 tokens) by over 10 percentage points in BLEU and METEOR, and by nearly 4 percentage points in BERTScore. Further evaluations on zero-shot commonsense and mathematical benchmarks showed that Paramanu-Ayn excelled despite being trained exclusively on legal documents, outperforming Llama-1, Llama-2, and Falcon on AGIEval AQuA-RAT and AGIEval SAT-Math. We also instruction-tuned our model on 10,763 diverse legal tasks, including legal clause generation, legal drafting, and case summarization. The Paramanu-Ayn-instruct model was rated above 8 out of 10 by GPT-3.5-Turbo on clarity, relevance, completeness, and legal reasoning. We found that our models were able to learn drafting knowledge and generalize to drafting legal contracts and legal clauses with limited instruction tuning. Hence, we conclude that for a strong domain-specialized generative LLM (such as a legal one), pretraining from scratch on domain data is more cost-effective and environmentally friendly, and remains competitive with, or even outperforms, adapting larger general-purpose LLMs to legal domain tasks.
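For context on the reported MFU of 41.35%: model FLOPs utilization is conventionally the ratio of achieved training throughput to the hardware's peak throughput, with training FLOPs for a decoder-only model of N parameters trained on D tokens usually approximated as 6ND. A hedged restatement of that convention (not necessarily the exact accounting used in the paper):

$$\mathrm{MFU} \;\approx\; \frac{6\,N\,D \,/\, t_{\mathrm{wall}}}{\text{peak FLOP/s of the GPU}}$$

The abstract also mentions a legal-domain-specialized BPE tokenizer. Below is a minimal, hypothetical sketch of how such a tokenizer could be trained with the Hugging Face `tokenizers` library; the vocabulary size, special tokens, and corpus path are illustrative assumptions, not settings reported in the paper.

```python
# Minimal sketch: training a domain-specialized byte-level BPE tokenizer on a
# legal text corpus with the Hugging Face `tokenizers` library.
# Vocabulary size, special tokens, and the corpus path are illustrative
# assumptions, not details from the paper.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                          # hypothetical vocabulary size
    special_tokens=["<bos>", "<eos>", "<pad>"],  # hypothetical special tokens
    show_progress=True,
)
tokenizer.train(files=["indian_legal_cases.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("legal_bpe_tokenizer.json")

# Legal boilerplate should segment into fewer, more meaningful subwords than a
# general-purpose tokenizer would produce.
print(tokenizer.encode("The appellant filed a special leave petition under Article 136.").tokens)
```

A tokenizer fit to the target corpus in this way typically shortens sequences of domain text, which is one reason domain-specialized pretraining from scratch can be compute-efficient.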