SaulLM-7B: A pioneering Large Language Model for Law (2403.03883v2)

Published 6 Mar 2024 in cs.CL

Abstract: In this paper, we introduce SaulLM-7B, an LLM tailored for the legal domain. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens. SaulLM-7B exhibits state-of-the-art proficiency in understanding and processing legal documents. Additionally, we present a novel instructional fine-tuning method that leverages legal datasets to further enhance SaulLM-7B's performance in legal tasks. SaulLM-7B is released under the MIT License.

Exploring the Legal Frontier: Insights from the SaulLM-7B LLM

Introduction

The landscape of artificial intelligence and language modeling has seen significant advancements, yet the legal domain has often remained on the periphery of these technological leaps. Addressing this gap, the recently developed SaulLM-7B emerges as a pioneering effort to tailor LLMs to the intricacies of legal text comprehension and generation. This initiative focuses not only on enhancing legal document processing but also on contributing to the broader application of LLMs within the field of law.

The SaulLM-7B Family and Its Innovations

A Tailored Approach to Legal Language

At its core, SaulLM-7B is built upon the Mistral 7B architecture and is extensively pre-trained on a vast corpus of over 30 billion tokens, specifically curated from the legal domain. This dedicated approach ensures that SaulLM-7B can navigate the nuanced terrain of legal jargon and syntax more effectively than its generalist counterparts. Furthermore, the introduction of SaulLM-7B-Instruct, an instruction-tuned variant of SaulLM-7B, signifies a leap towards improved performance on legal tasks by incorporating both generic and customized legal instructions during its fine-tuning phase.
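
Because the weights are publicly released, the model can be queried like any other Hugging Face causal LM. The snippet below is a minimal loading-and-generation sketch; the repository identifier "Equall/Saul-7B-Instruct-v1" is an assumption and should be checked against the official release.

```python
# Minimal sketch: load a released SaulLM-7B checkpoint and generate a reply.
# The model identifier below is assumed, not confirmed by the paper text.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Equall/Saul-7B-Instruct-v1"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize the doctrine of consideration in contract law."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```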

Enhanced Evaluation Protocols

The paper also introduces LegalBench-Instruct, a refined benchmark that promises to offer more nuanced insights into the legal proficiency of LLMs. This improved benchmark, supplemented by legal tasks from the popular MMLU benchmark, sets a new standard for evaluating the capabilities of legal LLMs, encouraging more focused advancements in the domain.
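
As a rough illustration of how such a benchmark is consumed, the sketch below runs a zero-shot evaluation over (instruction, reference-label) pairs and scores generations with a simple normalized exact match. The actual prompt format and metric used for LegalBench-Instruct may differ; this is only an assumed protocol.

```python
# Hypothetical evaluation loop in the spirit of LegalBench-Instruct:
# prompt the model with each task instruction and compare the generated
# label to the reference after light normalization.
def normalize(text: str) -> str:
    return text.strip().lower()

def evaluate(model, tokenizer, examples):
    """examples: list of (instruction, reference_label) pairs."""
    correct = 0
    for instruction, reference in examples:
        inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        generated = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        correct += normalize(generated) == normalize(reference)
    return correct / len(examples)
```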

Diving Deeper: Training and Data Considerations

The Training Journey

The development of \ourmodel{} involved a rigorously structured two-step training process. Initially, a substantial pre-training phase grounded the model in the domain of legal text, utilizing a diverse array of legal documents from multiple jurisdictions. Subsequently, the model underwent an instructional fine-tuning phase, which not only reinforced its ability to follow domain-specific instructions but also honed its comprehension skills through an eclectic mix of general and legal-specific training data.
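
A minimal sketch of that two-stage recipe with the Hugging Face Trainer is shown below, assuming Mistral 7B as the starting checkpoint (as the paper states). The dataset path, hyperparameters, and output directory are placeholders, not the paper's actual settings.

```python
# Sketch of the two-stage recipe: (1) continued causal-LM pretraining on
# legal text, (2) supervised fine-tuning on instruction data.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"  # stated starting point
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Stage 1: continued pretraining on a (placeholder) legal corpus.
legal = load_dataset("text", data_files={"train": "legal_corpus/*.txt"})["train"]
legal = legal.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="saul-pretrain",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1),
    train_dataset=legal,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# Stage 2 would repeat the same loop on instruction-formatted
# (prompt, response) pairs, e.g. with TRL's SFTTrainer.
```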

Data Collection and Curation

A cornerstone of \ourmodel{}'s development was the meticulous assembly of its training corpus. Drawing from an array of legal texts and documents, the team embarked on an extensive data collection and cleansing operation. The final dataset, encompassing 30 billion tokens, was derived from both publicly accessible legal information and strategically selected sources, ensuring a wide representation of legal knowledge and contexts.
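
The summary does not spell out the full cleaning pipeline, but a pass of this kind typically resembles the deliberately simplified, illustrative function below: normalize whitespace, drop near-empty records, and filter exact duplicates by content hash.

```python
# Illustrative cleaning pass, not the paper's exact pipeline.
import hashlib
import re

def clean_corpus(documents, min_chars=200):
    seen = set()
    for doc in documents:
        text = re.sub(r"\s+", " ", doc).strip()   # collapse whitespace
        if len(text) < min_chars:                  # drop near-empty records
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:                         # exact-duplicate filter
            continue
        seen.add(digest)
        yield text
```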

The Implications and Prospects of SaulLM-7B

Practical and Theoretical Contributions

From a practical standpoint, SaulLM-7B and SaulLM-7B-Instruct stand to significantly benefit legal professionals by providing a tool adept at handling the complexity of legal documents. Theoretical contributions, on the other hand, stem from the strides made in adapting LLMs to domain-specific requirements, thereby expanding our understanding of LLM training and application.

Future Horizons

The release of SaulLM-7B under an open license (MIT) not only promotes widespread usage and experimentation but also invites further research and development within the legal AI domain. As this field continues to evolve, future endeavors might explore improved instructional fine-tuning methods, the integration of multi-jurisdictional legal systems, and the application of legal LLMs to predictive and analytical tasks within the legal profession.

Conclusion

In essence, SaulLM-7B represents a significant stride towards realizing the potential of LLMs within the legal sector. By adeptly navigating the intricacies of legal language and providing a robust tool for document analysis, this model paves the way for a new era of legal research and practice, empowered by advanced AI capabilities. The open-source nature of SaulLM-7B further underscores the collaborative spirit of this endeavor, encouraging continued innovation and exploration at the intersection of law and artificial intelligence.

Authors (11)
  1. Pierre Colombo
  2. Telmo Pessoa Pires
  3. Malik Boudiaf
  4. Dominic Culver
  5. Rui Melo
  6. Caio Corro
  7. Fabrizio Esposito
  8. Vera Lúcia Raposo
  9. Sofia Morgado
  10. Michael Desa
  11. Andre F. T. Martins