Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents
The paper introduces Lawformer, a Longformer-based pre-trained language model designed for the Chinese legal domain, particularly for understanding long legal documents. Lawformer addresses a central challenge in LegalAI: legal documents typically exceed the 512-token input limit of mainstream pre-trained language models such as BERT and RoBERTa.
Motivation and Model Design
LegalAI leverages AI technologies, notably NLP, to improve the efficiency of legal systems. Given the complexity and length of legal documents, general-domain pre-trained language models, which excel on generic text, perform poorly on legal texts that exceed their maximum input length. Lawformer combines local sliding-window attention with task-driven global full attention to handle long sequences efficiently. This attention pattern scales linearly with sequence length, in contrast to the quadratic cost of the full self-attention used in standard Transformers.
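To make the attention pattern concrete, below is a minimal sketch (not the authors' implementation) of how a Longformer-style mask combines a local sliding window with a few task-driven global positions; the window size and global positions are illustrative assumptions.

```python
# Minimal sketch: a Longformer-style attention pattern with a local sliding
# window plus a few globally attending positions. Values are illustrative.
import torch

def attention_pattern(seq_len: int, window: int, global_positions: list[int]) -> torch.Tensor:
    """Return a boolean mask where mask[i, j] == True means token i may attend to token j."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    half = window // 2
    # Local attention: each token attends to neighbours within +/- window // 2.
    for i in range(seq_len):
        lo, hi = max(0, i - half), min(seq_len, i + half + 1)
        mask[i, lo:hi] = True
    # Global attention: selected tokens (e.g. [CLS] or question tokens) attend
    # to every position, and every position attends back to them.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# For a 4096-token document, the local pattern costs O(seq_len * window)
# instead of the O(seq_len ** 2) of full self-attention.
pattern = attention_pattern(seq_len=4096, window=512, global_positions=[0])
print(pattern.float().mean())  # fraction of attended pairs, far below 1.0
```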
Pre-training and Model Evaluation
Lawformer is pre-trained on a large collection of Chinese criminal and civil case documents. The authors initialize the model from a Chinese RoBERTa checkpoint and continue pre-training with the masked language modeling objective, adapting it to legal content. The pre-training corpus is processed to reflect the real-world distribution of legal texts.
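As a rough illustration of this continued pre-training step, the sketch below runs masked language modeling on legal text starting from a Chinese RoBERTa checkpoint using Hugging Face Transformers. The checkpoint name, corpus file, and hyperparameters are assumptions, and the conversion to the sliding-window attention pattern described above is omitted.

```python
# Hedged sketch of continued masked-language-model pre-training on legal text.
# Not the authors' training code: checkpoint name, corpus file, and
# hyperparameters are illustrative assumptions, and the Longformer-style
# attention conversion / long-input extension is omitted (hence max_length=512).
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

checkpoint = "hfl/chinese-roberta-wwm-ext"  # assumed Chinese RoBERTa starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical corpus of case documents, one document per line.
dataset = load_dataset("text", data_files={"train": "legal_cases.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lawformer-mlm",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```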
The model is evaluated on multiple LegalAI tasks:
- Judgment Prediction: On newly constructed datasets, Lawformer outperforms the baselines, especially when long-distance context is critical for accurate predictions (see the usage sketch after this list).
- Legal Case Retrieval: Lawformer outperforms traditional models on the LeCaRD dataset, demonstrating its ability to retrieve relevant cases from documents spanning thousands of tokens.
- Reading Comprehension: On the CJRC dataset, Lawformer exhibits performance gains attributed to its in-domain adaptation, although its results remain close to those of models pre-trained on non-legal corpora.
- Question Answering: Evaluated on the JEC-QA dataset, Lawformer shows notable improvements, attributed to its capacity to reason over long legal passages.
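For illustration, the following hedged sketch loads the released Lawformer checkpoint and attaches a small classification head, roughly how one might set up a judgment (charge) prediction baseline. The checkpoint identifier thunlp/Lawformer, the label count, and the classification head are assumptions for this sketch, not the paper's exact configuration.

```python
# Hedged sketch: encoding a long case description with a Lawformer checkpoint
# and classifying from the [CLS] position. The checkpoint name, label set, and
# head are illustrative assumptions, not the authors' exact setup.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("thunlp/Lawformer")  # assumed released checkpoint
encoder = AutoModel.from_pretrained("thunlp/Lawformer")
classifier = torch.nn.Linear(encoder.config.hidden_size, 10)  # 10 hypothetical charge labels

fact = "公诉机关指控……"  # a (possibly very long) fact description
inputs = tokenizer(fact, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)
    logits = classifier(hidden[:, 0])             # classify from the [CLS] position
print(logits.softmax(dim=-1))
```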
Implications and Future Work
The introduction of Lawformer advances LegalAI by enabling the processing and understanding of long legal documents, previously a major obstacle. The paper also provides a practical recipe for adapting large pre-trained models to domain-specific requirements, illustrating the potential of extended-context modeling.
The authors suggest further exploration in knowledge-augmented legal pre-training and generative models for legal tasks, aiming to embed legal reasoning and domain-specific knowledge more effectively. Such efforts may transform the operational landscape of legal practices by automating and enhancing document comprehension and case analysis.
In conclusion, Lawformer marks an important step in bridging the gap between general NLP advances and domain-specific applications, particularly in the legal sector. This work paves the way for developing more nuanced, context-aware language models tailored to domain-specific challenges.