Enhancing Transformers for Long Legal Document Processing: Modding LegalBERT and Longformer
Introduction
Processing long legal documents presents unique challenges for NLP. Standard transformer models such as BERT are limited by their input length, typically capped at 512 sub-word tokens, which is far too short for many legal documents. Sparse-attention models such as Longformer and BigBird extend this capacity but still cannot cover extra-long documents in their entirety without truncation. Against this backdrop, this paper explores methods to adapt and extend transformer models, specifically LegalBERT and Longformer, to handle long legal texts more effectively.
Approaches to Long Document Processing
This paper investigates two main strategies for handling long legal documents more effectively:
- Adapting Longformer with Legal Pre-training: The paper experiments with a Longformer model warm-started from LegalBERT and able to handle texts of up to 8,192 sub-words, aiming to retain the legal domain knowledge encapsulated in LegalBERT while extending the input length capacity (the first sketch after this list illustrates the idea).
- Modifying LegalBERT with TF-IDF Representations: The second approach augments LegalBERT with Term Frequency-Inverse Document Frequency (TF-IDF) features, letting the model cover longer texts indirectly by prioritizing the most relevant textual input according to its TF-IDF scores (see the second sketch after this list).
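The conversion itself is not spelled out above. A common recipe for building a longer-input encoder from an existing BERT-style checkpoint is to copy its weights, tile its 512 learned position embeddings out to the target length, and then swap full self-attention for Longformer's sliding-window attention before continuing pre-training. The sketch below illustrates only the position-embedding stretch; the public nlpaueb/legal-bert-base-uncased checkpoint and the exact tiling scheme are assumptions, not the authors' released conversion code.

```python
# Minimal sketch (assumed details, not the authors' code): warm-start from a
# public LegalBERT checkpoint and stretch its position embeddings to 8,192.
import torch
from transformers import AutoModel, AutoTokenizer

MAX_POS = 8192  # target input length in sub-word tokens

bert = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained(
    "nlpaueb/legal-bert-base-uncased", model_max_length=MAX_POS
)

# Tile the 512 learned position embeddings 16 times to fill 8,192 positions.
old_pos = bert.embeddings.position_embeddings.weight.data   # (512, hidden_size)
new_pos = old_pos.repeat(MAX_POS // old_pos.size(0), 1)     # (8192, hidden_size)

bert.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(
    new_pos, freeze=False
)
# Refresh the cached index buffers so forward passes accept longer inputs.
bert.embeddings.position_ids = torch.arange(MAX_POS).unsqueeze(0)
bert.embeddings.token_type_ids = torch.zeros(1, MAX_POS, dtype=torch.long)
bert.config.max_position_embeddings = MAX_POS

# A full conversion would also replace each layer's dense self-attention with
# Longformer's sliding-window (plus global) attention and continue pre-training.
```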
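The TF-IDF side can be read in more than one way; one plausible reading is to score the sentences of a long document with TF-IDF and keep only the highest-scoring ones until LegalBERT's original 512-token budget is filled. The sentence-level granularity, the mean-weight scoring, and the helper name select_by_tfidf below are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged illustration: TF-IDF-based selection of the most informative sentences
# so that the retained text fits in LegalBERT's 512 sub-word budget.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
MAX_TOKENS = 512  # LegalBERT's original input budget

def select_by_tfidf(sentences, training_corpus):
    """Rank one document's sentences by mean TF-IDF weight and keep the
    top-ranked ones, in their original order, until the token budget is hit."""
    vectorizer = TfidfVectorizer().fit(training_corpus)
    weights = vectorizer.transform(sentences)                 # (n_sentences, vocab)
    scores = np.asarray(weights.mean(axis=1)).ravel()
    ranked = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)

    kept, used = [], 0
    for i in ranked:
        n_tokens = len(tokenizer.tokenize(sentences[i]))
        if used + n_tokens > MAX_TOKENS:
            continue
        kept.append(i)
        used += n_tokens
    return " ".join(sentences[i] for i in sorted(kept))
```

A caller would split each document into sentences with any sentence splitter, fit the vectorizer on the training corpus, and then feed the returned shortened text to LegalBERT exactly as it would feed a short document.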
Key Findings
The modified Longformer, warm-started from LegalBERT and capable of processing up to 8,192 sub-words, achieved the best performance on LexGLUE long document classification tasks, outperforming the hierarchical version of LegalBERT. The introduction of TF-IDF modifications to LegalBERT, while not surpassing the tailored Longformer, still demonstrated considerable efficiency improvements over the linear SVM baseline when handling long legal texts.
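For context, LexGLUE is distributed through the Hugging Face datasets hub, so this kind of long-document evaluation can be set up in a few lines; the dataset identifier lex_glue, its scotus configuration, and the 8,192-token truncation length below reflect a typical setup, not the paper's exact evaluation script.

```python
# Hedged example: load LexGLUE's SCOTUS subset (long-document classification)
# and tokenize one opinion at the extended 8,192-token length.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("lex_glue", "scotus")
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")

example = dataset["train"][0]
encoding = tokenizer(example["text"], truncation=True, max_length=8192)
print(len(encoding["input_ids"]), "sub-word tokens, label:", example["label"])
```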
Implications and Future Directions
These findings open new avenues for processing long legal documents with pre-trained transformers, pointing to gains in both performance and computational efficiency. The success of the adapted Longformer underscores the importance of domain-specific pre-training and the benefit of extending input length for complex text classification tasks. The TF-IDF-augmented LegalBERT offers a middle ground between efficiency and performance, leveraging a traditional NLP technique within a modern transformer framework to handle long texts better.
Future work may explore further optimizations and pre-training strategies tailored to the unique demands of legal document processing. Experimentation with additional sparse attention mechanisms and the integration of richer contextual embeddings could yield further improvements. Additionally, testing these adapted models on a broader range of legal NLP tasks beyond classification may illuminate their versatility and limitations, guiding the development of more robust solutions for legal text analysis.
Conclusion
This paper contributes to the ongoing exploration of adapting and enhancing transformer models for specialized domains such as legal document processing. By extending the capabilities of both LegalBERT and Longformer to accommodate longer texts, the research addresses a significant limitation in current NLP approaches to legal text and opens the door to more sophisticated and effective tools for legal practitioners and researchers alike.