LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain (2301.13126v3)
Abstract: Lately, propelled by phenomenal advances around the transformer architecture, the legal NLP field has enjoyed spectacular growth. To measure progress, well-curated and challenging benchmarks are crucial. However, most benchmarks are English-only, and in legal NLP specifically no multilingual benchmark is available yet. Additionally, many benchmarks are saturated, with the best models clearly outperforming the best humans and achieving near-perfect scores. We survey the legal NLP literature and select 11 datasets covering 24 languages, creating LEXTREME. To provide a fair comparison, we propose two aggregate scores, one based on the datasets and one on the languages. The best baseline (XLM-R large) achieves both a dataset aggregate score and a language aggregate score of 61.3. This indicates that LEXTREME is still very challenging and leaves ample room for improvement. To make it easy for researchers and practitioners to use, we release LEXTREME on Hugging Face together with all the code required to evaluate models, and a public Weights and Biases project with all the runs.
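A minimal sketch of how such aggregate scores can be computed, assuming per-dataset and per-language scores for a model are already available. The dataset names and score values below are purely illustrative, and the harmonic mean is shown as one common aggregation choice; the exact aggregation procedure used by LEXTREME may differ from this sketch.

```python
from statistics import harmonic_mean

# Hypothetical per-dataset scores for one model (illustrative numbers only).
dataset_scores = {
    "greek_legal_code": 0.72,
    "swiss_judgment_prediction": 0.58,
    "online_terms_of_service": 0.84,
    "covid19_emergency_event": 0.61,
}

# Hypothetical per-language scores for the same model.
language_scores = {"de": 0.63, "fr": 0.60, "it": 0.59, "en": 0.70}

# The harmonic mean penalizes models that score well on average but fail
# badly on a few datasets or languages, which is one reason benchmarks
# sometimes prefer it over the arithmetic mean for aggregation.
dataset_aggregate = harmonic_mean(dataset_scores.values())
language_aggregate = harmonic_mean(language_scores.values())

print(f"dataset aggregate:  {dataset_aggregate:.3f}")
print(f"language aggregate: {language_aggregate:.3f}")
```

Reporting two aggregates rather than one makes it harder for a model to look strong by excelling only on a few large datasets or on high-resource languages.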
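Since the benchmark is released on the Hugging Face Hub, individual task configurations can presumably be loaded with the `datasets` library. The repository id and configuration name in the sketch below are assumptions and should be checked against the published dataset card.

```python
from datasets import load_dataset

# NOTE: repository id and config name are assumptions; consult the
# LEXTREME dataset card on the Hugging Face Hub for exact identifiers.
dataset = load_dataset("joelito/lextreme", "swiss_judgment_prediction")

print(dataset)              # DatasetDict with its available splits
print(dataset["train"][0])  # one example: input text plus its label
```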
Authors: Joel Niklaus, Veton Matoshi, Pooja Rani, Andrea Galassi, Matthias Stürmer, Ilias Chalkidis