Caseformer: Pre-training for Legal Case Retrieval Based on Inter-Case Distinctions (2311.00333v2)

Published 1 Nov 2023 in cs.IR

Abstract: Legal case retrieval aims to help legal workers find cases relevant to the case at hand, which is important for guaranteeing fairness and justice in legal judgments. While recent advances in neural retrieval methods have significantly improved performance on open-domain retrieval tasks (e.g., Web search), their advantages have not been observed in legal case retrieval because they require large amounts of annotated data. As annotating large-scale training data in legal domains is prohibitively expensive due to the need for domain expertise, traditional search techniques based on lexical matching, such as TF-IDF, BM25, and Query Likelihood, remain prevalent in legal case retrieval systems. While previous studies have designed several pre-training methods for IR models in open-domain tasks, these methods are usually suboptimal in legal case retrieval because they cannot understand and capture the key knowledge and data structures of the legal corpus. To this end, we propose a novel pre-training framework named Caseformer that enables pre-trained models to learn legal knowledge and domain-specific relevance information for legal case retrieval without any human-labeled data. Through three unsupervised learning tasks, Caseformer captures the special language, document structure, and relevance patterns of legal case documents, making it a strong backbone for downstream legal case retrieval tasks. Experimental results show that our model achieves state-of-the-art performance in both zero-shot and full-data fine-tuning settings. Moreover, experiments on Chinese and English legal datasets demonstrate that the effectiveness of Caseformer is language-independent.
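
The abstract contrasts neural retrievers with the lexical-matching baselines (TF-IDF, BM25, Query Likelihood) that still dominate legal case retrieval. As a minimal, self-contained sketch of what such lexical matching looks like in practice (not the paper's code; the function, toy corpus, and parameter defaults are illustrative assumptions), the snippet below scores candidate case documents against a query with the standard Okapi BM25 formula:

```python
import math
from collections import Counter

def bm25_scores(query_terms, corpus, k1=1.2, b=0.75):
    """Score each document in `corpus` (lists of tokens) against `query_terms`
    using the Okapi BM25 formula with common default parameters."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for doc in corpus if t in doc) for t in set(query_terms)}
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Lucene-style smoothed IDF (the +1 keeps it non-negative).
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores

# Toy usage: rank two short "case documents" for a query (hypothetical data).
corpus = [["contract", "breach", "damages"], ["theft", "criminal", "sentence"]]
print(bm25_scores(["contract", "damages"], corpus))
```

Such baselines rank cases purely by term overlap; Caseformer's contribution, by contrast, is a set of unsupervised pre-training tasks that let a neural encoder learn legal-domain relevance signals without labeled data.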

Authors (8)
  1. Weihang Su (27 papers)
  2. Qingyao Ai (113 papers)
  3. Yueyue Wu (18 papers)
  4. Yixiao Ma (11 papers)
  5. Haitao Li (65 papers)
  6. Yiqun Liu (131 papers)
  7. Zhijing Wu (21 papers)
  8. Min Zhang (630 papers)
Citations (6)