DELTA: Pre-train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment (2403.18435v1)
Abstract: Recent research demonstrates the effectiveness of using pre-trained language models for legal case retrieval. Most existing work focuses on improving the representation ability of the contextualized embedding of the [CLS] token and calculates relevance using textual semantic similarity. However, in the legal domain, textual semantic similarity does not always imply that two cases are relevant. Instead, relevance between legal cases primarily depends on the similarity of the key facts that impact the final judgment. Without proper treatment, the discriminative ability of the learned representations can be limited, since legal cases are lengthy and contain numerous non-key facts. To this end, we introduce DELTA, a discriminative model designed for legal case retrieval. The basic idea is to pinpoint key facts in legal cases and pull the contextualized embedding of the [CLS] token closer to the key facts while pushing it away from the non-key facts, which warms up the case embedding space in an unsupervised manner. Specifically, this study brings the word alignment mechanism into the contextual masked auto-encoder. First, we leverage shallow decoders to create information bottlenecks, aiming to enhance representation ability. Second, we employ a deep decoder to enable translation between different structures, with the goal of pinpointing key facts to enhance discriminative ability. Comprehensive experiments on publicly available legal benchmarks show that our approach outperforms existing state-of-the-art methods in legal case retrieval. It provides a new perspective on the in-depth understanding and processing of legal case documents.
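The "pull closer / push away" idea described above can be sketched as a contrastive objective over the [CLS] embedding. The paper's exact loss is not given in this summary, so the snippet below is a rough illustration assuming an InfoNCE-style formulation: key-fact sentence embeddings act as positives for the [CLS] vector, non-key-fact embeddings as negatives. All function names and the temperature value are illustrative, not from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def discriminative_loss(cls_emb, key_facts, non_key_facts, tau=0.1):
    """Hypothetical InfoNCE-style loss: pull the [CLS] embedding toward
    key-fact embeddings (positives) and push it away from non-key-fact
    embeddings (negatives). Lower loss = [CLS] better aligned with key facts."""
    pos = [math.exp(cosine(cls_emb, k) / tau) for k in key_facts]
    neg_sum = sum(math.exp(cosine(cls_emb, n) / tau) for n in non_key_facts)
    # Average the per-positive -log( pos / (pos + sum of negatives) ) terms.
    losses = [-math.log(p / (p + neg_sum)) for p in pos]
    return sum(losses) / len(losses)
```

Minimizing this loss warms up the embedding space exactly as the abstract describes: a [CLS] vector that points toward its case's key facts and away from boilerplate facts incurs a near-zero loss, while one aligned with non-key facts is penalized.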
Authors: Haitao Li, Qingyao Ai, Xinyan Han, Jia Chen, Qian Dong, Yiqun Liu, Chong Chen, Qi Tian