Discovering Significant Topics from Legal Decisions with Selective Inference (2401.01068v1)
Abstract: We propose and evaluate an automated pipeline for discovering significant topics from legal decision texts by passing features synthesized with topic models through penalised regressions and post-selection significance tests. The method identifies case topics significantly correlated with outcomes, topic-word distributions that can be manually interpreted to gain insight into those topics, and case-topic weights that can be used to identify representative cases for each topic. We demonstrate the method on a new dataset of domain name disputes and a canonical dataset of European Court of Human Rights violation cases. Topic models based on latent semantic analysis as well as LLM embeddings are evaluated. We show that topics derived by the pipeline are consistent with legal doctrines in both areas and can be useful in other related legal analysis tasks.