High Recall, Small Data: The Challenges of Within-System Evaluation in a Live Legal Search System (2403.18962v1)
Abstract: This paper illustrates several challenges of common ranking evaluation methods for legal information retrieval (IR). We demonstrate these challenges with log data from a live legal search system and two user studies. We first give an overview of the characteristics of legal IR and their implications for common evaluation methods: test collections based on explicit and implicit feedback, user surveys, and A/B testing. Next, we illustrate the challenges of these evaluation methods using data from a live, commercial legal search engine. We focus specifically on methods for monitoring the effectiveness of (continuous) changes to document ranking by a single IR system over time. We show how the combination of legal IR system characteristics and limited user data leads to challenges that render the common evaluation methods discussed sub-optimal. In our future work we will therefore focus on less common evaluation methods, such as cost-based evaluation models.