Automatic Evaluation of Legal Writing by LLMs: A Benchmarking Study
The paper under discussion presents a comprehensive study of LLM evaluation in the legal writing domain, focusing on the Brazilian Bar Examination. Its core contribution is oab-bench, a novel benchmark crafted to challenge LLMs with tasks closely resembling real-world legal writing assessments.
Motivation and Methodological Approach
The paper addresses the challenge of assessing LLMs' performance on domain-specific, open-ended tasks. Legal writing, with its subjective and interpretative nature, is difficult to evaluate automatically and therefore requires benchmarks equipped with comprehensive grading guidelines. The essay and discursive questions of the Brazilian Bar Examination's second phase are well suited for this purpose because of their public availability, structured grading guidelines, and regular release of new editions, which minimizes the risk of data contamination.
The oab-bench benchmark consists of 105 questions across seven legal disciplines, drawn from the three most recent editions of the examination. It includes not only the questions but also the official evaluation criteria and reference materials used by human examiners, providing a consistent basis for automated grading. This structure ensures that LLMs are evaluated against the same standards applied to human candidates, yielding a reliable measure of their legal reasoning and drafting abilities.
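To make this structure concrete, the sketch below shows one plausible way to represent a single benchmark item in Python; the class and field names are hypothetical, chosen only to mirror the components described above (question text, official rubric, and examiner reference materials), not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class OabBenchItem:
    """Hypothetical representation of one oab-bench question.

    Field names are illustrative; they mirror the components described in the
    paper: the prompt, the official grading criteria, and the reference
    materials given to human examiners.
    """
    exam_edition: str               # which edition of the exam the item comes from
    discipline: str                 # one of the seven legal disciplines
    question: str                   # the essay or discursive prompt
    evaluation_criteria: list[str]  # official rubric items, each with a point value
    reference_material: str         # examiner guidance / model-answer notes
    max_score: float                # maximum points attainable for this item
```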
Experimental Evaluation and Results
Using LLMs as automated judges, particularly frontier models such as OpenAI's o1, shows promising potential for complementing or even replacing human evaluators. The experiments demonstrate that these judge models achieve a strong correlation with human-assigned scores despite the inherently subjective nature of the tasks.
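Agreement between an LLM judge and human examiners can be summarized with a standard correlation coefficient. The snippet below is a minimal sketch of that check; the score lists are invented for illustration, and the paper's exact statistic and data may differ.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Invented scores for illustration only; not data from the paper.
human_scores = [7.5, 6.0, 8.2, 5.5, 9.0, 6.8]   # grades assigned by human examiners
judge_scores = [7.2, 6.3, 8.0, 5.9, 8.7, 7.0]   # grades assigned by the LLM judge

# A coefficient close to 1.0 means the automated judge ranks and scores
# answers much like the human examiners did.
print(f"Pearson r = {correlation(human_scores, judge_scores):.3f}")
```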
Across the 21 exams in oab-bench, Claude-3.5 Sonnet achieved the strongest results, passing every exam with an average score of 7.93, which underscores its ability to understand and address complex legal issues. Models such as GPT-4o and Sabiá-3 performed moderately worse, with particular difficulties noted in areas such as Business Law. These findings highlight the variability of LLM performance across legal domains and suggest avenues for model improvement and specialization.
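As a concrete illustration of how per-exam results roll up into the averages and pass/fail outcomes reported above, the sketch below aggregates hypothetical scores. It assumes the exam's 0-10 scale and a 6.0 passing cutoff, and the per-exam numbers are invented.

```python
# Hypothetical per-exam scores for one model; oab-bench covers 21 exams in total.
exam_scores = {
    "Constitutional Law": 8.1,
    "Civil Law": 7.6,
    "Business Law": 6.4,
}

PASSING_CUTOFF = 6.0  # assumed minimum grade to pass, on the exam's 0-10 scale

average = sum(exam_scores.values()) / len(exam_scores)
passed_all = all(score >= PASSING_CUTOFF for score in exam_scores.values())

print(f"average score: {average:.2f}, passed every exam: {passed_all}")
```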
Implications and Future Directions
The capability of LLMs to serve as automated judges has substantial theoretical and practical implications. Theoretically, it broadens the understanding of LLMs in specialized domains, emphasizing the need for more intricate benchmarks like oab-bench. Practically, the development of reliable automated grading systems could significantly enhance educational and professional certification processes in law by reducing the dependency on human evaluators and increasing consistency in grading.
Future developments could focus on refining LLMs' alignment with human grading practices, for example through prompt engineering and model tuning that better capture the nuanced reasoning expected in legal writing. Moreover, extending the benchmark to a wider variety of legal systems globally could enable a more comprehensive evaluation of both the general and the domain-specific capabilities of LLMs.
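One concrete form the prompt-engineering work could take is a rubric-grounded judge prompt. The template below is only a sketch under that assumption; its wording and structure are not taken from the paper.

```python
# Illustrative judge-prompt template; the structure and wording are assumptions,
# not the prompt actually used in the paper.
JUDGE_PROMPT_TEMPLATE = """You are an examiner for the Brazilian Bar Examination.
Grade the candidate's answer strictly against the official evaluation criteria.

Question:
{question}

Official evaluation criteria (with point values):
{criteria}

Candidate answer:
{answer}

Report the points awarded for each criterion and a total score from 0 to {max_score}."""


def build_judge_prompt(question: str, criteria: str, answer: str, max_score: float) -> str:
    """Fill in the template for a single grading call to the judge model."""
    return JUDGE_PROMPT_TEMPLATE.format(
        question=question, criteria=criteria, answer=answer, max_score=max_score
    )
```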
Overall, the paper represents a significant stride towards integrating AI into legal evaluations, offering valuable insights into the challenges and opportunities involved in automating the assessment of complex, open-ended tasks.