Automatic Evaluation of Legal Writing by LLMs: A Benchmarking Study
The paper under discussion presents a comprehensive study of LLM evaluation in the legal writing domain, focusing on the Brazilian Bar Examination. Its core contribution is oab-bench, a novel benchmark crafted to challenge LLMs with tasks closely resembling real-world legal writing assessments.
Motivation and Methodological Approach
The paper addresses the challenge of assessing LLMs' performance on domain-specific, open-ended tasks. Legal writing, with its subjective and interpretative nature, is difficult to evaluate automatically and therefore requires benchmarks equipped with comprehensive grading guidelines. The essay and discursive questions of the Brazilian Bar Examination's second phase are well suited for this purpose because of their public availability, structured grading guidelines, and regular release of new editions, which minimizes the risk of data contamination.
The oab-bench benchmark consists of 105 questions across seven legal disciplines, drawn from the three most recent editions of the examination. It includes not only the questions but also the official evaluation criteria and reference materials used by human examiners, providing a consistent basis for automated grading. This structure ensures that LLMs are evaluated against the same standards applied to human candidates, yielding a reliable measure of their legal reasoning and drafting abilities.
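To make this structure concrete, the sketch below shows one plausible way to represent a single benchmark item in Python; the class and field names are hypothetical, chosen only to mirror the components described above (question text, official rubric, and examiner reference materials), not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class OabBenchItem:
    """Hypothetical representation of one oab-bench question.

    Field names are illustrative; they mirror the components described in the
    paper: the prompt, the official grading criteria, and the reference
    materials given to human examiners.
    """
    exam_edition: str               # which edition of the exam the item comes from
    discipline: str                 # one of the seven legal disciplines
    question: str                   # the essay or discursive prompt
    evaluation_criteria: list[str]  # official rubric items, each with a point value
    reference_material: str         # examiner guidance / model-answer notes
    max_score: float                # maximum points attainable for this item
```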
Experimental Evaluation and Results
Using LLMs as automated judges, particularly frontier models such as OpenAI's o1, shows promising potential for complementing or even replacing human evaluators. The experiments demonstrate that these judge models achieve a strong correlation with human-assigned scores despite the inherently subjective nature of the tasks.
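Agreement between an LLM judge and human examiners can be summarized with a standard correlation coefficient. The snippet below is a minimal sketch of that check; the score lists are invented for illustration, and the paper's exact statistic and data may differ.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Invented scores for illustration only; not data from the paper.
human_scores = [7.5, 6.0, 8.2, 5.5, 9.0, 6.8]   # grades assigned by human examiners
judge_scores = [7.2, 6.3, 8.0, 5.9, 8.7, 7.0]   # grades assigned by the LLM judge

# A coefficient close to 1.0 means the automated judge ranks and scores
# answers much like the human examiners did.
print(f"Pearson r = {correlation(human_scores, judge_scores):.3f}")
```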
Across the 21 exams in oab-bench, Claude-3.5 Sonnet achieved the strongest results, passing every exam with an average score of 7.93, which underscores its ability to understand and address complex legal issues. Models such as GPT-4o and Sabiá-3 performed moderately worse, with particular difficulties noted in areas such as Business Law. These findings highlight the variability of LLM performance across legal domains and suggest avenues for model improvement and specialization.
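As a concrete illustration of how per-exam results roll up into the averages and pass/fail outcomes reported above, the sketch below aggregates hypothetical scores. It assumes the exam's 0-10 scale and a 6.0 passing cutoff, and the per-exam numbers are invented.

```python
# Hypothetical per-exam scores for one model; oab-bench covers 21 exams in total.
exam_scores = {
    "Constitutional Law": 8.1,
    "Civil Law": 7.6,
    "Business Law": 6.4,
}

PASSING_CUTOFF = 6.0  # assumed minimum grade to pass, on the exam's 0-10 scale

average = sum(exam_scores.values()) / len(exam_scores)
passed_all = all(score >= PASSING_CUTOFF for score in exam_scores.values())

print(f"average score: {average:.2f}, passed every exam: {passed_all}")
```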
Implications and Future Directions
The capability of LLMs to serve as automated judges has substantial theoretical and practical implications. Theoretically, it broadens the understanding of LLMs in specialized domains, emphasizing the need for more intricate benchmarks like oab-bench. Practically, the development of reliable automated grading systems could significantly enhance educational and professional certification processes in law by reducing the dependency on human evaluators and increasing consistency in grading.
Future developments could focus on refining LLMs' alignment with human grading practices, for example through prompt engineering and model tuning that better capture the nuanced reasoning expected in legal writing. Moreover, extending the benchmark to a wider variety of legal systems globally could enable a more comprehensive evaluation of both the general and the domain-specific capabilities of LLMs.
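One concrete form the prompt-engineering work could take is a rubric-grounded judge prompt. The template below is only a sketch under that assumption; its wording and structure are not taken from the paper.

```python
# Illustrative judge-prompt template; the structure and wording are assumptions,
# not the prompt actually used in the paper.
JUDGE_PROMPT_TEMPLATE = """You are an examiner for the Brazilian Bar Examination.
Grade the candidate's answer strictly against the official evaluation criteria.

Question:
{question}

Official evaluation criteria (with point values):
{criteria}

Candidate answer:
{answer}

Report the points awarded for each criterion and a total score from 0 to {max_score}."""


def build_judge_prompt(question: str, criteria: str, answer: str, max_score: float) -> str:
    """Fill in the template for a single grading call to the judge model."""
    return JUDGE_PROMPT_TEMPLATE.format(
        question=question, criteria=criteria, answer=answer, max_score=max_score
    )
```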
Overall, the paper represents a significant stride towards integrating AI into legal evaluations, offering valuable insights into the challenges and opportunities involved in automating the assessment of complex, open-ended tasks.