GPT Takes the Bar Exam (2212.14402v1)

Published 29 Dec 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Nearly all jurisdictions in the United States require a professional license exam, commonly referred to as "the Bar Exam," as a precondition for law practice. To even sit for the exam, most jurisdictions require that an applicant complete at least seven years of post-secondary education, including three years at an accredited law school. In addition, most test-takers also undergo weeks to months of further, exam-specific preparation. Despite this significant investment of time and capital, approximately one in five test-takers still score under the rate required to pass the exam on their first try. In the face of a complex task that requires such depth of knowledge, what, then, should we expect of the state of the art in "AI?" In this research, we document our experimental evaluation of the performance of OpenAI's text-davinci-003 model, often referred to as GPT-3.5, on the multistate multiple choice (MBE) section of the exam. While we find no benefit in fine-tuning over GPT-3.5's zero-shot performance at the scale of our training data, we do find that hyperparameter optimization and prompt engineering positively impacted GPT-3.5's zero-shot performance. For best prompt and parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete NCBE MBE practice exam, significantly in excess of the 25% baseline guessing rate, and performs at a passing rate for both Evidence and Torts. GPT-3.5's ranking of responses is also highly correlated with correctness; its top two and top three choices are correct 71% and 88% of the time, respectively, indicating very strong non-entailment performance. While our ability to interpret these results is limited by the nascent scientific understanding of LLMs and the proprietary nature of GPT, we believe that these results strongly suggest that an LLM will pass the MBE component of the Bar Exam in the near future.

Summary

Insightful Overview of "GPT Takes the Bar Exam"

The paper "GPT Takes the Bar Exam" presents an empirical paper on the capabilities of OpenAI's LLM, text-davinci-003 (commonly referred to as GPT-3.5), in performing legal assessments, specifically focusing on the Multistate Bar Examination (MBE) component of the Bar Exam. The paper systematically evaluates the LLM's performance using zero-shot prompts, identifying influential hyperparameters and prompt engineering techniques that enhance performance.

Key Findings

The research establishes that GPT-3.5 attains a headline correct rate of approximately 50.3% on a complete NCBE MBE practice exam. This performance is well above the 25% baseline guessing rate, indicating the model's proficiency in legal language comprehension. Notably, in the Evidence and Torts categories, GPT-3.5 reaches a passing threshold, suggesting its potential for handling domain-specific tasks. Furthermore, the model's ranking of responses is highly correlated with correctness: its top two and top three choices are correct 71% and 88% of the time, respectively, indicating strong non-entailment performance.
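To make the headline statistics concrete, here is a minimal sketch of how top-k correct rates like those above can be computed from ranked answer choices; the data layout and example values are illustrative assumptions, not the paper's data.

```python
# Hedged sketch: computing top-k correct rates from ranked predictions.
# `ranked` holds each question's answer letters ordered by model confidence;
# `gold` holds the corresponding correct letters. Both are illustrative.
def top_k_rate(ranked: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of questions whose correct answer appears in the top k choices."""
    hits = sum(g in r[:k] for r, g in zip(ranked, gold))
    return hits / len(gold)

ranked = [["B", "A", "D"], ["C", "B", "A"], ["A", "C", "D"]]
gold = ["A", "C", "B"]
for k in (1, 2, 3):
    print(f"top-{k} rate: {top_k_rate(ranked, gold, k):.2f}")
```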

Methodological Approach

The experimental setup involved several iterations of prompt engineering, a critical step given the sensitivity of LLMs to input prompts. Among the tested prompt strategies, asking the model to rank-order its top three choices yielded the most accurate results, demonstrating the importance of prompt structure in maximizing the model's efficacy. Hyperparameters such as temperature and best_of were also varied to ascertain their impact on performance; a sketch of such an evaluation loop appears below. Despite fine-tuning with additional domain-specific data, the model did not exceed its zero-shot performance, highlighting the current limitations and challenges of further training large proprietary models at modest data scales.
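As a rough illustration of this methodology, the following is a minimal sketch of a zero-shot evaluation call against the legacy OpenAI Completions endpoint that served text-davinci-003 (since deprecated). The prompt wording, ranking format, and parsing logic are assumptions for illustration, not the authors' exact protocol.

```python
# Illustrative sketch of a zero-shot MBE evaluation call.
# Assumes the legacy openai-python (<1.0) Completions API that served
# text-davinci-003; the prompt template and parsing are hypothetical.
import openai

PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Choices:\n{choices}\n"
    "Rank your top three answer choices from most to least likely,\n"
    "as letters separated by commas (e.g. 'B, A, D').\n"
    "Answer:"
)

def ask_model(question: str, choices: str,
              temperature: float = 0.0, best_of: int = 1) -> list[str]:
    """Return the model's ranked answer letters for one MBE question."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=PROMPT_TEMPLATE.format(question=question, choices=choices),
        temperature=temperature,   # hyperparameter varied in the paper
        best_of=best_of,           # hyperparameter varied in the paper
        max_tokens=16,
    )
    text = response["choices"][0]["text"]
    # Parse a comma-separated ranking such as "B, A, D".
    return [token.strip() for token in text.split(",")][:3]
```

Sweeping `temperature` and `best_of` over a grid and rerunning this loop on the practice exam would reproduce the kind of hyperparameter search the paper describes, though the exact grid used is not specified here.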

Implications and Future Prospects

The capability of GPT-3.5 to perform complex language tasks such as those required in the legal domain is an encouraging indication of the pace of technological advancement in artificial intelligence. The paper highlights the potential for models like GPT-3.5 to assist in legal education and practice in the near future, especially as LLMs evolve to encompass broader and more nuanced domains of knowledge.

From a theoretical perspective, the findings underscore the need for better interpretability and understanding of LLMs' internal mechanisms, both of which are currently constrained by the proprietary nature of these models. Progress in this area could lead to more reliable and accurate model behavior in specialized contexts.

The research opens avenues for applying LLMs to other components of the Bar Exam, such as the essay (MEE) and performance test (MPT) sections, thereby expanding their versatility and impact within legal domains. Furthermore, investigating alternative model families such as BLOOM, GPT-Neo, or GPT-J could provide additional insight into domain-specific performance.

Conclusion

This paper contributes valuable insights into the performance of LLMs on legal assessments and bridges a gap in understanding how AI models might evolve to meet the demands of complex, language-intensive professional tasks. The empirical evidence suggests that with ongoing developments, there is optimism that an LLM can pass the MBE in the foreseeable future, marking a pivotal step in the fusion of artificial intelligence with legal processes.