Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models (2505.02763v1)

Published 5 May 2025 in cs.CL, cs.AI, and cs.CY

Abstract: Legal practice requires careful adherence to procedural rules. In the United States, few are more complex than those found in The Bluebook: A Uniform System of Citation. Compliance with this system's 500+ pages of byzantine formatting instructions is the raison d'etre of thousands of student law review editors and the bete noire of lawyers everywhere. To evaluate whether LLMs are able to adhere to the procedures of such a complicated system, we construct an original dataset of 866 Bluebook tasks and test flagship LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek. We show (1) that these models produce fully compliant Bluebook citations only 69%-74% of the time and (2) that in-context learning on the Bluebook's underlying system of rules raises accuracy only to 77%. These results caution against using off-the-shelf LLMs to automate aspects of the law where fidelity to procedure is paramount.

Summary

The research by Matthew Dahl assesses whether contemporary LLMs can automate aspects of legal procedure by adhering to the intricate rules of The Bluebook: A Uniform System of Citation. LLMs have the potential to transform many facets of legal practice, but their procedural reliability remains in question. The paper measures how accurately LLMs execute tasks governed by the Bluebook, a citation manual notorious for its complexity and used widely throughout the United States legal system.

Methodology and Analysis

Dahl tests multiple flagship LLMs, including OpenAI's GPT-4.1 and Google's Gemini 2.5 Flash, on an original dataset of 866 Bluebook-specific tasks. The tasks are divided into case law tasks, enacted law tasks, and other categories, and each model is evaluated on whether its generated citations comply with the Bluebook's detailed standards.
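
To make the task format concrete, the sketch below shows what a single evaluation record might look like; the field names, example prompt, gold citation, and italics convention are illustrative assumptions, not the paper's actual data schema.

```python
# Purely illustrative representation of one Bluebook task; the paper's actual
# dataset format is not reproduced here, so treat these fields as assumptions.
task = {
    "category": "case law",  # other assumed categories: "enacted law", "other"
    "prompt": (
        "Provide the full Bluebook citation for Brown v. Board of Education, "
        "347 U.S. 483, decided by the Supreme Court in 1954."
    ),
    # Asterisks mark italics purely for illustration.
    "gold": "*Brown v. Board of Education*, 347 U.S. 483 (1954)",
}

print(task["category"], "->", task["gold"])
```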

The paper assesses LLMs in a zero-shot setting and explores the potential benefits of in-context learning, leveraging the rules of the Indigo Book, a public-domain counterpart to the Bluebook. In the zero-shot setting, the model must produce a correct citation from the task description alone, with no examples or rule text in the prompt; in the in-context setting, the relevant rules are supplied as part of the prompt to guide the model.
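
A minimal sketch of the difference between these two prompting conditions is shown below, assuming a generic text-completion call; the prompt wording, the task, and the rule excerpt are invented for illustration and are not taken from the paper or the Indigo Book.

```python
# Minimal sketch of zero-shot vs. in-context prompting for a citation task.
# The task text and rule excerpt below are invented for illustration.

TASK = (
    "Provide the full Bluebook citation for Marbury v. Madison, "
    "5 U.S. 137, decided by the Supreme Court in 1803."
)

# Hypothetical stand-in for an excerpt of the Indigo Book's public-domain rules.
RULES_EXCERPT = "Cite cases as: Case Name, Volume Reporter First-Page (Court Year)."

def zero_shot_prompt(task: str) -> str:
    """Zero-shot: the model sees only the task, with no rules or examples."""
    return f"{task}\nAnswer with the citation only."

def in_context_prompt(task: str, rules: str) -> str:
    """In-context learning: the relevant rules are placed in the prompt before the task."""
    return (
        "Follow these citation rules when answering:\n"
        f"{rules}\n\n{task}\nAnswer with the citation only."
    )

print(zero_shot_prompt(TASK))
print(in_context_prompt(TASK, RULES_EXCERPT))
```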

Key Findings

  1. Zero-shot Performance: The models achieve only moderate success, with accuracy ranging from 69% to 74% when only substantive errors are penalized. When strict italicization compliance is also enforced, these rates drop markedly, exposing the models' difficulty with the finer points of the Bluebook's rules (a minimal scoring sketch follows this list).
  2. Memorization Concerns: Models perform better on citation forms they appear to have memorized, suggesting a reliance on memorization rather than genuine rule comprehension. This is especially evident for well-known cases, where models benefit from pre-existing knowledge rather than applying the rules that govern citation format.
  3. In-Context Learning Limitations: Supplying the Indigo Book's explicit rules in context yields only modest gains, raising accuracy to about 77%. Comprehending and applying a complex system of citation rules remains beyond current LLMs, even for models with large context windows.
  4. Task-specific Variability: The models' proficiency varies across citation types. They perform better on basic case law components but struggle with statutory and regulatory citations, which involve more diverse and intricate rules.
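
As a rough illustration of the strict versus substantive-only scoring distinction, the sketch below compares a model's citation against a gold answer under both regimes; the normalization choices and the asterisk convention for italics are assumptions made for illustration, not the paper's actual evaluation code.

```python
# Hypothetical compliance check contrasting strict scoring (italics must match)
# with lenient scoring (only substantive errors are penalized). Asterisks mark
# italics purely as an illustrative convention.

def normalize(citation: str, ignore_italics: bool) -> str:
    """Collapse whitespace and optionally drop italics markers before comparing."""
    text = " ".join(citation.split())
    if ignore_italics:
        text = text.replace("*", "")
    return text

def is_compliant(model_output: str, gold: str, ignore_italics: bool = False) -> bool:
    """A citation counts as compliant only if it matches the gold answer exactly."""
    return normalize(model_output, ignore_italics) == normalize(gold, ignore_italics)

gold = "*Roe v. Wade*, 410 U.S. 113 (1973)"
pred = "Roe v. Wade, 410 U.S. 113 (1973)"  # substantively correct, italics omitted

print(is_compliant(pred, gold, ignore_italics=True))   # True  (lenient scoring)
print(is_compliant(pred, gold, ignore_italics=False))  # False (strict scoring)
```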

Implications

These findings affirm that, despite technological advances, LLMs are not yet equipped to take over the labor-intensive work of Bluebook compliance. The models' current limitations highlight substantial challenges in deploying LLMs for procedural tasks that require strict adherence to established legal norms.

Future Directions

Continued research is needed to improve LLM performance in rule-based settings, for example through fine-tuning and more careful prompt engineering. Improvements in long-context processing could also increase models' utility for legal procedure. For now, the paper underscores the importance of human oversight in legal applications of AI, particularly for tasks demanding high fidelity to procedural standards.

Conclusion

The paper contributes significantly to the discourse on AI in law, providing empirical insights into the procedural inadequacies of current LLMs. It cautions against the premature automation of legal practices that heavily depend on procedural precision, advocating for collaborative human-AI systems where machine learning supports rather than replaces human legal expertise.
