
An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software (2509.13471v1)

Published 16 Sep 2025 in cs.SE and cs.AI

Abstract: LLMs show promise for translating natural-language statutes into executable logic, but reliability in legally critical settings remains challenging due to ambiguity and hallucinations. We present an agentic approach for developing legal-critical software, using U.S. federal tax preparation as a case study. The key challenge is test-case generation under the oracle problem, where correct outputs require interpreting law. Building on metamorphic testing, we introduce higher-order metamorphic relations that compare system outputs across structured shifts among similar individuals. Because authoring such relations is tedious and error-prone, we use an LLM-driven, role-based framework to automate test generation and code synthesis. We implement a multi-agent system that translates tax code into executable software and incorporates a metamorphic-testing agent that searches for counterexamples. In experiments, our framework using a smaller model (GPT-4o-mini) achieves a worst-case pass rate of 45%, outperforming frontier models (GPT-4o and Claude 3.5, 9-15%) on complex tax-code tasks. These results support agentic LLM methodologies as a path to robust, trustworthy legal-critical software from natural-language specifications.

Summary

  • The paper introduces an LLM agentic approach that utilizes a multi-agent framework to transform legal documents into executable tax software.
  • It leverages higher-order metamorphic testing to systematically validate complex tax rules, enhancing accuracy beyond traditional methods.
  • Empirical results demonstrate that smaller LLMs, such as GPT-4o-mini, can outperform larger models in processing intricate tax code requirements.

Introduction

The paper presents an agentic approach to developing legal-critical software with LLMs, using U.S. federal tax preparation as a case study. The goal is a consistent and correct translation of complex legal documents into executable software. The main obstacle is the oracle problem in test-case generation: deciding the correct output for a given input itself requires interpreting the law, so expected values cannot simply be looked up. To address this, the paper proposes higher-order metamorphic testing, which compares system outputs across structured shifts among similar individuals.

Synedrion: An LLM Multi-Agent Framework

The authors present Synedrion, a role-based multi-agent framework whose agents mirror real-world software development roles and together translate tax code into executable software. It includes specialized agents, among them a metamorphic-testing agent that searches for counterexamples. Running on a smaller model (GPT-4o-mini), the framework achieves a worst-case pass rate of 45% on complex tax-code tasks, compared with 9-15% for the frontier models GPT-4o and Claude 3.5.

Figure 1: Synedrion, an LLM multi-agent framework for implementing tax software from legal documents.

The framework handles complex regulatory translation through structured interaction among agents. It uses metamorphic testing to check the generated tax software for consistency with the law, relying on higher-order relations rather than simple pairwise comparisons alone.

Higher-Order Metamorphic Testing

A central contribution of the paper is the application of higher-order metamorphic testing (HMT) to tax software validation. Traditional metamorphic testing compares the outputs of a pair of related inputs, which can miss systematic errors. Higher-order testing evaluates several related test cases together and compares how the output changes across them, which exposes systematic discrepancies that no single pair reveals.

Figure 2: Schematic of Higher Order Metamorphic Relations.

The method is particularly useful for complex tax rules such as progressive tax brackets. Validation goes beyond basic monotonicity (more income never means less tax) and examines the rates of change of the computed tax across multiple income scenarios, checking them against the statutory progressive structure.
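The difference between a pairwise check and a higher-order check can be illustrated with a small Python sketch. It is not the paper's implementation: the bracket schedules, the function names (`bracket_tax`, `pairwise_monotonic`, `higher_order_progressive`) and the specific relation used (tax increments over equally spaced incomes must not shrink) are all assumptions chosen for illustration, and the figures are not real statutory values.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical schedules: (bracket lower bound, marginal rate). Not real tax figures.
PROGRESSIVE: List[Tuple[float, float]] = [(0, 0.10), (10_000, 0.20), (40_000, 0.30)]
# Faulty schedule: the middle rate dips below the first, yet every rate is still positive.
DIPPING: List[Tuple[float, float]] = [(0, 0.10), (10_000, 0.05), (40_000, 0.30)]


def bracket_tax(income: float, brackets: Sequence[Tuple[float, float]]) -> float:
    """Tax owed under a marginal-rate schedule."""
    tax = 0.0
    for i, (lower, rate) in enumerate(brackets):
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else float("inf")
        if income > lower:
            tax += (min(income, upper) - lower) * rate
    return tax


def pairwise_monotonic(tax_fn: Callable[[float], float], incomes: Sequence[float]) -> bool:
    """Pairwise relation: a higher income never yields a lower tax."""
    return all(tax_fn(a) <= tax_fn(b) for a, b in zip(incomes, incomes[1:]))


def higher_order_progressive(
    tax_fn: Callable[[float], float], incomes: Sequence[float], tol: float = 1e-6
) -> bool:
    """Higher-order relation over equally spaced incomes: each tax increment
    must be at least as large as the previous one (marginal rates never fall)."""
    taxes = [tax_fn(x) for x in incomes]
    steps = [b - a for a, b in zip(taxes, taxes[1:])]
    return all(s1 <= s2 + tol for s1, s2 in zip(steps, steps[1:]))


incomes = list(range(0, 60_001, 5_000))  # equally spaced income scenarios
for name, schedule in [("progressive", PROGRESSIVE), ("dipping", DIPPING)]:
    fn = lambda x, s=schedule: bracket_tax(x, s)
    print(name, pairwise_monotonic(fn, incomes), higher_order_progressive(fn, incomes))
# prints: progressive True True
# prints: dipping True False
```

The faulty schedule passes the pairwise monotonicity check because tax still rises with income, but it fails the higher-order check because the increment between 10,000 and 15,000 is smaller than the one before it. Neither check needs an oracle for the exact tax owed.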

Practical Implementation and Results

In practical terms, the multi-agent framework automates tax code generation from legal documents through various LLM-based agents, each contributing to stages like interpretation, specification conversion, and code refinement. Synedrion’s performance is empirically analyzed using benchmarks across six tax scenarios, indicating substantial advantages over single-model approaches, especially in complex cases.
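The interplay between code refinement and counterexample search can be sketched as a simple loop. This is a hedged illustration only: the summary does not describe Synedrion's actual control flow, the functions `find_counterexample`, `repair`, and `refine` are hypothetical, and the LLM-backed agents are replaced by hand-written stand-ins so that the example runs without any model.

```python
from typing import Callable, Optional, Tuple

TaxFn = Callable[[float], float]


def find_counterexample(tax_fn: TaxFn) -> Optional[Tuple[float, float]]:
    """Stand-in for a testing agent: look for two incomes that violate
    monotonicity (the higher income is taxed less)."""
    incomes = [float(x) for x in range(0, 50_001, 5_000)]
    for low, high in zip(incomes, incomes[1:]):
        if tax_fn(low) > tax_fn(high):
            return (low, high)
    return None


def repair(tax_fn: TaxFn, counterexample: Tuple[float, float]) -> TaxFn:
    """Stand-in for a code-refinement agent. A real agent would presumably
    receive the counterexample in its prompt; here a fixed rule is returned."""
    return lambda income: 0.10 * max(income - 10_000.0, 0.0)


def refine(draft: TaxFn, max_rounds: int = 3) -> Tuple[TaxFn, int]:
    """Alternate testing and repair until no counterexample is found or the
    round budget runs out. Returns the candidate and the repair rounds used.
    If the budget runs out, the last candidate is returned untested."""
    candidate = draft
    for rounds_used in range(max_rounds):
        cex = find_counterexample(candidate)
        if cex is None:
            return candidate, rounds_used
        candidate = repair(candidate, cex)
    return candidate, max_rounds


def first_draft(income: float) -> float:
    # Deliberately wrong draft: tax drops to zero above 30,000.
    return 0.10 * income if income <= 30_000 else 0.0


final_fn, rounds = refine(first_draft)
print(find_counterexample(first_draft), rounds)
# prints: (30000.0, 35000.0) 1
```

Keeping the tester and the repairer behind plain function signatures means either stand-in could be swapped for a model-backed agent without changing the loop.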

Agent collaboration allowed a smaller LLM to handle tasks that would typically call for a significantly larger model. Within Synedrion, GPT-4o-mini produced robust results in scenarios involving intricate deductions and credit computations, reaching a worst-case pass rate of 45% where the frontier models GPT-4o and Claude 3.5 reached 9-15%.

Implications and Future Directions

The implications of this research extend to legal-critical systems beyond tax software. The agentic framework, with its integrated metamorphic testing, offers a structured pathway for developing compliant software in domains that demand high legal precision, such as healthcare and finance.

Future research can explore scaling this agentic architecture to incorporate additional forms of legal documentation and expand metamorphic testing categories. Enhancing agent collaboration with varying LLM capabilities could further improve efficiency and output integrity.

Conclusion

The framework demonstrates considerable progress in automating the translation of legal documents into reliable, executable tax software, and it sets a precedent for building legal-critical systems with agentic LLM methodologies. Higher-order metamorphic testing is central to this result because it enables validation without an oracle for the legally correct output. Future studies may broaden the approach to other areas of regulatory technology.
