- The paper introduces an LLM agentic approach that utilizes a multi-agent framework to transform legal documents into executable tax software.
- It leverages higher-order metamorphic testing to systematically validate complex tax rules, enhancing accuracy beyond traditional methods.
- Empirical results demonstrate that smaller LLMs, such as GPT-4o-mini, can outperform larger models in processing intricate tax code requirements.
An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software
Introduction
The paper discusses an innovative approach leveraging LLMs for developing legal-critical software, specifically focusing on U.S. federal tax preparation software. This paper introduces an agentic approach to overcome challenges such as ensuring consistency and correctness in translating complex legal documents into executable software. Key challenges include the oracle problem in test case generation due to legal interpretation requirements and ensuring reliable outputs across varying inputs. The paper proposes a higher-order metamorphic testing framework to tackle these challenges.
Synedrion: An LLM Multi-Agent Framework
The authors have developed Synedrion, a multi-agent framework that simulates real-world software development roles in legal document processing. This framework includes specialized agents like the metamorphic testing agent. The novelty lies in using smaller LLMs like GPT-4o-mini, which outperformed frontier models in specific complex tax code generation tasks, boasting a worst-case pass rate of 45% compared to benchmarks struggling at 9%-15%.
Figure 1: Synedrion---an LLM multi-agent framework for implementing tax software from legal documents.
The framework addresses complex regulatory translations through structured interaction among agents, leveraging metamorphic testing to validate tax software against legal consistency with broader higher-order relations instead of simpler, more error-prone pairwise comparisons.
A unique contribution of this paper is the application of higher-order metamorphic testing (HMT) to tax software validation. Traditional metamorphic testing relies on pairwise comparisons that might overlook systematic errors. Higher-order testing involves evaluating multiple related test cases simultaneously, which captures systematic discrepancies more effectively.
Figure 2: Schematic of Higher Order Metamorphic Relations.
This method particularly benefits complex tax rules, such as progressive tax brackets, by ensuring the validation of tax software goes beyond basic monotonicity and examines the statutory progressive structure's rates of change across multiple income scenarios.
Practical Implementation and Results
In practical terms, the multi-agent framework automates tax code generation from legal documents through various LLM-based agents, each contributing to stages like interpretation, specification conversion, and code refinement. Synedrion’s performance is empirically analyzed using benchmarks across six tax scenarios, indicating substantial advantages over single-model approaches, especially in complex cases.
The use of agent collaboration demonstrated that smaller LLMs could effectively handle task complexity that typically required significantly larger models. For instance, the GPT-4o-mini model within Synedrion produced robust results in scenarios involving intricate deductions and credit computations, achieving a remarkable enhancement in lower-bound accuracy.
Implications and Future Directions
The implications of this research extend significantly into the development of legal-critical systems beyond tax software. The agentic framework, with its metamorphic testing integration, offers a structured pathway for developing compliant, error-averse software in domains requiring high legal precision, such as healthcare and finance.
Future research can explore scaling this agentic architecture to incorporate additional forms of legal documentation and expand metamorphic testing categories. Enhancing agent collaboration with varying LLM capabilities could further improve efficiency and output integrity.
Conclusion
The framework presented demonstrates considerable progress in automating the translation of legal documents into reliable, executable tax software, setting a precedent for developing legal-critical systems with LLM-interaction driven methodologies. Higher-order metamorphic testing plays a crucial role in this success, ensuring comprehensive validation. Future studies may enhance this approach's breadth, providing robust, scalable solutions within regulatory technology landscapes.