- The paper introduces MASER, a multi-agent simulation framework that generates synthetic legal interaction data (SynthLaw) from real case documents to train Large Language Models (LLMs).
- LLMs fine-tuned on SynthLaw significantly outperform baseline models on the new Multi-stage Interactive Legal Evaluation (MILE) benchmark, which assesses interactive legal task performance.
- MASER provides a scalable solution to legal interactive data scarcity, enabling the development of more capable legal AI systems for dynamic client interactions and complex task completion.
MASER: A Multi-Agent Simulation Framework for Legal LLM Development
The paper "Multi-Agent Simulator Drives LLMs for Legal Intensive Interaction" (2502.06882) introduces a novel framework, the Multi-agent Legal Simulation Driver (MASER), designed to address the data scarcity challenge in training LLMs for interactive legal scenarios. MASER leverages real-world legal case data to generate synthetic dialogues between a Lawyer agent and a Client agent, supervised by a third Supervisor agent. The resulting synthetic dataset, SynthLaw, is then used to fine-tune LLMs, significantly improving their performance in interactive legal tasks. The paper also introduces the Multi-stage Interactive Legal Evaluation (MILE) benchmark for evaluating LLMs in dynamic legal settings.
Methodology: MASER Framework Components
The MASER framework consists of two primary stages, Role Agent Presetting and Multi-Agent Legal Simulation, followed by a training stage that uses the generated SynthLaw dataset.
Role Agent Presetting
This stage ensures character authenticity and legal consistency among the agents. GPT-4o extracts key elements from 4,532 real Chinese civil judgment documents, including plaintiff/defendant information, claims, case details, evidence, legal provisions, and analysis. These elements establish a consistent legal context for the simulation. Each agent is then configured as follows:
- Client: Assigned personal and case information derived from the legal source, plus a manually defined level of "legal sense". Personality diversity is achieved by mapping high/medium/low levels of the Big-5 traits (Extraversion, Emotional Stability, Openness, Agreeableness, Conscientiousness) to a speaking style (logic, clarity, tone) and interactivity parameters, both generated via GPT-4o (see the profile sketch after this list).
- Lawyer: Assigned case analysis and applicable laws from the same legal source as the client, representing their prior legal knowledge. A manually designed legal agenda is also assigned to the Lawyer, outlining the steps required for the specific legal task (e.g., complaint drafting).
- Supervisor: Possesses all information relevant to the case and the agents, enabling oversight of the interaction.
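To make the presetting concrete, the following is a minimal Python sketch of how the three role profiles could be represented; the class and field names are illustrative assumptions, not the authors' actual schema.

```python
from dataclasses import dataclass, field

# Big-5 dimensions whose high/medium/low levels condition GPT-4o when it
# generates the Client's speaking style and interactivity (names assumed).
BIG5 = ("extraversion", "emotional_stability", "openness",
        "agreeableness", "conscientiousness")

@dataclass
class ClientProfile:
    personal_info: dict          # plaintiff details from the judgment
    case_details: dict           # claims, facts, evidence
    big5_levels: dict            # e.g. {"extraversion": "high", ...}
    speaking_style: str = ""     # logic/clarity/tone, generated by GPT-4o
    interactivity: str = ""      # interaction tendencies, via GPT-4o
    legal_sense: str = "low"     # manually defined legal literacy

@dataclass
class LawyerProfile:
    case_analysis: str           # prior legal knowledge from the source
    applicable_laws: list        # relevant legal provisions
    agenda: list                 # manually designed task steps

@dataclass
class SupervisorProfile:
    # The Supervisor holds all case and agent information, plus the
    # predefined distractors it must watch for during the dialogue.
    client: ClientProfile
    lawyer: LawyerProfile
    distractors: list = field(default_factory=list)
```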
Multi-Agent Legal Simulation
In this stage, realistic, multi-turn interaction dialogues are generated, culminating in the completion of the legal task. Agent behaviors are powered by LLMs such as GPT-4o during the simulation.
- Client Behaviors: The Client cooperates, speaks in a style reflecting its personality, shows curiosity (asking questions), and exhibits distraction behaviors (omitting details, being vague).
- Lawyer Behaviors: The Lawyer follows the pre-defined legal agenda, applies its prior legal knowledge flexibly, and reacts to client distractions by asking clarifying questions.
- Supervisor Behaviors: The Supervisor oversees the interaction at the sentence level, ensuring profile-behavior alignment (checking that the speaker's response matches its preset profile) and distractor alignment (guiding agents to handle predefined distractors correctly). A correction mechanism provides natural-language feedback to the speaker (Client or Lawyer) if a response is inconsistent or incorrect, prompting revision.
The simulation follows a turn-based interaction flow initiated by the Client. The Lawyer guides the conversation according to the legal agenda. The Supervisor interacts only with the current speaker, providing feedback and potentially triggering response revision. The interaction concludes when the Lawyer determines the inquiry is complete or the maximum number of turns is reached. Finally, the Lawyer generates the complaint based on the interaction history.
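The turn-taking and supervision logic can be summarized in a short sketch; `simulate` and the agent methods (`respond`, `review`, `revise`, `inquiry_complete`, `draft_complaint`) are hypothetical wrappers around LLM calls, not the paper's implementation.

```python
def simulate(client, lawyer, supervisor, max_turns=20):
    """Hypothetical MASER-style loop: the Client opens each turn, the
    Lawyer follows its agenda, and the Supervisor reviews every
    utterance for profile-behavior and distractor alignment."""
    history = []
    for _ in range(max_turns):
        for speaker in (client, lawyer):
            response = speaker.respond(history)
            # Sentence-level oversight: inconsistent or incorrect
            # responses get natural-language feedback and are revised.
            feedback = supervisor.review(speaker, response, history)
            while feedback is not None:
                response = speaker.revise(response, feedback)
                feedback = supervisor.review(speaker, response, history)
            history.append((speaker.role, response))
        if lawyer.inquiry_complete(history):
            break
    # The Lawyer drafts the final complaint from the dialogue history.
    return history, lawyer.draft_complaint(history)
```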
Training
The simulation outputs (dialogue history + final complaint) form the SynthLaw dataset (4,532 samples), supplemented with additional legal Q&A data. A base LLM (e.g., Qwen2.5-instruct-7B) is fine-tuned on SynthLaw with a standard supervised language-modeling objective to predict the Lawyer's responses.
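The paper summary does not spell out the loss computation; the sketch below assumes the common SFT recipe of a causal language-modeling loss masked to the Lawyer's tokens, using PyTorch and a Hugging Face-style model interface.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # tokens excluded from the cross-entropy loss

def build_labels(input_ids, lawyer_token_mask):
    """Keep the loss only on Lawyer-response tokens; client turns and
    the dialogue context act as conditioning, not as targets."""
    labels = input_ids.clone()
    labels[~lawyer_token_mask] = IGNORE_INDEX
    return labels

def sft_loss(model, input_ids, lawyer_token_mask):
    labels = build_labels(input_ids, lawyer_token_mask)
    # Standard next-token prediction: shift logits and labels by one.
    logits = model(input_ids).logits[:, :-1, :]
    targets = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```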
MILE Benchmark for Evaluation
The Multi-stage Interactive Legal Evaluation (MILE) benchmark is designed to evaluate an LLM's capability as a Lawyer agent within a dynamic interactive setting.
Dataset Construction
MILE comprises 693 distinct complaint drafting scenarios derived from 2024 Chinese civil judgment documents disjoint from those used to build the MASER training data. GPT-4o processes these documents into client profiles and reference complaints.
Evaluation Setup
The LLM being evaluated acts as the Lawyer. A powerful LLM (GPT-4o) simulates the Client, guided by its profile and overseen by a Supervisor (also GPT-4o) to ensure consistent client behavior during evaluation.
Evaluation Stages
MILE employs a multi-stage evaluation approach:
- Interaction Evaluation: Assesses the quality of the interaction process using a fine-grained, 2-turn-window approach (see the scoring sketch after this list). Metrics evaluated by GPT-4o (scale 1-10) include:
- Interactivity: Active participation, asking clarifying questions.
- Professionality: Use of legal terms, citing laws, offering strategies.
- Logicality: Maintaining coherent conversation flow.
- Goal Evaluation: Assesses the quality of the final output (the drafted complaint). Evaluation is performed both locally and globally.
- Local Evaluation: Assesses the accuracy of individual complaint sections (Client Info (CLI), Defendant Info (DEF), Fact & Reason (F&R), Claims (CLA), Evidence (EVID)). CLI/DEF use exact matching; others use GPT-4o scoring (1-10) against a ground truth complaint.
- Global Evaluation: Assesses overall quality via GPT-4o (scale 1-10) across:
- Standardability (STA): Adherence to the required document template/format.
- Professionalism (PROF): Correctness and appropriateness of legal language.
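Putting the stages together, here is a sketch of how MILE-style scores could be aggregated; the `judge` object stands in for GPT-4o grading calls, and the windowing and averaging details are assumptions rather than the paper's exact procedure.

```python
def interaction_scores(dialogue, judge):
    """Average the judge's 1-10 ratings per metric over 2-turn windows
    (the non-overlapping windowing here is an illustrative assumption)."""
    metrics = ("interactivity", "professionality", "logicality")
    windows = [dialogue[i:i + 2] for i in range(0, len(dialogue) - 1, 2)]
    totals = {m: 0.0 for m in metrics}
    for window in windows:
        ratings = judge.rate(window, metrics)  # {metric: score in 1..10}
        for m in metrics:
            totals[m] += ratings[m]
    return {m: totals[m] / max(len(windows), 1) for m in metrics}

def local_goal_scores(complaint, reference, judge):
    """CLI and DEF are exact-matched against the reference complaint;
    the remaining sections are judge-scored on a 1-10 scale."""
    scores = {
        "CLI": float(complaint["CLI"] == reference["CLI"]),
        "DEF": float(complaint["DEF"] == reference["DEF"]),
    }
    for section in ("F&R", "CLA", "EVID"):
        scores[section] = judge.rate_section(complaint[section],
                                             reference[section])
    return scores
```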
The paper presents extensive experimental results demonstrating the effectiveness of the MASER framework. LLMs fine-tuned on MASER-generated data (SynthLaw models) significantly outperform strong baselines on the MILE benchmark, including proprietary models such as GPT-4o and specialized legal LLMs such as LawLLM, with particularly large gains on the final legal goal of producing a high-quality complaint.
Impact and Significance
The MASER framework offers a scalable solution to the scarcity of interactive legal data, crucial for training more capable legal AI systems. The paper demonstrates that simulation-driven training can significantly improve LLMs' ability to handle dynamic interactions, follow procedural agendas, manage client uncertainties, and achieve specific legal goals. The MILE benchmark provides a more realistic evaluation paradigm for legal LLMs compared to static benchmarks, assessing practical skills needed in real-world legal services. This work represents a significant step towards developing legal AI systems capable of actively participating in information gathering and task completion in collaboration with a human user, mimicking aspects of legal consultation or client intake. The authors suggest the MASER framework could be adapted for other complex, specialized domains where interactive data is scarce but simulation grounded in real-world knowledge is feasible (e.g., healthcare, finance).
Conclusion
In summary, the paper "Multi-Agent Simulator Drives LLMs for Legal Intensive Interaction" introduces MASER, a simulation engine for generating legal interaction data, and MILE, a benchmark for evaluating LLMs in dynamic legal scenarios. The methodology's strengths lie in grounding the simulation in real legal cases and in its behavioral diversity and supervision mechanisms. The results demonstrate a viable path to training LLMs for complex, interactive legal tasks, advancing the development of more practical and capable intelligent legal systems.