Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction (2502.06882v1)

Published 8 Feb 2025 in cs.CL and cs.AI

Abstract: LLMs have significantly advanced legal intelligence, but the scarcity of scenario data impedes the progress toward interactive legal scenarios. This paper introduces a Multi-agent Legal Simulation Driver (MASER) to scalably generate synthetic data by simulating interactive legal scenarios. Leveraging real-legal case sources, MASER ensures the consistency of legal attributes between participants and introduces a supervisory mechanism to align participants' characters and behaviors as well as addressing distractions. A Multi-stage Interactive Legal Evaluation (MILE) benchmark is further constructed to evaluate LLMs' performance in dynamic legal scenarios. Extensive experiments confirm the effectiveness of our framework.

Summary

  • The paper introduces MASER, a multi-agent simulation framework that generates synthetic legal interaction data (SynthLaw) from real case documents to train Large Language Models (LLMs).
  • LLMs fine-tuned on SynthLaw significantly outperform baseline models on the new Multi-stage Interactive Legal Evaluation (MILE) benchmark, which assesses interactive legal task performance.
  • MASER provides a scalable solution to legal interactive data scarcity, enabling the development of more capable legal AI systems for dynamic client interactions and complex task completion.

The paper "Multi-Agent Simulator Drives LLMs for Legal Intensive Interaction" (2502.06882) introduces a novel framework, the Multi-agent Legal Simulation Driver (MASER), designed to address the data scarcity challenge in training LLMs for interactive legal scenarios. MASER leverages real-world legal case data to generate synthetic dialogues between a Lawyer agent and a Client agent, supervised by a third Supervisor agent. The resulting synthetic dataset, SynthLaw, is then used to fine-tune LLMs, significantly improving their performance in interactive legal tasks. The paper also introduces the Multi-stage Interactive Legal Evaluation (MILE) benchmark for evaluating LLMs in dynamic legal settings.

Methodology: MASER Framework Components

The MASER framework consists of two primary stages, Role Agent Presetting and Multi-Agent Legal Simulation, followed by a training stage that fine-tunes LLMs on the generated dataset, SynthLaw.

Role Agent Presetting

This stage focuses on ensuring character authenticity and legal consistency among the agents. Real-world legal data is extracted from 4,532 Chinese civil judgment documents using GPT-4o, identifying key elements such as plaintiff/defendant information, claims, case details, evidence, legal provisions, and analysis. These elements establish a consistent legal context for the simulation. Each agent is then configured as follows (a data-structure sketch follows the list):

  • Client: Assigned personal/case information derived from the legal source, a Big-5 personality profile, and a manually defined level of "legal sense". Personality diversity is achieved by mapping high/medium/low levels of each Big-5 trait (Extraversion, Emotional Stability, Openness, Agreeableness, Conscientiousness) to speaking style (logic, clarity, tone) and interactivity parameters generated via GPT-4o.
  • Lawyer: Assigned case analysis and applicable laws from the same legal source as the client, representing their prior legal knowledge. A manually designed legal agenda is also assigned to the Lawyer, outlining the steps required for the specific legal task (e.g., complaint drafting).
  • Supervisor: Possesses all information relevant to the case and the agents, enabling oversight of the interaction.
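
The following is a minimal Python sketch of how these three profiles might be represented; the dataclass and field names are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

# Hypothetical structure for the elements GPT-4o extracts from a
# judgment document; exact field names are assumptions for illustration.
@dataclass
class CaseRecord:
    plaintiff_info: str
    defendant_info: str
    claims: list[str]
    case_details: str
    evidence: list[str]
    legal_provisions: list[str]
    analysis: str

@dataclass
class ClientProfile:
    personal_info: str        # derived from the legal source
    case_info: str
    big5: dict[str, str]      # e.g. {"Extraversion": "high", ...}
    speaking_style: str       # generated via GPT-4o from Big-5 levels
    legal_sense: str          # manually defined: "high" / "medium" / "low"

@dataclass
class LawyerProfile:
    case_analysis: str        # prior legal knowledge from the same source
    applicable_laws: list[str]
    agenda: list[str]         # manually designed steps, e.g. for complaint drafting

@dataclass
class SupervisorProfile:
    # The Supervisor sees everything relevant to the case and both agents.
    case: CaseRecord
    client: ClientProfile
    lawyer: LawyerProfile
```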

Multi-Agent Legal Simulation

In this stage, realistic multi-turn dialogues are generated, culminating in the completion of the legal task. Agent behaviors are powered by LLMs such as GPT-4o during the simulation.

  • Client Behaviors: Exhibit cooperation, a unique communication style reflecting their personality, curiosity (asking questions), and distraction behaviors (missing details, vagueness).
  • Lawyer Behaviors: Follow the pre-defined legal agenda, apply prior legal knowledge flexibly, and react to client distractions by asking clarifying questions.
  • Supervisor Behaviors: Oversee the interaction at the sentence level, ensuring profile-behavior alignment (checking that the speaker's response matches their preset profile) and distractor alignment (guiding agents to handle predefined distractors correctly). A correction mechanism provides natural-language feedback to the speaker (Client or Lawyer) if their response is inconsistent or incorrect, prompting revision.

The simulation follows a turn-based interaction flow initiated by the Client. The Lawyer guides the conversation according to the legal agenda. The Supervisor interacts only with the current speaker, providing feedback and potentially triggering response revision. The interaction concludes when the Lawyer determines the inquiry is complete or the maximum number of turns is reached. Finally, the Lawyer generates the complaint based on the interaction history.
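
A minimal sketch of this turn-based loop with sentence-level supervision is shown below. Here `chat` is a placeholder for any LLM API call (e.g., to GPT-4o), and the prompts, revision budget, and `[INQUIRY_COMPLETE]` termination signal are illustrative assumptions rather than the paper's exact implementation:

```python
MAX_TURNS = 20
MAX_REVISIONS = 2

def chat(system_prompt: str, history: list[dict]) -> str:
    raise NotImplementedError("plug in your LLM client here")

def supervised_response(speaker_prompt: str, supervisor_prompt: str,
                        history: list[dict]) -> str:
    response = chat(speaker_prompt, history)
    for _ in range(MAX_REVISIONS):
        # The Supervisor checks profile-behavior and distractor alignment;
        # it replies "OK" or gives natural-language feedback.
        feedback = chat(supervisor_prompt,
                        history + [{"role": "speaker", "content": response}])
        if feedback.strip() == "OK":
            break
        # Feedback goes back to the speaker, prompting a revision.
        response = chat(speaker_prompt + f"\nSupervisor feedback: {feedback}",
                        history)
    return response

def simulate(client_prompt: str, lawyer_prompt: str, supervisor_prompt: str):
    history: list[dict] = []
    for _ in range(MAX_TURNS):
        # The Client initiates; speakers then alternate.
        client_msg = supervised_response(client_prompt, supervisor_prompt, history)
        history.append({"role": "client", "content": client_msg})
        lawyer_msg = supervised_response(lawyer_prompt, supervisor_prompt, history)
        history.append({"role": "lawyer", "content": lawyer_msg})
        if "[INQUIRY_COMPLETE]" in lawyer_msg:  # assumed termination signal
            break
    # Finally, the Lawyer drafts the complaint from the full history.
    complaint = chat(lawyer_prompt + "\nDraft the complaint now.", history)
    return history, complaint
```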

Training

The simulation outputs (dialogue history + final complaint) form the SynthLaw dataset (4,532 samples). Additional legal Q&A data is also collected. A base LLM (e.g., Qwen2.5-instruct-7B) is fine-tuned on SynthLaw using a standard supervised language-modeling objective to predict the Lawyer's responses.
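
A sketch of this fine-tuning setup, with next-token loss computed only on the Lawyer's turns, is shown below. It uses Hugging Face conventions; the label-masking scheme and serialization format are illustrative assumptions:

```python
import torch
from transformers import AutoTokenizer

# e.g. tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
IGNORE_INDEX = -100  # tokens with this label contribute no loss

def build_example(tokenizer, dialogue):
    """dialogue: list of (speaker, text) pairs, ending with the complaint."""
    input_ids, labels = [], []
    for speaker, text in dialogue:
        ids = tokenizer(f"{speaker}: {text}\n",
                        add_special_tokens=False).input_ids
        input_ids.extend(ids)
        # Supervise only what the Lawyer says (including the final complaint);
        # client turns serve as context only.
        labels.extend(ids if speaker == "lawyer"
                      else [IGNORE_INDEX] * len(ids))
    return {"input_ids": torch.tensor(input_ids),
            "labels": torch.tensor(labels)}
```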

MILE Benchmark for Evaluation

The Multi-stage Interactive Legal Evaluation (MILE) benchmark is designed to evaluate an LLM's capability as a Lawyer agent within a dynamic interactive setting.

Dataset Construction

MILE comprises 693 distinct complaint-drafting scenarios derived from 2024 Chinese civil judgment documents that are disjoint from those used to build MASER's training data. GPT-4o is used to process these documents into client profiles and reference complaints.

Evaluation Setup

The LLM being evaluated acts as the Lawyer. A powerful LLM (GPT-4o) simulates the Client, guided by its profile and overseen by a Supervisor (also GPT-4o) to ensure consistent client behavior during evaluation.

Evaluation Stages

MILE employs a multi-stage evaluation approach (a scoring sketch follows the list):

  • Interaction Evaluation: Assesses the quality of the interaction process using a fine-grained (2-turn window) approach. Metrics evaluated by GPT-4o (scale 1-10) include:
    • Interactivity: Active participation, asking clarifying questions.
    • Professionality: Use of legal terms, citing laws, offering strategies.
    • Logicality: Maintaining coherent conversation flow.
  • Goal Evaluation: Assesses the quality of the final output (the drafted complaint), both locally and globally.
    • Local Evaluation: Assesses the accuracy of individual complaint sections (Client Info (CLI), Defendant Info (DEF), Fact & Reason (F&R), Claims (CLA), Evidence (EVID)). CLI/DEF use exact matching; the other sections use GPT-4o scoring (1-10) against a ground-truth complaint.
    • Global Evaluation: Assesses overall quality via GPT-4o (scale 1-10) along two dimensions:
      • Standardability (STA): Adherence to the required document template/format.
      • Professionalism (PROF): Correctness and appropriateness of legal language.
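
The sketch below illustrates this style of scoring with an LLM judge. The `judge` call, prompt wording, and non-overlapping 2-turn windows are assumptions for illustration, not the paper's exact protocol:

```python
def judge(prompt: str) -> float:
    """Stand-in for a GPT-4o call returning a 1-10 score."""
    raise NotImplementedError

def interaction_score(history: list[dict], dimension: str) -> float:
    # Score each 2-turn window on one dimension (Interactivity,
    # Professionality, or Logicality), then average over windows.
    scores = []
    for i in range(0, len(history) - 1, 2):
        window = history[i:i + 2]
        scores.append(judge(f"Rate the {dimension} (1-10) of:\n{window}"))
    return sum(scores) / max(len(scores), 1)

def local_goal_score(complaint: dict, reference: dict) -> dict:
    scores = {}
    # Client/defendant info: exact matching against the ground truth.
    for section in ("CLI", "DEF"):
        scores[section] = float(complaint[section] == reference[section])
    # Remaining sections: LLM-judged against the reference complaint.
    for section in ("F&R", "CLA", "EVID"):
        scores[section] = judge(
            f"Rate 1-10 how well this {section} section matches the "
            f"reference:\ncandidate: {complaint[section]}\n"
            f"reference: {reference[section]}")
    return scores
```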

Experimental Results and Performance

The paper presents extensive experimental results demonstrating the effectiveness of the MASER framework. LLMs fine-tuned on data generated by MASER (SynthLaw models) significantly outperform strong baseline LLMs, including proprietary models like GPT-4o and specialized legal LLMs like LawLLM, on the MILE benchmark. The SynthLaw models demonstrate superior performance, particularly in achieving the final legal goal of producing high-quality complaints.

Impact and Significance

The MASER framework offers a scalable solution to the scarcity of interactive legal data, crucial for training more capable legal AI systems. The paper demonstrates that simulation-driven training can significantly improve LLMs' ability to handle dynamic interactions, follow procedural agendas, manage client uncertainties, and achieve specific legal goals. The MILE benchmark provides a more realistic evaluation paradigm for legal LLMs compared to static benchmarks, assessing practical skills needed in real-world legal services. This work represents a significant step towards developing legal AI systems capable of actively participating in information gathering and task completion in collaboration with a human user, mimicking aspects of legal consultation or client intake. The authors suggest the MASER framework could be adapted for other complex, specialized domains where interactive data is scarce but simulation grounded in real-world knowledge is feasible (e.g., healthcare, finance).

Conclusion

In summary, the paper "Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction" introduces MASER, a sophisticated simulation engine for generating legal interaction data, and MILE, a novel benchmark for evaluating LLMs in such dynamic scenarios. The methodology's strengths lie in grounding the simulation in real legal cases and in incorporating behavioral diversity and supervisory control mechanisms. The results demonstrate a viable path to training LLMs for complex, interactive legal tasks, thereby advancing the development of more practical and capable legal intelligent systems.
