- The paper finds that LLM judge models are persuadable to a statistically significant degree, with Advocate win rates deviating from the 50% baseline by more than 40 percentage points in specific pairings.
- The study employs a trilateral design that contrasts advocate arguments with and without substantive legal content to isolate persuasive effects.
- The results indicate that model size and reasoning architecture affect how autonomously a Judge model decides, so deployment in legal settings must balance responsiveness to argument with fairness.
Introduction
The integration of LLMs into legal decision-making—either as decision-support tools or as first-instance adjudicators—necessitates an examination not just of their accuracy, but of their reasoning processes and their susceptibility to being influenced by argumentation. This paper critically evaluates LLMs’ persuadability in the context of legal decision-making, where adjudicators are expected to engage fairly with arguments presented by opposing advocates, but not be unduly swayed by rhetorical strengths independent of legal merit. The analysis centers on the tension between the normatively desirable openness to argument (persuadability) and the requirement for judicial autonomy and judgment.
Methodology and Experimental Design
The authors operationalize persuadability in a trilateral legal setting, mirroring real-world legal disputes: two "Advocate" LLMs generate arguments for opposing sides of hard legal questions, and a "Judge" LLM renders a decision based on the scenario and the arguments. The scenarios are drawn from appellate court split decisions across three Anglophone jurisdictions (US, UK, Ireland), ensuring that the legal questions presented are genuinely difficult, with no uncontested ground truth.
Advocate models (GPT-4o, GPT-5.1, Gemini-3-Pro, Claude Sonnet 4.5) generate arguments with two contextual variations: either with only the factual background or with the added legal arguments from the original case. This design permits isolation of the effect of novel (potentially higher-quality) argumentation versus more rhetorically compelling restatements. Judge models include a spectrum of open and closed-weight LLMs, varying in size and reasoning capabilities, subjected to 24,000 total scenario trials.
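The trilateral setup described above can be sketched as a single trial function. This is a minimal illustration, not the authors' harness: the `Advocate` and `Judge` callable signatures, the context labels in `CONTEXTS`, and the stub models are all hypothetical names introduced here so that the example runs without any model API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures (assumptions, not taken from the paper):
#   advocate(scenario, side, context) -> argument text
#   judge(scenario, argument_a, argument_b) -> "A" or "B"
Advocate = Callable[[str, str, str], str]
Judge = Callable[[str, str, str], str]

# Hypothetical labels for the two contextual variations.
CONTEXTS = ("facts_only", "facts_plus_original_arguments")


@dataclass(frozen=True)
class TrialResult:
    scenario: str
    context: str
    winner: str  # "A" or "B"


def run_trial(scenario: str, context: str,
              advocate_a: Advocate, advocate_b: Advocate,
              judge: Judge) -> TrialResult:
    """Run one trial: two advocates argue opposing sides, one judge decides."""
    if context not in CONTEXTS:
        raise ValueError(f"unknown context: {context!r}")
    argument_a = advocate_a(scenario, "A", context)
    argument_b = advocate_b(scenario, "B", context)
    winner = judge(scenario, argument_a, argument_b)
    if winner not in ("A", "B"):
        raise ValueError(f"judge returned an invalid side: {winner!r}")
    return TrialResult(scenario, context, winner)


# Stand-in stubs so the sketch runs offline.
def terse_advocate(scenario: str, side: str, context: str) -> str:
    return f"Side {side} should prevail."


def verbose_advocate(scenario: str, side: str, context: str) -> str:
    return f"Side {side} should prevail because the facts of '{scenario}' support it."


def length_biased_judge(scenario: str, argument_a: str, argument_b: str) -> str:
    # Deliberately flawed judge: always sides with the longer argument.
    return "A" if len(argument_a) >= len(argument_b) else "B"


result = run_trial("hypothetical dispute", "facts_only",
                   verbose_advocate, terse_advocate, length_biased_judge)
print(result.winner)
```

The length-biased stub is an extreme case of a judge swayed by form alone: its decision follows whichever advocate writes more, regardless of which side that advocate argues.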
Persuadability is quantified both in terms of pairwise (between pairs of Advocates) and population-level differences in decision rates, using the deviation from the null expectation of 50% Advocate win rates under non-persuadability.
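As a rough illustration of the pairwise measure, the sketch below treats persuadability for one Advocate pairing as the absolute deviation of the first Advocate's win rate from 50%, and takes the maximum over pairings. This reading is an assumption based on the description above; the paper's exact formulas, and its population-level measure, are not reproduced here. The function names and the example counts are hypothetical.

```python
def pairwise_persuadability(wins_first: int, wins_second: int) -> float:
    """Absolute deviation of the first advocate's win rate from 0.5 (assumed definition)."""
    if wins_first < 0 or wins_second < 0:
        raise ValueError("win counts must be non-negative")
    total = wins_first + wins_second
    if total == 0:
        raise ValueError("need at least one decided trial")
    return abs(wins_first / total - 0.5)


def max_pairwise_persuadability(
    win_counts: dict[tuple[str, str], tuple[int, int]],
) -> tuple[tuple[str, str], float]:
    """Return the pairing with the largest deviation, together with that deviation."""
    if not win_counts:
        raise ValueError("no pairings supplied")
    deviations = {
        pair: pairwise_persuadability(*counts)
        for pair, counts in win_counts.items()
    }
    best_pair = max(deviations, key=deviations.get)
    return best_pair, deviations[best_pair]


# Illustrative counts only; these are not results from the paper.
example_counts = {
    ("advocate_1", "advocate_2"): (55, 45),
    ("advocate_1", "advocate_3"): (90, 10),
    ("advocate_2", "advocate_3"): (70, 30),
}
pair, deviation = max_pairwise_persuadability(example_counts)
print(pair, round(deviation, 3))
```

Under this reading, a Judge that is indifferent to who argues each side yields deviations near zero for every pairing, while a 90% win rate in one pairing corresponds to a deviation of 40 percentage points.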
Key Findings
General Persuadability Trends
All evaluated Judge LLMs (open and closed, large and small) are persuadable to a statistically significant degree, with population persuadability (ppop) ranging from 8% to over 20%, and maximum pairwise persuadability (p2max) exceeding 40% in some model configurations. This implies that Advocate model identity, as a proxy for argument quality, systematically affects outcomes even when the underlying legal merits are constant. The strongest Advocates can win well over two-thirds of the time and, in some Judge models, over 90% of the time in certain pairings.
Notably, larger and more capable Judge models tend to be less persuadable on average than smaller models, but this relation is not monotonic, and exceptions arise, particularly in minimal-reasoning configurations. These results align only partially with prior studies that found stronger resistance to persuasion in larger models (Bozdag et al., 3 Mar 2025).
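The significance claim above can be made concrete with a standard test of whether an observed win rate differs from 50%. The summary does not say which test the authors used, so the exact two-sided binomial test below is an assumption, and `binomial_two_sided_p` is a hypothetical helper name. It relies on the symmetry of the binomial distribution at p = 0.5.

```python
from math import comb


def binomial_two_sided_p(wins: int, trials: int) -> float:
    """Exact two-sided p-value for H0: win probability = 0.5."""
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("require trials > 0 and 0 <= wins <= trials")
    # By symmetry at p = 0.5, fold the observation into the upper tail.
    k = max(wins, trials - wins)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2**trials
    # Doubling the tail can exceed 1 when the observation sits at the centre.
    return min(1.0, 2 * upper_tail)


# Illustrative inputs only; these are not counts from the paper.
print(binomial_two_sided_p(5, 10))    # perfectly balanced outcome
print(binomial_two_sided_p(10, 10))   # one advocate wins every trial
```

A small p-value for a given pairing indicates that the win rate is unlikely under a Judge that ignores Advocate identity. When many pairings are tested, a multiple-comparison correction would normally be applied as well.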
Reasoning Architecture and Model Size
The effect of explicit reasoning augmentation (chain-of-thought/thinking budget) on persuadability is heterogeneous. In larger closed-weight models, increased reasoning correlates with decreased persuadability, but in smaller and open-weight models, reasoning sometimes increases persuadability. The authors hypothesize that this reflects differences in baseline evaluation capacity: when Judge models cannot differentiate argument quality, persuadability appears low, but this is not normatively positive—in such cases, the model is simply unresponsive to merit.
Substance Versus Rhetoric
Differences in persuadability with and without the case's original legal arguments are marginal but directionally consistent with a substantive effect: providing higher-quality legal argument tends, on average, to reduce reliance on mere rhetorical form, although the effect sizes are small and mostly non-significant. Head-to-head trials pitting an Advocate model with access to the original arguments against the same model without that access confirm a mild but systematic persuasive advantage for arguments with more substantive legal content.
Jurisdictional analysis supports this: persuadability is lowest in Irish cases (presumed least well-covered in LLM pretraining), intermediate for the UK, and highest for the US (where LLMs’ prior knowledge is strongest). This hierarchy suggests persuasion is not solely driven by rhetorical formality, but also by access to legal substance not encoded in model priors.
Implications for Legal AI Deployment
The findings underscore that LLMs are neither trivially rigid nor excessively impressionable, but occupy a spectrum where persuadability is model-dependent. For practitioners deploying LLMs in adjudicatory or support contexts, the degree and locus of persuadability are critical: small or under-specified models may fail to evaluate arguments adequately and so become inappropriately insensitive to legal merit. Larger, more capable models demonstrate greater autonomy but can still, on occasion, be swayed by strong advocacy beyond what is appropriate for legal fairness.
Theoretically, these results complicate the prospect of LLMs as replacements for, or robust assistants to, human judges. The tension between responsiveness to new legal reasoning and resistance to rhetorical dominance remains unresolved and implicates both the architecture of legal AI systems and the sociotechnical processes involved in their deployment. These findings sit within ongoing debates about process-oriented criteria for legitimacy in legal AI, including answerability, transparency, and safeguards against arbitrary decision-making [4319969].
Directions for Future Research
Key open questions include: (1) which specific argumentative features (novel legal points, factual subtleties, rhetorical structure) are most influential in persuading LLM “judges”; (2) the extent to which persuadability correlates with improved accuracy, reasonableness, or legitimacy of legal outcomes; and (3) empirical comparisons between LLM persuadability and human judicial reasoning, including potential for adversarial exploitation or bias.
Further, substantive differences in persuadability across jurisdictions and model architectures suggest the need for domain-adaptive finetuning and explainability mechanisms if LLMs are to be trusted as legal decision tools. The use of multi-agent frameworks for both persuasion and resistance to persuasion (as in Bozdag et al., 3 Mar 2025; 2604.26233) is a promising avenue for robust evaluation and alignment of future legal AI systems.
Conclusion
This study provides evidence that state-of-the-art LLMs exhibit significant persuadability in legal decision scenarios, with systematic effects from argument quality, model size, and reasoning architecture. The results affirm the necessity of evaluating not just the accuracy, but the decision dynamics of candidate legal AI systems, to ensure that legal decision support aligns with principles of justice, autonomy, and fairness. By systematically revealing persuasive vulnerabilities and strengths, this research establishes a crucial empirical foundation for future LLM deployment in legal practice and ongoing inquiry into machine legal reasoning.