Can Commercial LLMs Be Parliamentary Political Companions? Comparing LLM Reasoning Against Romanian Legislative Expuneri de Motive

Published 31 Mar 2026 in cs.CY | (2603.30028v1)

Abstract: This paper evaluates whether commercial LLMs can function as reliable political advisory tools by comparing their outputs against official legislative reasoning. Using a dataset of 15 Romanian Senate law proposals paired with their official explanatory memoranda (expuneri de motive), we test six LLMs spanning three provider families and multiple capability tiers: GPT-5-mini, GPT-5-chat (OpenAI), Claude Haiku 4.5 (Anthropic), and Llama 4 Maverick, Llama 3.3 70B, and Llama 3.1 8B (Meta). Each model generates predicted rationales evaluated through a dual framework combining LLM-as-Judge semantic scoring and programmatic text similarity metrics. We frame the LLM-politician relationship through principal-agent theory and bounded rationality, conceptualizing the legislator as a principal delegating advisory tasks to a boundedly rational agent under structural information asymmetry. Results reveal a sharp two-tier structure: frontier models (Claude Haiku 4.5, GPT-5-chat, GPT-5-mini) achieve statistically indistinguishable semantic closeness scores above 4.6 out of 5.0, while open-weight models cluster a full tier below (Cohen's d larger than 1.4). However, all models exhibit task-dependent confabulation, performing well on standardized legislative templates (e.g., EU directive transpositions) but generating plausible yet unfounded reasoning for politically idiosyncratic proposals. We introduce the concept of cascading bounded rationality to describe how failures compound across bounded principals, agents, and evaluators, and argue that the operative risk for legislators is not stable ideological bias but contextual ignorance shaped by training data coverage.

Authors (2)

Summary

The paper demonstrates that frontier LLMs achieve semantic closeness scores above 4.6/5 when predicting the official reasoning behind legislative proposals.
It employs a dual evaluation framework, combining custom LLM-as-Judge semantic scoring with programmatic measures such as ROUGE and cosine similarity, to assess argument coverage and factual alignment.
The study reveals risks from structural confabulation and cascading bounded rationality, highlighting the need for human-in-the-loop verification.

Evaluating LLMs as Political Advisory Agents through Legislative Reasoning: Insights from Romanian Parliamentary Data

Introduction

This paper empirically evaluates commercial LLMs on parliamentary legislative reasoning, focusing on their potential as practical political advisory tools. By comparing model-generated rationales for Romanian Senate law proposals against the official expuneri de motive (explanatory memoranda), the study asks whether and how current commercial and open-weight LLMs can function as boundedly rational agents in high-stakes legislative advisory roles. The research also engages with principal-agent theory and cascading bounded rationality, situating LLMs as both agents and monitors within complex human-AI decision environments.

Experimental Framework and Methodology

The evaluation is based on a dataset of 15 Romanian law proposals, each paired with its official explanatory memorandum, ranging from standardized EU directive transpositions to niche, idiosyncratic domestic bills. The models tested include three frontier, closed-weight models (GPT-5-mini and GPT-5-chat from OpenAI, Claude Haiku 4.5 from Anthropic) and three open-weight models from the Meta Llama family (Llama 4 Maverick, Llama 3.3 70B, Llama 3.1 8B), spanning dense and Mixture-of-Experts (MoE) architectures. Each system is prompted as a domain expert to generate legislative rationales.

Assessment is dual-layered:

LLM-as-Judge semantic scoring via a custom prompt-evaluator provides coverage, factual alignment, and holistic closeness scores (five-point scale).
Programmatic metrics include surface-level n-gram (ROUGE), embedding-based (cosine similarity), and legal entity overlap evaluations, adding multidimensionality to performance profiling.
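The three programmatic metric families can be illustrated with a minimal, standard-library sketch. This is not the paper's implementation: the unigram F1 is a simplified stand-in for ROUGE, the cosine is computed over term counts rather than dense embeddings, and the `LEGAL_REF` pattern is a hypothetical regular expression chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    # Lowercase word tokens; Python's word class is Unicode-aware,
    # so Romanian diacritics are kept inside tokens.
    return re.findall(r"\w+", text.lower())


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_bow(candidate, reference):
    """Cosine similarity over term counts (placeholder for dense embeddings)."""
    a, b = Counter(tokens(candidate)), Counter(tokens(reference))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical pattern for references such as "Legea nr. 227/2015"
# or "Directiva 2019/1024"; a real system would need a richer extractor.
LEGAL_REF = re.compile(
    r"\b(?:legea|directiva|regulamentul|oug)\s+(?:nr\.\s*)?\d+/\d+",
    re.IGNORECASE,
)


def entity_jaccard(candidate, reference):
    """Jaccard overlap between the sets of legal references in each text."""
    a = {m.lower() for m in LEGAL_REF.findall(candidate)}
    b = {m.lower() for m in LEGAL_REF.findall(reference)}
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

Each function takes a generated rationale and the official memorandum as plain strings and returns a score in [0, 1], so the three values can be reported side by side with the judge scores.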

Key Results

Two-Tier Model Performance

A clear bifurcation emerges: “frontier” closed-source models (Claude Haiku 4.5, GPT-5-mini, GPT-5-chat) form a top tier with mean semantic closeness scores above 4.6/5.0, whereas all open-weight models form a second tier with scores in the 3.75–4.00 range. The between-tier gap is large (Cohen’s d > 1.4). Intra-tier differences (e.g., scaling from Llama 8B to 70B, or Maverick’s MoE architecture) produce only minor, nonsignificant gains, suggesting that parameter count and architectural complexity within the open-weight family matter little for this legislative advisory task.
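For readers unfamiliar with the effect size quoted above, the following sketch computes Cohen's d with a pooled standard deviation. The summary does not say which variant the paper uses (pooled, paired, or otherwise), so the pooled form is an assumption, and the score lists are made-up values for illustration only.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (sample variances)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Made-up per-law closeness scores, NOT the paper's data.
frontier = [4.8, 4.6, 4.7, 4.5, 4.9]
open_weight = [3.9, 4.1, 3.7, 4.0, 3.8]
d = cohens_d(frontier, open_weight)
```

By the usual convention, values of d above 0.8 are already read as a large effect, so a value above 1.4 means the two score distributions overlap very little.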

Chain-of-Thought Reasoning and Domain Knowledge

Comparing GPT-5-mini (chain-of-thought) to GPT-5-chat isolates reasoning capabilities from model architecture. Chain-of-thought yields higher argument coverage (4.64 vs. 4.37, p = 0.002), but does not translate to increased holistic quality or factual accuracy. Factual alignment remains bottlenecked by underlying domain knowledge and training distribution rather than reasoning sophistication. This indicates that in legal reasoning, retrieval and grounding dominate over intermediate step expansion.

The Coverage–Factuality Gap and Structural Risk

Claude Haiku 4.5, despite matching GPT-5 models on overall semantic closeness, exhibits a structurally larger gap (Δ = 0.36) between argument coverage and factual accuracy, frequently hallucinating legal specifics (e.g., fabricated article numbers or institutional references). This decoupling underscores that models can learn the “shape” of legislative argumentation while lacking factual anchoring, posing unique risks when output verbosity and fluency are mistaken for reliability.

Failure Modes: Task-Dependent Confabulation

All models perform robustly on standardized legislative templates (e.g., EU harmonization), but are prone to task-dependent confabulation when confronted with idiosyncratic or under-represented legislative rationales—a limitation shaped by training data coverage, not persistent ideological slant. The study reframes legislative LLM failure as epistemological (contextual ignorance) rather than ideological, subverting prior focus on left-leaning bias as the primary operative risk.

Metric Divergence

Embedding-based metrics (e.g., cosine similarity) are almost uncorrelated with LLM-as-Judge scores (r = –0.04). N-gram overlap metrics like ROUGE show only moderate correlation, indicating that surface similarity is insufficient for evaluating functionally and semantically rich legal text.
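The correlations reported here are of the kind produced by a Pearson coefficient between an automatic metric and the judge score over the same set of outputs. The sketch below assumes Pearson's r (the summary does not name the coefficient) and uses a hypothetical helper name, `pearson_r`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

Called as `pearson_r(cosine_scores, judge_scores)` with one entry per generated rationale, a result near zero would mean the automatic metric carries almost no information about the judge's ranking.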

LLM-as-Judge as Monitoring: Constraints and Agency Problems

Score compression and output-length confounds are observed in LLM-as-Judge evaluation: frontier models produce longer outputs (5,500–8,600 characters vs. 2,900–3,200 for open-weight models), and longer outputs tend to receive higher semantic closeness scores (r = 0.56), raising concerns about verbosity bias. Intra-model and per-law consistency is sufficient for coarse ranking, but these monitoring limitations create “second-order agency problems”: the judge may itself be blind to crucial errors, a layered agency dilemma in AI advisory ecosystems.
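One simple way to probe a length confound of this kind is to regress judge scores on output length and inspect the residuals. The sketch below is an illustrative diagnostic, not a procedure described in the source; the function name `length_adjusted` and the linear form of the adjustment are assumptions.

```python
from statistics import mean


def length_adjusted(scores, lengths):
    """Residuals of a least-squares fit: score ~ intercept + slope * length."""
    mean_len, mean_score = mean(lengths), mean(scores)
    sxx = sum((x - mean_len) ** 2 for x in lengths)
    if sxx == 0:
        raise ValueError("all outputs have the same length")
    slope = sum(
        (x - mean_len) * (y - mean_score) for x, y in zip(lengths, scores)
    ) / sxx
    intercept = mean_score - slope * mean_len
    # A positive residual means the output scored higher than its
    # length alone would predict under the fitted line.
    return [y - (intercept + slope * x) for x, y in zip(lengths, scores)]
```

If the gap between model tiers shrinks sharply once residuals are compared instead of raw scores, verbosity is a plausible driver of the ranking; if it persists, length alone does not explain it.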

Implications: Principal-Agent Theory, Cascading Bounded Rationality, and Agency Costs

Viewed through the lens of principal-agent theory, the selection and use of LLMs by legislators is marked by acute information asymmetry; traditional market signals (model tier, parameter count, cost) do not reliably encode domain-specific advisory performance. Moreover, the layered structure in which a bounded principal (the legislator) engages a statistically bounded agent (the LLM) whose output is monitored by another bounded agent (the LLM-as-Judge) produces “cascading bounded rationality.” This layering increases the chance that confabulation eludes detection and propagates through organizational structures.

The “competence trap” arises when models output fluent, on-template arguments devoid of factual correctness, a new form of epistemic agency cost that is not directly addressable through classic incentive design or benchmarking. Instead, it demands institutionally calibrated monitoring, targeted to areas of legislative unpredictability, and human-AI task allocation where LLM outputs are systematically reviewed, particularly for politically novel or contextually unique bills.

Theoretical and Practical Consequences

The study’s empirical findings have broader theoretical ramifications. They forecast that as LLMs are further embedded in legislative workflows—particularly in non-Anglophone, less well-represented regions—risks will manifest chiefly as undetected context-bound mistakes rather than overt bias or misalignment. Practically, this demands re-alignment of quality assurance paradigms and the development of new legal text evaluation metrics attuned to regional diversity and the unique epistemic structure of parliamentary argumentation.

For model and application developers, the findings underscore that “intuitive” metrics, model size, or even cost-per-token are distractions in the absence of domain-specific, semantically anchored benchmarks. Human-in-the-loop (HITL) verification—especially by domain experts—remains indispensable unless and until training data representation and factual retrievability are dramatically improved.

Conclusion

This study demonstrates that while commercial frontier LLMs can closely emulate official legislative reasoning in Romanian parliamentary data, their reliability as political advisors is contingent on the fit between training data, task structure, and institutional monitoring. Key risks arise from structural confabulation rather than stable ideological bias and are amplified by layered bounded rationality among all actors in the advisory pipeline. Future research should pursue broader multilingual evaluations, longitudinal model tracking, and the co-development of legal-domain quality metrics, so that AI deployment in legislative contexts improves on, rather than merely replicates, the bounded rationality of human decision-makers.
