- The paper demonstrates that frontier LLMs achieve semantic closeness scores above 4.6/5 when reproducing official legislative reasoning.
- It employs dual evaluation metrics—custom semantic scoring and programmatic measures like ROUGE and cosine similarity—to assess argument quality and factual accuracy.
- The study reveals risks from structural confabulation and cascading bounded rationality, highlighting the need for human-in-the-loop verification.
Evaluating LLMs as Political Advisory Agents through Legislative Reasoning: Insights from Romanian Parliamentary Data
Introduction
This paper conducts an empirical evaluation of commercial and open-weight LLMs in the domain of parliamentary legislative reasoning, focusing on their potential as practical political advisory tools. By comparing model-generated rationales for Romanian Senate law proposals against the official government expuneri de motive (explanatory memoranda), the study asks whether and how current LLMs can function as boundedly rational agents in high-stakes legislative advisory roles. The research also engages critically with principal-agent theory and cascading bounded rationality, situating LLMs as both agents and monitors within complex human-AI decision environments.
Experimental Framework and Methodology
The evaluation is based on a dataset of 15 Romanian law proposals, each paired with its official explanatory memorandum, spanning a spectrum from standardized EU directive transpositions to niche, idiosyncratic domestic bills. The models tested include frontier closed-weight models (GPT-5-mini and GPT-5-chat from OpenAI, Claude Haiku 4.5 from Anthropic) and open-weight models from the Meta Llama family (Llama 4 Maverick, Llama 3.3 70B, Llama 3.1 8B), spanning dense and Mixture-of-Experts (MoE) architectures. Each system is prompted as a domain expert to generate legislative rationales.
Assessment is dual-layered:
- LLM-as-Judge semantic scoring via a custom prompt-evaluator provides coverage, factual alignment, and holistic closeness scores (five-point scale).
- Programmatic metrics include surface-level n-gram (ROUGE), embedding-based (cosine similarity), and legal entity overlap evaluations, adding multidimensionality to performance profiling.
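The programmatic layer can be illustrated with a minimal Python sketch. The summary does not give the paper's exact implementations, so the function names, the unigram ROUGE-1 F1 variant, and the Jaccard formulation of legal entity overlap are all assumptions. Embeddings are assumed to arrive as plain lists of floats from whatever encoder is used.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens; \\w+ also matches Romanian diacritics."""
    return re.findall(r"\w+", text.lower())


def rouge1_f(candidate, reference):
    """Unigram-overlap F1 (a simplified ROUGE-1, assumed variant)."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def entity_overlap(candidate_entities, reference_entities):
    """Jaccard overlap of legal entity strings (assumed formulation)."""
    cand, ref = set(candidate_entities), set(reference_entities)
    union = cand | ref
    return len(cand & ref) / len(union) if union else 0.0
```

Each function scores one generated rationale against one official memorandum. Averaging over the 15 laws would give a per-model profile.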
Key Results
A clear bifurcation emerges: “frontier” closed-weight models (Claude Haiku 4.5, GPT-5-mini, GPT-5-chat) form a top tier with mean semantic closeness scores above 4.6/5.0, whereas all open-weight models constitute a second tier with scores in the 3.75–4.00 range. This between-tier difference is highly significant, with a large effect size (Cohen’s d > 1.4). Intra-tier differences (e.g., scaling from Llama 8B to 70B, or Maverick’s MoE architecture) produce only minor, nonsignificant gains, suggesting that parameter count and architectural complexity matter little within open-weight families for legislative advisory tasks.
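Cohen's d expresses the between-tier difference in units of pooled standard deviation. The sketch below uses the standard pooled-variance formula; whether the paper used this exact variant is an assumption, and the function name is hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with pooled sample variance; needs >= 2 values per group."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
```

By the usual convention, values above 0.8 are read as large, so d > 1.4 means the tier means sit well over one pooled standard deviation apart.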
Chain-of-Thought Reasoning and Domain Knowledge
Comparing GPT-5-mini (chain-of-thought) to GPT-5-chat isolates reasoning capabilities from model architecture. Chain-of-thought yields higher argument coverage (4.64 vs. 4.37, p = 0.002), but does not translate to increased holistic quality or factual accuracy. Factual alignment remains bottlenecked by underlying domain knowledge and training distribution rather than reasoning sophistication. This indicates that in legal reasoning, retrieval and grounding dominate over intermediate step expansion.
The Coverage–Factuality Gap and Structural Risk
Claude Haiku 4.5, despite matching GPT-5 models on overall semantic closeness, exhibits a structurally larger gap (Δ = 0.36) between argument coverage and factual accuracy, frequently hallucinating legal specifics (e.g., fabricated article numbers or institutional references). This decoupling underscores that models can learn the “shape” of legislative argumentation while lacking factual anchoring, posing unique risks when output verbosity and fluency are mistaken for reliability.
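The gap can be read as mean argument coverage minus mean factual alignment for a given model. That reading, the function names, and the review threshold in the sketch below are assumptions made for illustration, not details confirmed by the paper.

```python
from statistics import mean


def coverage_factuality_gap(scores):
    """scores: list of (coverage, factual_alignment) pairs, one per law."""
    coverage = mean(c for c, _ in scores)
    factual = mean(f for _, f in scores)
    return coverage - factual


def needs_human_review(gap, threshold=0.3):
    """Flag a model whose gap exceeds a hypothetical threshold."""
    return gap > threshold
```

A positive gap means a model covers the expected arguments more reliably than it gets the underlying legal facts right, which is the pattern described above.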
Failure Modes: Task-Dependent Confabulation
All models perform robustly on standardized legislative templates (e.g., EU harmonization), but are prone to task-dependent confabulation when confronted with idiosyncratic or under-represented legislative rationales. This limitation is shaped by training data coverage, not by a persistent ideological slant. The study thus reframes legislative LLM failure as epistemological (contextual ignorance) rather than ideological, challenging the prior focus on left-leaning bias as the primary operative risk.
Metric Divergence
Embedding-based metrics (e.g., cosine similarity) are almost uncorrelated with LLM-as-Judge scores (r = −0.04). N-gram overlap metrics like ROUGE exhibit only moderate correlation, confirming that surface similarity is insufficient for evaluating functionally and semantically rich legal text.
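The correlations reported here can be reproduced in principle with a plain Pearson coefficient over paired scores. This is a minimal sketch, assuming one programmatic score and one judge score per generated rationale; the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

A value near zero, as reported for cosine similarity, means the metric carries almost no linear information about the judge's ranking.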
LLM-as-Judge as Monitoring: Constraints and Agency Problems
Score compression and output-length confounds are observed in LLM-as-Judge evaluation: longer outputs (5,500–8,600 chars vs. 2,900–3,200 in open-weight models) tend to receive higher semantic closeness scores (r = 0.56), raising concerns about verbosity bias. Intra-model and per-law consistency is sufficient for coarse ranking, but monitoring limitations instantiate “second-order agency problems” where the judge may also be blind to crucial errors—a layered agency dilemma in AI advisory ecosystems.
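One simple way to probe a verbosity confound is to regress judge scores on output length and inspect what remains. The sketch below is not the paper's procedure; it is an assumed least-squares adjustment with a hypothetical function name.

```python
def length_adjusted_scores(lengths, scores):
    """Residual judge scores after a least-squares fit on output length."""
    n = len(lengths)
    mean_len = sum(lengths) / n
    mean_score = sum(scores) / n
    sxx = sum((length - mean_len) ** 2 for length in lengths)
    sxy = sum(
        (length - mean_len) * (score - mean_score)
        for length, score in zip(lengths, scores)
    )
    slope = sxy / sxx  # score change per extra character
    intercept = mean_score - slope * mean_len
    return [
        score - (intercept + slope * length)
        for length, score in zip(lengths, scores)
    ]
```

If the ranking of models changes noticeably once residuals replace raw scores, length is likely driving part of the judge's preference.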
Implications: Principal-Agent Theory, Cascading Bounded Rationality, and Agency Costs
Viewed through the lens of principal-agent theory, the selection and use of LLMs by legislators is fraught with acute information asymmetry; traditional market signals (model tier, parameter count, cost) do not reliably encode domain-specific advisory performance. Moreover, the layered structure in which a bounded principal (legislator) engages a statistically bounded agent (LLM) that is in turn monitored by another bounded agent (LLM-as-Judge) produces “cascading bounded rationality.” This compounds the potential for confabulation to elude detection and propagate through organizational structures.
The “competence trap” arises when models output fluent, on-template arguments devoid of factual correctness, a new form of epistemic agency cost that is not directly addressable through classic incentive design or benchmarking. Instead, it demands institutionally calibrated monitoring, targeted to areas of legislative unpredictability, and human-AI task allocation where LLM outputs are systematically reviewed, particularly for politically novel or contextually unique bills.
Theoretical and Practical Consequences
The study’s empirical findings have broader theoretical ramifications. They forecast that as LLMs are further embedded in legislative workflows—particularly in non-Anglophone, less well-represented regions—risks will manifest chiefly as undetected context-bound mistakes rather than overt bias or misalignment. Practically, this demands re-alignment of quality assurance paradigms and the development of new legal text evaluation metrics attuned to regional diversity and the unique epistemic structure of parliamentary argumentation.
For model and application developers, the findings underscore that “intuitive” metrics, model size, or even cost-per-token are distractions in the absence of domain-specific, semantically anchored benchmarks. Human-in-the-loop (HITL) verification—especially by domain experts—remains indispensable unless and until training data representation and factual retrievability are dramatically improved.
Conclusion
This study demonstrates that while commercial, frontier LLMs can closely emulate official legislative reasoning in Romanian parliamentary data, their reliability as political advisors is contingent on the fit between training data, task structure, and institutional monitoring. Key risks arise from structural confabulation rather than stable ideological bias and are amplified by layered bounded rationality among all actors in the advisory pipeline. Future research should pursue broader multilingual evaluations, longitudinal model tracking, and the co-development of legal-domain quality metrics, so that AI deployment in legislative contexts mitigates, rather than merely replicates, the constraints on human rationality.