PRBench: Evaluating Professional Reasoning
- PRBench is a family of open-ended, expert-crafted benchmarks that assess professional reasoning on complex legal and financial tasks.
- It features context-rich, realistic scenarios developed by professionals, using rigorous rubrics to capture nuanced decision-making.
- Empirical results show frontier models scoring below 0.4 on the hardest high-stakes tasks, underscoring the need for improved transparency and domain-specific reasoning in AI models.
Professional Reasoning Bench (PRBench) comprises a family of open-ended, expert-crafted benchmarks for evaluating the professional reasoning capabilities of advanced language and multimodal models in high-stakes, real-world domains. Developed to address the limitations of academic and public benchmarks, which primarily test narrow, verifiable problems, PRBench introduces realistic, context-rich tasks from high-impact professions—primarily Law and Finance—supplemented by domain-specific rubrics and rigorous automated evaluation protocols. The benchmark systematically measures a model's ability to deliver actionable, contextually nuanced, and economically consequential advice within professional workflows, capturing both overall performance and detailed strengths and weaknesses by competency area (Akyürek et al., 14 Nov 2025).
1. Rationale and Benchmark Scope
The motivation for PRBench is rooted in the observed gap between the capabilities measured by traditional academic benchmarks (e.g., MMLU, GPQA, ARC-AGI) and the demands of real-world professional tasks. Existing evaluations are typically limited to short-form, closed-domain problems, failing to capture open-ended, economically impactful decisions encountered in professional practice. PRBench fills this evaluation deficiency by focusing on tasks requiring domain-specific judgment, interpretability, and contextual understanding in high-stakes environments (Akyürek et al., 14 Nov 2025).
PRBench targets two of the most critical and economically significant categories of professional requests on commercial chat platforms:
- Legal: contract interpretation, litigation strategy, regulatory compliance, procedural advice, risk forecasting, negotiation.
- Finance: valuation, financial modeling, capital-raising, transaction structuring, operational controls, forecasting, regulatory reporting, risk management.
The initial public release of PRBench consists of 1,100 expert-authored tasks (600 Finance, 500 Legal) spanning 114 countries and 47 US jurisdictions. The benchmark's “Hard” subset isolates the most challenging 300 Finance and 250 Legal items.
2. Task Construction and Rubric Design
PRBench tasks are authored by 182 qualified professionals—lawyers (JDs or international equivalents) and finance professionals (CFAs, master’s degrees, or ≥6 years’ professional experience)—using prompts inspired by genuine workflows and client inquiries. Purely theoretical or exam-style tasks are explicitly excluded. Approximately 30% of tasks involve multi-turn dialogues (up to 10 turns) to simulate the clarifying conversations typical of professional scenarios. Each prompt is additionally reviewed by a second domain expert to ensure clarity, specificity, and self-containment (Akyürek et al., 14 Nov 2025).
For each task, experts define 10–30 binary rubric criteria, assigned integer weights from −10 (“Critically Detrimental”) to +10 (“Critically Important”) to encode the importance and severity of individual aspects. Positive weights correspond to desirable features; negative weights penalize harmful or off-target content. Rubric criteria satisfy properties of constructiveness, atomicity, objectivity, mutual exclusivity and collective exhaustiveness, and self-containment.
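To make this rubric format concrete, the following minimal Python sketch models a single criterion under the constraints described above; the field names and the example criterion are illustrative assumptions, not items from the released dataset.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One binary rubric criterion with a signed integer weight in [-10, 10]."""
    criterion_id: str
    description: str   # atomic, objective statement a judge can verify
    weight: int        # +10 "Critically Important" ... -10 "Critically Detrimental"
    axis: str          # e.g. "Practical Utility", "Legal Accuracy"

    def __post_init__(self) -> None:
        # Enforce only the weight range stated in the text.
        if not -10 <= self.weight <= 10:
            raise ValueError("weight must be an integer between -10 and +10")

# Hypothetical example, not an actual PRBench criterion:
cite_statute = RubricCriterion(
    criterion_id="legal-001",
    description="Identifies the governing statute of limitations for the claim.",
    weight=8,
    axis="Legal Accuracy",
)
```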
Rubric axes are divided into shared and domain-specific categories:
| Shared Rubric Axes | Finance-Specific | Legal-Specific |
|---|---|---|
| Practical Utility | Financial Accuracy | Legal Accuracy |
| Handling Uncertainty | Process Transparency & Auditability | Procedural Correctness |
| Supplemental Insight | Risk & Regulatory Disclosure | Application of Law |
| Instruction Following | | Risk & Ethical Disclosure |
| Risk/Ethics | | |
A validation pipeline applies automated checks for format, duplication, and coverage, augmented by random QA audits and independent expert review, resulting in 93.9% agreement on rubric validity.
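The automated portion of such a pipeline can be approximated with simple programmatic checks over each task's rubric. The sketch below is a hypothetical stand-in for PRBench's unpublished validation rules, covering only the format, duplication, and coverage checks named above.

```python
from typing import Dict, List

def validate_rubric(criteria: List[Dict]) -> List[str]:
    """Return a list of problems found by format, duplication, and coverage checks."""
    problems = []
    # Format: weights must be integers in the documented [-10, 10] range,
    # and every criterion needs a non-empty description.
    for c in criteria:
        if not isinstance(c.get("weight"), int) or not -10 <= c["weight"] <= 10:
            problems.append(f"bad weight in criterion {c.get('id')}")
        if not c.get("description", "").strip():
            problems.append(f"empty description in criterion {c.get('id')}")
    # Duplication: no two criteria should repeat the same text.
    seen = set()
    for c in criteria:
        text = " ".join(c.get("description", "").lower().split())
        if text in seen:
            problems.append(f"duplicate criterion text: {text[:60]!r}")
        seen.add(text)
    # Coverage: stay within the documented 10-30 criteria per task,
    # with at least one positively weighted criterion.
    if not 10 <= len(criteria) <= 30:
        problems.append(f"expected 10-30 criteria, found {len(criteria)}")
    if not any(c.get("weight", 0) > 0 for c in criteria):
        problems.append("no positively weighted criteria")
    return problems
```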
3. Evaluation Protocols
Twenty leading proprietary and open-source chat models are evaluated on the full task set, including GPT-5 (High Reasoning), GPT-5 Pro, GPT-4.1, Claude 4.5, Gemini 2.5 Pro/Flash, O3, Grok 4.1, GPT OSS 120B, and Kimi K2 Thinking (Akyürek et al., 14 Nov 2025). Each model is given a 60-minute timeout per task with up to five retries and is run under vendor-recommended token budgets (e.g., 32k tokens for Claude Sonnet).
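A minimal harness implementing the stated retry-and-timeout policy might look like the sketch below; `generate_response` is a placeholder for a vendor API call, the 32k default token budget is taken from the Claude Sonnet example, and treating the 60 minutes as a whole-task (rather than per-attempt) budget is an interpretation, not a detail confirmed by the paper.

```python
import time

TIMEOUT_S = 60 * 60   # 60-minute per-task budget (from the protocol above)
MAX_RETRIES = 5       # up to five retries per task

def generate_response(model: str, prompt: str, max_tokens: int, timeout: float) -> str:
    """Placeholder for a vendor chat-completion call; most SDKs accept a
    request timeout and a maximum output-token budget directly."""
    raise NotImplementedError

def run_task(model: str, prompt: str, max_tokens: int = 32_000) -> str | None:
    """Run one task under the stated timeout/retry policy (illustrative only)."""
    deadline = time.monotonic() + TIMEOUT_S
    for attempt in range(1, MAX_RETRIES + 1):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            return generate_response(model, prompt, max_tokens, timeout=remaining)
        except Exception as exc:  # rate limits, timeouts, transient errors
            print(f"{model}: attempt {attempt} failed ({exc!r}), retrying")
    return None  # recorded as unanswered once the budget is exhausted
```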
Model responses are scored automatically by an LLM-based judge (o4-mini), which grades each rubric criterion as satisfied (1) or not satisfied (0). Judge reliability is measured with Cohen’s $\kappa$ and macro F1 against human expert grades, and matches inter-expert agreement.
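The exact judging prompt is not reproduced here; the sketch below only illustrates the general shape of a per-criterion judge call, assuming the OpenAI Python SDK and a hypothetical prompt template and verdict format.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a professional response against one rubric criterion.\n"
    "Criterion: {criterion}\n\nResponse:\n{response}\n\n"
    "Answer with exactly one token: SATISFIED or NOT_SATISFIED."
)  # hypothetical wording, not the actual PRBench judge prompt

def judge_criterion(response_text: str, criterion: str, model: str = "o4-mini") -> int:
    """Return 1 if the judge marks the criterion satisfied, else 0."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(criterion=criterion, response=response_text),
        }],
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return 1 if verdict.startswith("SATISFIED") else 0
```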
Scoring formulas:
- Per-prompt score for prompt $p$ (clipped to $[0,1]$), where $c_{p,i} \in \{0,1\}$ is the judge's verdict on criterion $i$ and $w_{p,i}$ its weight: $s_p = \min\!\left(1,\ \max\!\left(0,\ \frac{\sum_i w_{p,i}\,c_{p,i}}{\sum_{i:\,w_{p,i}>0} w_{p,i}}\right)\right)$
- Overall model score $S$: the mean over the prompt set $P$, $S = \frac{1}{|P|}\sum_{p \in P} s_p$
- Min-normalized (penalty-aware) score, which rescales the raw weighted sum between the minimum and maximum attainable totals so that negative-weight penalties are not masked by clipping: $\tilde{s}_p = \frac{\sum_i w_{p,i}\,c_{p,i} - \sum_{i:\,w_{p,i}<0} w_{p,i}}{\sum_{i:\,w_{p,i}>0} w_{p,i} - \sum_{i:\,w_{p,i}<0} w_{p,i}}$
- Category-level scores are computed analogously, restricting the sums to the criteria within each rubric axis.
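These definitions translate directly into code. The sketch below mirrors the formulas as reconstructed above, taking per-criterion integer weights and binary judge verdicts for one prompt.

```python
def per_prompt_score(weights: list[int], verdicts: list[int]) -> float:
    """Clipped per-prompt score s_p: weighted sum of satisfied criteria,
    normalized by the maximum attainable total (sum of positive weights)."""
    raw = sum(w * c for w, c in zip(weights, verdicts))
    max_total = sum(w for w in weights if w > 0)
    return min(1.0, max(0.0, raw / max_total))

def overall_score(per_prompt: list[float]) -> float:
    """Overall model score S: mean of per-prompt scores."""
    return sum(per_prompt) / len(per_prompt)

def min_normalized_score(weights: list[int], verdicts: list[int]) -> float:
    """Penalty-aware score: rescale between the minimum and maximum
    attainable weighted totals so negative weights are not hidden by clipping."""
    raw = sum(w * c for w, c in zip(weights, verdicts))
    max_total = sum(w for w in weights if w > 0)
    min_total = sum(w for w in weights if w < 0)
    return (raw - min_total) / (max_total - min_total)

# Example: three criteria with weights +8, +5, -6; the response satisfies the
# first criterion and triggers the penalty.
print(per_prompt_score([8, 5, -6], [1, 0, 1]))      # (8 - 6) / 13 ≈ 0.154
print(min_normalized_score([8, 5, -6], [1, 0, 1]))  # (2 + 6) / 19 ≈ 0.421
```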
4. Empirical Findings
On the Hard task subsets, maximum observed scores are 0.39 (Finance) and 0.37 (Legal); on the full sets (600 Finance, 500 Legal), top scores are 0.51 and 0.50, respectively. GPT-5 and GPT-5 Pro lead the rankings, closely followed by Grok 4 Fast Reasoning, O3, GPT OSS 120B, and Kimi K2 Thinking (Akyürek et al., 14 Nov 2025).
Performance varies significantly by rubric axis. In Finance, GPT-5 leads on Financial Accuracy, but Process Transparency & Auditability scores remain below 0.3 for all models; GPT-5 also shows relative strength in Handling Uncertainty and Practical Utility. In Legal, GPT-5 surpasses the next-best model by over 10% on Legal Accuracy and Application of Law, Grok 4 matches it closely on Procedural Correctness, Risk & Ethical Disclosure is led by Grok 4 and Mistral Medium, and Gemini 2.5 Pro excels at Instruction Following.
Common failure modes include:
- Inaccurate or missing domain-specific judgments (e.g., failure to identify critical statutes or miscalculating financial metrics)
- Opaque, non-explanatory reasoning, with conclusions stated without the underlying steps or explicit assumptions
- Incomplete exploration of edge cases and relevant caveats
- Tendency to include hallucinated or irrelevant external references when web search tools are active
An economic impact analysis annotates each task along Decision Type (e.g., Transaction Economics, Compliance, Risk Forecasting) and Economic Pathway (e.g., Value Creation, Penalty Avoidance, Compliance Efficiency). On Finance tasks, models are weakest on “Markets/Transactions” and “Capital Funding,” while on Legal tasks they underperform on “Compliance Efficiency” and “Penalty Avoidance.” This mapping enables identification of particularly challenging value dimensions for future model improvement (Akyürek et al., 14 Nov 2025).
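A breakdown of this kind can be produced by averaging per-task scores within each annotation bucket; the sketch below assumes tasks are represented as records with `decision_type`, `economic_pathway`, and `score` fields, which is an inferred representation rather than the released schema.

```python
from collections import defaultdict
from statistics import mean

def scores_by_annotation(tasks: list[dict], key: str) -> dict[str, float]:
    """Average per-task scores within each value of an annotation field
    (e.g. key="decision_type" or key="economic_pathway")."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for task in tasks:
        buckets[task[key]].append(task["score"])
    return {label: mean(scores) for label, scores in buckets.items()}

# Illustrative usage with made-up records:
tasks = [
    {"decision_type": "Compliance", "economic_pathway": "Penalty Avoidance", "score": 0.42},
    {"decision_type": "Transaction Economics", "economic_pathway": "Value Creation", "score": 0.55},
]
print(scores_by_annotation(tasks, "economic_pathway"))
```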
5. Comparison to Related and Complementary Benchmarks
The emergence of PRBench is situated among several recent efforts to address the limitations of closed-form academic benchmarks. ProBench (Yang et al., 10 Mar 2025) targets multimodal expert-level queries spanning 10 fields and 56 sub-fields and evaluates 24 proprietary and open-weight models using an MLLM-as-a-Judge protocol with pairwise Likert scoring and Elo ranking. Its focus is open-ended, image- and text-driven tasks with tracks for single-round, multi-round, and multilingual queries; top proprietary and open-weight models achieve closely matched Elo scores (e.g., Claude-3.5-Sonnet ≈1301 on the multilingual track; Pixtral-Large-Instruct ≈1294).
ProfBench (Wang et al., 21 Oct 2025) provides over 7,000 response–criterion pairs across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA domains. It uses expert-crafted rubrics and affordable LLM-based judges that minimize self-enhancement bias, enabling cost-efficient large-scale evaluation. Even top-performing models (e.g., GPT-5-high) reach an overall score of only 65.9%, with notable domain disparities (Physics 49.3%, Consulting 80.0%).
A key distinction of PRBench is the exclusive focus on high-stakes professional reasoning in Law and Finance and the depth of rubric-based, expert-authored validation in these domains. While ProBench and ProfBench diversify the landscape into multimodal and broader multi-domain evaluation, PRBench provides the largest and most granular public testbed for rubric-based assessment of LLMs in real-world economically critical professional domains.
6. Implications and Future Directions
Findings from PRBench highlight that frontier models remain substantially below the reliability threshold for unsupervised use in high-stakes professional workflows, evidenced by sub-0.4 scores on the hardest real-world cases. Weaknesses are most pronounced in process transparency, auditability, and completeness—deficits that undermine trust even when raw factual accuracy is present (Akyürek et al., 14 Nov 2025).
Recommendations for future model development and benchmarking include:
- Training with domain-specific procedural data and chain-of-thought annotations to improve transparency and reasoning traceability.
- Expansion of PRBench to incorporate more jurisdictions, additional professional domains (e.g., healthcare, engineering), and multimodal context (e.g., contracts, spreadsheets).
- Utilization of rubric-based reward signals (e.g., via reinforcement learning with rubric anchors) to optimize for practitioner-relevant dimensions.
- Inclusion of decision-type and economic-pathway annotations to better dissect model competence across decision-making axes.
The open-sourcing of PRBench’s 1,100 expert-authored tasks, 19,356 validated rubric criteria, annotated economic impact categories, and evaluation code provides a robust infrastructure for driving advances in the development and assessment of LLMs targeted at professional reasoning (Akyürek et al., 14 Nov 2025).