Claude Sonnet 4.5: Advanced LLM Performance
- Claude Sonnet 4.5 is a large language model that excels at structured financial modeling and medical reasoning yet struggles with safety compliance and multimodal tasks.
- Empirical results show a 25-29% pass rate on enterprise finance workflows and top base-model rankings on clinical reasoning benchmarks, underscoring both its strengths and its limitations.
- Agentic orchestration significantly improves both performance (up to 85.3/100) and safety compliance, highlighting the need for external governance in high-stakes applications.
Claude Sonnet 4.5 is an LLM that has achieved state-of-the-art base-model results on robust, large-scale benchmarks in finance and medicine. Although distinguished by high performance on knowledge-intensive and complex reasoning tasks, it remains limited in safety, ethics, and the handling of multimodal or highly structured enterprise workflows. Its agentic variants, however, demonstrate substantial gains via orchestration, tool integration, and governance-aware controls. The following sections detail empirical results, task-specific performance, evaluation methodology, limitation analysis, and the safety-governance landscape as established by the Finch and MedBench v4 benchmarks (Dong et al., 15 Dec 2025, Ding et al., 18 Nov 2025).
1. Performance on Enterprise Finance Workflows (Finch Benchmark)
Claude Sonnet 4.5’s capabilities were assessed using the Finch benchmark, which is derived from in-the-wild, spreadsheet-centric professional workflows extracted from corporate environments such as Enron and comprises 172 composite workflows with 384 granular tasks (Dong et al., 15 Dec 2025). Evaluation combined human expert judgment with automated LLM-as-judge scoring.
Overall pass rates (human evaluation):
| Model | Workflows Passed | Pass Rate (%) |
|---|---|---|
| GPT 5.1 Pro | 66 / 172 | 38.4 |
| Claude Sonnet 4.5 | 43 / 172 | 25.0 |
Automated LLM-as-judge scoring credits Claude Sonnet 4.5 with 50/172 passes (29.1%), and the judge's verdicts agree with human experts on 90.2% of workflows.
Pass rates vary by business domain:
| Business Type | Workflows | Pass Rate (approx.) |
|---|---|---|
| Reporting | 48 | 29% |
| Trading & Risk Management | 35 | 32% |
| Predictive Modeling | 33 | 31% |
| Operational Management | 36 | 26% |
| Planning & Budgeting | 26 | 22% |
| Pricing & Valuation | 15 | 24% |
| Accounts Payable/Receivable | 10 | 30% |
| Procurement & Sales | 7 | 18% |
| Asset Management | 3 | 20% |
The model is most reliable in highly structured workflow domains (e.g., trading summaries, mid-scale predictive modeling) and underperforms on layout-driven and data-transformation-heavy tasks.
2. Task-Type Performance, Error Analysis, and Modes of Success/Failure
Finch subdivides tasks into nine classes. Claude Sonnet 4.5's performance varies widely across them, with the highest pass rates on financial modeling, calculation, and visualization. Approximate pass rates by task class:
- Calculation (e.g., formula synthesis): ~35%
- Financial Modeling: ~45%
- Summary/Visualization: ~40%
- Cross-sheet/File Retrieval: ~30%
- Validation/Review: ~32%
- Structuring/Formatting: ~18%
- Data Entry/Import: ~20%
- Web Search: ~15%
- Translation (layout-preserving): ~5%
Frequent success modes:
- Precise formula generation and charting in workflows with clear, regular spreadsheet structures.
- Synthesis of multistep financial models (e.g., scenario-based XNPV; see the sketch below).
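To make the XNPV example concrete, the sketch below shows the standard dated-NPV calculation (actual/365 discounting, as in the spreadsheet XNPV function) evaluated under several rate scenarios; the cashflow schedule, scenario names, and rates are hypothetical, not values drawn from Finch workflows.

```python
from datetime import date

def xnpv(rate: float, cashflows: list[tuple[date, float]]) -> float:
    """Net present value of dated cashflows, discounted on an actual/365 basis."""
    t0 = min(d for d, _ in cashflows)  # valuation date = earliest cashflow date
    return sum(cf / (1.0 + rate) ** ((d - t0).days / 365.0) for d, cf in cashflows)

# Scenario-based usage: discount one schedule under several hypothetical rate scenarios.
schedule = [(date(2025, 1, 1), -10_000.0),
            (date(2025, 7, 1), 4_000.0),
            (date(2026, 1, 1), 4_000.0),
            (date(2026, 7, 1), 4_000.0)]
for name, r in {"base": 0.08, "bull": 0.05, "bear": 0.12}.items():
    print(f"{name}: XNPV = {xnpv(r, schedule):,.2f}")
```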
Frequent failure modes:
- Import of irregularly structured tabular data (PDF/image→spreadsheet).
- Multi-step processes where error propagation undermines subsequent steps.
- Translation involving merged cells, where output misaligns headers/captions.
Error breakdown:
- Formula reasoning errors: 35%
- Data retrieval (range/sheet) mistakes: 25%
- Code-generation failures: 25%
- Task misinterpretation/context omission: 10%
- Formatting/rendering bugs: 5%
Qualitative analysis shows that while Claude Sonnet 4.5 can produce polished charts (surpassing GPT 5.1 Pro in some respects), it sometimes omits essential chart elements (e.g., axis labels), and its translated spreadsheet outputs often distort the original structure (Dong et al., 15 Dec 2025).
3. Medical Reasoning and Safety: MedBench v4 Results
The MedBench v4 benchmark evaluates LLM performance on over 700,000 tasks spanning 24 primary medical specialties and 91 subspecialties, with domain-specific, expert-curated, and workflow-aligned tasks (Ding et al., 18 Nov 2025). Claude Sonnet 4.5 is the top-ranked base LLM, with a macro-averaged overall score of 62.5/100, and leads in medical knowledge QA, language generation, and complex reasoning.
| Capability Dimension | Score (Sonnet 4.5) | Rank (Base LLMs) |
|---|---|---|
| Medical Language Understanding (MLU) | ~60/100 | 2nd |
| Medical Language Generation (MLG) | ~64/100 | 1st |
| Medical Knowledge QA (MKQA) | ~66/100 | 1st |
| Complex Medical Reasoning (CMR) | ~65/100 | 1st |
| Healthcare Safety & Ethics (HSE) | ~20/100 | low |
| Overall Mean | 62.5/100 | 1st |
Safety and ethics remain the weakest dimension (HSE ~20/100), barely exceeding the base-LLM average (18.4/100); no base model surpasses ~25/100. Claude Sonnet 4.5 thus illustrates a challenge shared by all current base models: open-ended safety compliance in clinical settings.
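As a point of clarification, "macro-averaged" here means averaging per-group means, so small specialties count as much as large ones, rather than pooling all tasks into one average. The sketch below illustrates that aggregation on hypothetical per-task scores; the exact grouping hierarchy used by MedBench v4 may differ.

```python
from collections import defaultdict
from statistics import mean

def macro_average(task_scores: list[tuple[str, float]]) -> float:
    """Mean of per-group means, weighting every group equally regardless of size."""
    by_group = defaultdict(list)
    for group, score in task_scores:
        by_group[group].append(score)
    return mean(mean(scores) for scores in by_group.values())

# Hypothetical per-task scores grouped by specialty.
scores = [("cardiology", 70), ("cardiology", 60), ("dermatology", 55),
          ("radiology", 65), ("radiology", 62), ("radiology", 58)]
print(round(macro_average(scores), 1))  # 60.6
```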
4. Agentic Orchestration and Safety/Performance Enhancement
When embedded in an agentic orchestration framework, Claude Sonnet 4.5 shows substantial gains. Orchestration augments the base LLM with external tool use, explicit multi-step planning (e.g., “Reason→Action→Observe” cycles), and modular safety guardrails.
- Overall agent track score: 85.3/100 (vs. 62.5/100 base).
- Safety/ethics (HSE) agent score: 88.9/100 (vs. ~20/100 base).
Tool governance (API access), iterative planning, and prompt-level safety filters are the principal contributors to these gains. The base model's conservative defaults for high-risk queries (e.g., deflecting medical advice with disclaimers) ensure compliance but score lower than the explicit, stepwise controls available in orchestrated settings (Ding et al., 18 Nov 2025).
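The orchestration pattern described above can be sketched as a bounded Reason→Action→Observe loop in which a guardrail check gates every tool call. This is a schematic sketch only: the `llm`, `tools`, and `is_safe` callables are placeholders for illustration, not the agent harness used in MedBench v4.

```python
from typing import Callable, Dict

def run_agent(task: str,
              llm: Callable[[str], dict],
              tools: Dict[str, Callable[[str], str]],
              is_safe: Callable[[str, str], bool],
              max_steps: int = 5) -> str:
    """Minimal Reason->Action->Observe loop with a safety gate before each tool call."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = llm(transcript)          # expected: {'thought', 'action', 'input'} or {'answer': ...}
        if "answer" in step:            # the model decides it has enough evidence to respond
            return step["answer"]
        action, arg = step["action"], step["input"]
        if action not in tools or not is_safe(action, arg):
            observation = "Action refused by safety guardrail."
        else:
            observation = tools[action](arg)   # external tool / API call under governance
        transcript += (f"\nThought: {step['thought']}"
                       f"\nAction: {action}({arg})"
                       f"\nObservation: {observation}")
    return "Step budget exhausted; escalate to a human reviewer."
```

In this framing, the reported safety gains come from the explicit gate and the bounded, inspectable transcript rather than from the base model's own defaults.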
5. Evaluation Methodologies
Finch methodology (Dong et al., 15 Dec 2025):
- Workflows sourced from version histories and real communications of financial institutions.
- LLM-assisted workflow extraction and expert annotation (>700 hours).
- Pass/fail scoring: a workflow passes only if its final outputs fully satisfy the natural-language instructions, with no critical errors or unintended edits.
- Both human expert and automated LLM-as-judge evaluation, with judge-human agreement of 90.2%.
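Read concretely, the 90.2% judge agreement is plain accuracy: the fraction of workflows on which the LLM judge's pass/fail verdict matches the human expert's. A minimal sketch with hypothetical verdicts:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of workflows where the automated judge matches the human pass/fail verdict."""
    assert len(human) == len(judge)
    return sum(h == j for h, j in zip(human, judge)) / len(human)

# Hypothetical verdicts for five workflows (True = pass).
human_verdicts = [True, False, False, True, False]
judge_verdicts = [True, False, True, True, False]
print(f"{judge_agreement(human_verdicts, judge_verdicts):.1%}")  # 80.0%
```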
MedBench v4 evaluation (Ding et al., 18 Nov 2025):
- Task curation and multi-stage clinician review spanning 24 primary specialties and 91 subspecialties.
- LLM-as-a-judge (Qwen2.5-72B-Instruct) calibrated against human raters (Cohen’s κ > 0.82).
- Four-axis scoring: correctness, professionalism, compliance/safety, usability.
- Formal scoring includes micro-F1, macro-recall, IoU, and normalized edit distance for structured, open-ended, and multimodal tasks.
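For reference, the Cohen's κ figure above quantifies judge-human agreement beyond chance. A minimal two-rater implementation on hypothetical 1-5 quality ratings (the actual MedBench v4 calibration protocol is more involved):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters over the same items: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n           # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1-5 quality ratings from a human rater and the LLM judge.
human = [5, 4, 4, 3, 5, 2, 4, 3]
judge = [5, 4, 3, 3, 5, 2, 4, 4]
print(round(cohens_kappa(human, judge), 2))  # 0.65
```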
6. Identified Limitations and Failure Modes
Five primary limitations characterize Claude Sonnet 4.5 in the Finch benchmark environment:
- Cross-artifact navigation failure: Error propagation across interlinked spreadsheets due to indexing/range misalignment.
- Terminological mis-grounding: Confusion over visually similar, but distinct, financial terms.
- Layout inference breakdown: Inability to edit or extract from irregular, merged, or nested table structures.
- Under-utilization of embedded formulas: Reliance on static values, missing embedded business logic.
- Multimodal vision bottlenecks: Lossy parsing and incomplete extraction from PDF/image artifacts.
In the medical domain, the principal weakness is safety and ethics compliance at the base-model level, a gap that only agentic orchestration currently mitigates (Dong et al., 15 Dec 2025, Ding et al., 18 Nov 2025).
7. Multimodal Reasoning and Governance Constraints
Claude Sonnet 4.5 demonstrates state-of-the-art unimodal (text-only) clinical reasoning and report generation. For clinical decision-support tasks combining text and images, it leads the generalist cohort but underperforms specialist vision-LLMs on core perception subtasks (object detection, OCR) and on complex cross-modal QA. Conservative default governance limits some capabilities in the base model, while modular, prompt-level safety filters and multi-step process validation in agentic settings substantially enhance both safety and efficacy.
This suggests that high-performing unimodal LLMs in specialized domains still require external orchestration, tool access, and robust governance scaffolds to be viable for deployment in high-stakes, real-world enterprise and clinical environments (Dong et al., 15 Dec 2025, Ding et al., 18 Nov 2025).