
Claude-3.5 Sonnet: LLM Benchmark Overview

Updated 23 November 2025
  • Claude-3.5 Sonnet is a transformer-based LLM with advanced reasoning, multilingual capabilities, and robust safety controls.
  • It demonstrates competitive performance on zero-shot and few-shot benchmarks spanning STEM, medical, and cybersecurity tasks.
  • Benchmark results reveal nuanced strengths and limitations across disciplines, guiding future fine-tuning and safety protocols.

Claude-3.5 Sonnet is a proprietary LLM developed by Anthropic, released in multiple versions throughout 2024. It is designed to deliver high reasoning competence, multilingual generalization, and safety controls, with extensive benchmarking reported across STEM, medical, social, and software engineering domains. The following sections provide a comprehensive review of Claude-3.5 Sonnet's technical performance, benchmarking methodologies, application domains, robustness features, and known limitations, as documented in recent peer-reviewed and preprint literature.

1. Technical Profile and Benchmark Overview

Claude-3.5 Sonnet is a general-purpose transformer-based LLM positioned between Anthropic's small, fast Haiku and large Opus models. Multiple versions have been released, notably Claude-3.5-Sonnet-20240620 and Claude-3.5-Sonnet-20241022. The model has been evaluated on zero-shot and few-shot downstream tasks, including AI cyber risk benchmarking, STEM academic competitions, engineering and medical domain exams, automated software testing, rater effect modeling, and adversarial safety audits (Ristea et al., 29 Oct 2024, Huang et al., 24 Jun 2024, Truyts et al., 26 Jul 2025, Syed et al., 15 Aug 2024, Barradas et al., 5 Sep 2025, Safavi-Naini et al., 25 Aug 2024, Jiao et al., 24 May 2025, Saeed et al., 31 Oct 2024, Erziev, 28 Feb 2025).
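For orientation, zero-shot evaluation of the kind reported below amounts to sending each benchmark item to the model once, with no worked exemplars. A minimal sketch against the Anthropic Messages API (the question text and answer handling are placeholders, not items from any cited benchmark):

```python
# Minimal zero-shot query to a Claude-3.5 Sonnet snapshot via the Anthropic
# Python SDK; requires ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

def ask_zero_shot(question: str, model: str = "claude-3-5-sonnet-20240620") -> str:
    """Send a single benchmark item with no exemplars (zero-shot)."""
    response = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": question}],
    )
    # The reply arrives as a list of content blocks; take the first text block.
    return response.content[0].text

# Hypothetical true/false item in the style of an engineering benchmark.
print(ask_zero_shot("True or False: adding a lane always reduces congestion. "
                    "Answer with a single word."))
```

A few-shot variant would simply prepend worked examples to the user message or supply them as prior conversation turns.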

2. General Benchmark Performance and Academic Reasoning

Claude-3.5 Sonnet consistently demonstrates strong zero-shot performance on multidisciplinary assessments. In the OlympicArena benchmark, which aggregates problems from 62 international Olympiad-level competitions across mathematics, physics, chemistry, biology, geography, astronomy, and computer science, Claude-3.5 Sonnet achieved an overall score of 39.24%, ranking narrowly behind GPT-4o (40.47%) but ahead of Gemini-1.5 Pro (35.09%) and all open-source models. It outperformed GPT-4o in physics (31.16% vs. 30.01%), chemistry (47.27% vs. 46.68%), and biology (56.05% vs. 53.11%), attributed to focused improvements in cause-and-effect and decompositional reasoning, but lagged behind GPT-4o in mathematics (23.18% vs. 28.32%) and computer science (5.19% vs. 8.43%) (Huang et al., 24 Jun 2024).

Model | Overall (%) | Gold | Silver | Bronze
GPT-4o | 40.47 | 4 | 3 | 0
Claude-3.5 Sonnet | 39.24 | 3 | 3 | 0
Gemini-1.5-Pro | 35.09 | 0 | 0 | 6

Claude-3.5 Sonnet also leads on domain-specific tasks. On the TransportBench suite for undergraduate transportation engineering, it yielded the highest zero-shot accuracy (67.1%) and the lowest mixed response rate (MRR, 8.2% across true/false items), i.e. the most consistent answers, among seven LLMs including GPT-4, GPT-4o, Gemini 1.5 Pro, Llama 3, and Llama 3.1 (Syed et al., 15 Aug 2024).

3. Domain-Specific Capabilities

3.1 Medical Reasoning and Multilingual Performance

Claude-3.5 Sonnet matches or exceeds median human performance in specialty-level medical assessments across multiple languages. On the 2022 American College of Gastroenterology board benchmark (300 MCQs), Claude-3.5 Sonnet achieved the top accuracy (74.0%), essentially matching the mean human score (74.52%) and slightly ahead of GPT-4o (73.7%). Structured prompts combining expert mimicry and chain-of-thought reasoning yielded peak accuracy (Safavi-Naini et al., 25 Aug 2024).
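A hypothetical prompt template in the spirit of that structured setup (the wording below is illustrative, not the authors' actual prompt):

```python
# Illustrative expert-mimicry + chain-of-thought prompt template; a sketch,
# not the exact prompt used by Safavi-Naini et al.
SYSTEM_PROMPT = (
    "You are a board-certified gastroenterologist answering a multiple-choice "
    "board examination question."
)

def build_prompt(question: str, options: list[str]) -> str:
    """Format an MCQ with lettered options and a step-by-step instruction."""
    formatted = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n{formatted}\n\n"
        "Reason through the relevant physiology and evidence step by step, "
        "then state your final answer as a single letter."
    )
```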

In a Brazilian Portuguese residency exam (HCFMUSP, 117 MCQs), Claude-3.5 Sonnet achieved 70.3% (text-only) and 69.6% (mixed text+image), both within the human score density peak (65–70%). It showed modest performance drops on radiology image items but outperformed sibling models in multimodal integration (Truyts et al., 26 Jul 2025).

3.2 Automated Software Engineering and Cybersecurity

In REST API test-case generation using the RestTSLLM framework, Claude-3.5 Sonnet outperformed seven LLMs across all evaluation axes: a 100% test success rate, 71.7% branch coverage, a 40.8% mutation score, and zero failures across 230 generated test cases. These results were attributed to effective system prompting, robust handling of intermediate TSL representations, and coherent xUnit test code generation (Barradas et al., 5 Sep 2025).
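For illustration, a test of the kind such a pipeline emits might look like the following pytest sketch (the endpoint, payload, and expected status codes are hypothetical; the cited framework actually generates xUnit code from TSL intermediate representations):

```python
# Hypothetical generated REST API tests; endpoint and payloads are invented.
import requests

BASE_URL = "http://localhost:8080"  # assumed service under test

def test_create_user_returns_201_and_echoes_name():
    payload = {"name": "Alice", "email": "alice@example.com"}
    resp = requests.post(f"{BASE_URL}/users", json=payload, timeout=5)
    assert resp.status_code == 201
    assert resp.json()["name"] == payload["name"]

def test_create_user_rejects_missing_email():
    # Branch-coverage figures depend on tests like this one that exercise
    # validation paths, not only the happy path.
    resp = requests.post(f"{BASE_URL}/users", json={"name": "Bob"}, timeout=5)
    assert resp.status_code == 400
```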

In automated cybersecurity exploitation (the AIxCC Nginx challenge), Claude-3.5 Sonnet delivered sub-10s test-input loops at moderate cost ($4.55–$7.45 per exploit, with success rates of 21.4% for the 20240620 snapshot and 14.3% for 20241022). While far behind o1-preview (64.7–78.6% success), Claude-3.5 Sonnet is suitable as a real-time "fuzz" assistant, highlighting a dual-use risk profile (Ristea et al., 29 Oct 2024).
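Schematically, the "fuzz assistant" role is a loop in which the model proposes candidate inputs and a harness executes them against the target. A sketch under assumed names (the target binary, crash heuristic, and prompt are illustrative and do not reproduce the AIxCC setup):

```python
# Schematic LLM-in-the-loop fuzzing harness; the target command and crash
# heuristic are assumptions for illustration.
import subprocess
import anthropic

client = anthropic.Anthropic()

def propose_input(history: list[str]) -> str:
    """Ask the model for one new candidate input, given recent attempts."""
    prompt = ("Propose one new test input likely to exercise an unexplored "
              "parsing path in an HTTP server. Previously tried:\n"
              + "\n".join(history[-5:]))
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

tried: list[str] = []
for _ in range(10):
    candidate = propose_input(tried)
    tried.append(candidate)
    # Feed the candidate to the (assumed) harness; a nonzero exit code is
    # treated here as a crash worth triaging.
    result = subprocess.run(["./target_harness"], input=candidate.encode(),
                            capture_output=True, timeout=10)
    if result.returncode != 0:
        print("potential crash input:", candidate[:80])
        break
```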

4. Robustness, Bias, Consistency, and Security

4.1 Bias and Adversarial Safety

Claude-3.5 Sonnet underwent extensive bias and red-teaming audits focused on Arab-Western cultural stereotypes (Saeed et al., 31 Oct 2024). It displayed the lowest aggregate bias and jailbreak rates among six models (including GPT-4, GPT-4o, LLaMA 3.1, and Mistral 7B). Across 800 prompts and eight bias categories, its bias score for the "Arabs as losers" category was consistently below those of GPT-4o and LLaMA 3.1 405B, with the most pronounced disparities in the anti-Semitism (54.4%) and backwardness (50.98%) categories. For adversarial jailbreak prompts, its attack success rate was the lowest (≤13.3% per category, and 0.09 overall in code-obfuscation tests), far below competing closed and open models.

4.2 Rating Consistency and Rater Effects

On constructed-response scoring (AP Chinese, holistic and analytic), Claude-3.5 Sonnet’s quadratic weighted kappa (QWK) ranged 0.576–0.896 versus human raters (baseline: 0.68–0.87). Cronbach’s alpha for intra-prompt reliability reached 0.94 (analytic) and 0.80 (holistic), on par with expert humans. Many-Facet Rasch Model analyses confirmed mild leniency and well-calibrated fit statistics, supporting stable, reliable scoring for low-stakes educational settings (Jiao et al., 24 May 2025).

Score Type | Claude-3.5 Sonnet QWK | Human Baseline QWK
Holistic (per essay) | 0.576–0.896 | 0.68–0.87
Analytic (traits) | 0.574–0.754 | 0.715–0.950
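The QWK figures above weight disagreements by the square of their distance on the score scale, so near-misses are penalized far less than large gaps. The metric can be reproduced with scikit-learn's weighted Cohen's kappa; a minimal sketch on invented score vectors:

```python
# Quadratic weighted kappa (QWK) between two raters; the score vectors are
# invented for illustration, not data from the AP Chinese study.
from sklearn.metrics import cohen_kappa_score

human = [4, 3, 5, 2, 4, 3, 5, 1, 4, 3]
model = [4, 3, 4, 2, 5, 3, 5, 2, 4, 3]

qwk = cohen_kappa_score(human, model, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1.0 is perfect agreement, 0 is chance level
```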

4.3 Consistency and Reasoning Brittleness

Consistency evaluations in engineering contexts highlighted strengths and weaknesses. Claude-3.5 Sonnet achieved an aggregate T/F accuracy of 75.6% over 365 trials with a zero-shot MRR of 8.2%. However, self-check prompts led to 16 incorrect flips among 73 T/F problems, indicating decision-boundary brittleness. Prompting strategies, especially chain-of-thought and template design, significantly modulate consistency and correctness (Syed et al., 15 Aug 2024).
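Assuming MRR denotes the fraction of items whose answers are not unanimous across repeated runs, it can be computed as in the following sketch (the response matrix is invented):

```python
# Mixed response rate (MRR) over repeated true/false trials, computed here as
# the share of items with non-unanimous answers; toy data for illustration.
responses = {
    "Q1": ["True", "True", "True", "True", "True"],
    "Q2": ["True", "False", "True", "True", "True"],   # mixed: counts toward MRR
    "Q3": ["False", "False", "False", "False", "False"],
}

mixed = sum(1 for answers in responses.values() if len(set(answers)) > 1)
mrr = mixed / len(responses)
print(f"MRR = {mrr:.1%}")  # 33.3% on this toy set
```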

5. Limitations, Security Findings, and Known Vulnerabilities

Despite robust bias and adversarial resistance, Claude-3.5 Sonnet exhibits residual vulnerabilities. Systematic code obfuscation (artificial encoding of instructions into visually nonsensical Unicode) revealed that 2.14–4.08% of such encodings were “understood” without nudging, rising to 3.32–6.38% with a nudge. Adversarial attack success rates (ASR) for code-based jailbreaks were among the lowest (0.09 for n=50 encodings), but not zero. Failure cases included successful generation of harmful responses when properly encoded inputs bypassed superficial content moderation, implicating BPE tokenization artifacts as a potential latent channel for instruction leakage (Erziev, 28 Feb 2025).
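The encodings in question map ordinary letters onto visually unusual Unicode code points that preserve legibility for a capable reader while changing the byte-pair tokenization. A benign sketch of one such mapping, using the Mathematical Bold block as an assumed example (the cited audit tests its own set of encodings):

```python
# Re-encode ASCII letters into the Unicode Mathematical Bold block
# (U+1D400..U+1D433); demonstrates the encoding only, with no payload.
def to_math_bold(text: str) -> str:
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)

print(to_math_bold("Hello"))  # "𝐇𝐞𝐥𝐥𝐨": same meaning, different code points,
                              # hence a different BPE token sequence
```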

In cyber risk benchmarks, the model’s lower success rate (14–21%) on zero-shot program exploitation tasks reflects limitations in multi-step vulnerability reasoning and coverage, underlining the ongoing need for human supervision and the risks posed by automated exploit discovery (Ristea et al., 29 Oct 2024).

6. Implications, Recommendations, and Future Work

Research consistently finds Claude-3.5 Sonnet to be highly effective for knowledge-intensive tasks with strong domain generalization, moderate cost and latency, and best-in-class safety controls for its cohort. Recommendations include domain-specific fine-tuning, chain-of-thought prompt design, integration with tool-use and retrieval-augmented agents, and continuous adversarial auditing for both safety and capability updates (Barradas et al., 5 Sep 2025, Syed et al., 15 Aug 2024). For medical and clinical use, human-in-the-loop oversight, dataset augmentation in non-English domains, and explainability monitoring are identified as necessary for reliability and safety (Safavi-Naini et al., 25 Aug 2024, Truyts et al., 26 Jul 2025).

Table: Representative Task Results for Claude-3.5 Sonnet

Domain | Metric/Score | Reference
OlympicArena, Natural Science | Physics 31.16%, Chemistry 47.27% | (Huang et al., 24 Jun 2024)
TransportBench, Zero-shot Eng. | Accuracy 67.1% | (Syed et al., 15 Aug 2024)
Gastroenterology Board (Medical, EN) | 74.0% | (Safavi-Naini et al., 25 Aug 2024)
HCFMUSP Residency (Medical, PT) | 69.6–70.3% | (Truyts et al., 26 Jul 2025)
REST API Test Success Rate | 100% (71.7% branch coverage) | (Barradas et al., 5 Sep 2025)
Automated Cyber Exploitation | Success rate 14.3–21.4% | (Ristea et al., 29 Oct 2024)
Bias/Jailbreak Attack Rate | 0.09 (hidden meaning), ≤13.3% | (Saeed et al., 31 Oct 2024; Erziev, 28 Feb 2025)
Holistic Scoring QWK (AP Chinese) | 0.576–0.896 | (Jiao et al., 24 May 2025)

Claude-3.5 Sonnet exemplifies the current frontier in proprietary LLMs with a balanced profile: strong domain transfer, robust safety, and limited but non-negligible failure modes. Improving performance on multilingual, multimodal, and reasoning-intensive tasks, together with resistance to emergent adversarial prompt artifacts, remains a critical trajectory for future LLM development.
