
Claude 3.5 Sonnet: Multimodal Performance

Updated 11 October 2025
  • Claude 3.5 Sonnet is a proprietary multimodal language model noted for competitive reasoning, multilingual capability, and strong safety features across varied benchmarks.
  • The model outperforms peers such as GPT-4o in disciplines like Physics, Chemistry, and Biology, and posts strong results in medical, multilingual, and cultural evaluations, though it closely trails GPT-4o in overall aggregate accuracy.
  • Robust safety, adversarial resilience, and iterative alignment protocols make Claude 3.5 Sonnet effective for complex tasks, while underlining the need for domain-specific fine-tuning.

Claude 3.5 Sonnet is a proprietary multimodal LLM (MLLM) that has been evaluated across a broad set of academic and real-world benchmarks, with repeated emphasis on its competitive reasoning, multilingual capabilities, safety features, and task-specific strengths. In recent peer-reviewed and preprint studies, the model's architecture and performance have been systematically assessed across domains including STEM disciplines, medical reasoning, ethical decision making, cybersecurity, long-context document analysis, multimodal tasks, and Arabic and Portuguese language understanding.

1. Core Capabilities and Multidisciplinary Benchmarking

Claude 3.5 Sonnet has consistently achieved competitive results on established multi-domain evaluation suites. In the OlympicArena benchmark, which uses an Olympic-style medal ranking system across a spectrum of academic disciplines, Claude 3.5 Sonnet closely trails GPT-4o in overall aggregate accuracy. Notably, Claude 3.5 Sonnet outperforms GPT-4o in Physics (31.16% vs. 30.01%), Chemistry (47.27% vs. 46.68%), and Biology (56.05% vs. 53.11%), showcasing advanced reasoning and deep scientific knowledge (Huang et al., 24 Jun 2024). These results are based on standardized accuracy metrics for non-programming tasks and unbiased pass@k for coding problems, such as

\operatorname{pass}@k := \mathbb{E}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],

with typical values n = 5, k = 1, and c the count of correct completions.
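As a concrete illustration, the unbiased estimator can be computed directly from n, c, and k; this is a minimal sketch, not code from any of the cited evaluations:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n completions is correct,
    given c correct completions among the n."""
    if n - c < k:
        # Fewer incorrect completions than samples: success is certain.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=5 completions, c=2 correct, pass@1 reduces to c/n = 0.4.
print(pass_at_k(5, 2, 1))
```

Note that for k = 1 the estimator reduces to the simple fraction c/n, which is why accuracy and pass@1 coincide for single-sample evaluation.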

The medal-based paradigm allows for domain-specific strengths to be clearly visualized, and positions Claude 3.5 Sonnet as a leading model in complex, causality-driven academic domains.
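A minimal sketch of the medal-style aggregation idea: models are ranked lexicographically by gold, then silver, then bronze counts. The model names and counts here are hypothetical, and the actual OlympicArena scoring is more involved:

```python
def medal_rank(models: dict) -> list:
    """Rank models best-first by (gold, silver, bronze) medal counts,
    compared lexicographically as in an Olympic medal table."""
    return sorted(models, key=lambda name: models[name], reverse=True)

# Hypothetical leaderboard: counts are (gold, silver, bronze).
leaderboard = {"Model-A": (2, 0, 1), "Model-B": (2, 1, 0), "Model-C": (3, 0, 0)}
print(medal_rank(leaderboard))  # Model-C first: most golds wins outright.
```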

2. Medical and Multilingual Performance

Brazilian Portuguese Medical Reasoning

Claude 3.5 Sonnet demonstrates robust zero-shot performance on high-stakes assessments outside of English, notably on the Hospital das Clínicas medical residency entrance exam in Brazilian Portuguese. In text-only questions, accuracy is approximately 70.27%, closely matching the peak of human candidate score distributions (65%–70%). With multimodal questions (text + image), the model sustains a leading accuracy of 69.57%, reflecting both linguistic and clinical reasoning strength (Truyts et al., 26 Jul 2025).

Processing time per question increases in multimodal settings (13.02 seconds per question), yet the trade-off between speed and accuracy remains favorable compared to other state-of-the-art models (e.g., Claude 3 Opus at 24.68 s/question). However, as with peer models, accuracy drops for image-centered tasks, underscoring a persistent multimodal reasoning challenge. Explanatory rationales generated by the model are highly coherent for correct responses (>94%) but less so for errors (~87%), mirroring patterns observed among human candidates.

Arabic Language and Cultural Competence

In a large-scale, expert-curated Arabic Depth Mini Dataset (ADMD) spanning 10 domains and 42 subdomains, Claude 3.5 Sonnet achieves the highest overall accuracy (30%), with standout performance in mathematical theory (True accuracy 50%), Arabic language, and Islamic studies. Across domains with complex cultural and linguistic nuance, Claude 3.5 Sonnet demonstrates fewer serious errors than other models, though all systems face pronounced challenges in culturally loaded categories (Sibaee et al., 2 Jun 2025).

The evaluation relies on a comprehensive, LaTeX-structured scoring framework focusing on language rules, scientific writing, cultural values, and accuracy, reflecting the importance of balanced technical and cultural competencies for effective model deployment in diverse linguistic environments.

3. Strengths and Weaknesses Across Specialized Tasks

Aspect-Based Sentiment and Long-Context Reasoning

In Aspect-Based Sentiment Analysis (ABSA) with synthetic, multi-domain conversation datasets, Claude 3.5 Sonnet matches leading models in recall (near 0.98) across all domains (Technology, Healthcare, Finance, Legal), though precision is lower than DeepSeek-R1. The high recall and balanced F1-score (e.g., Precision 0.71, Recall 0.98, F1 0.82 in Technology) make it well-suited for applications prioritizing coverage over minimal false positives, albeit with slower inference (7 s vs. 2 s for Gemini 1.5 Pro) (Pandit et al., 30 May 2025).
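The reported F1 figures follow directly from the harmonic mean of precision and recall; a quick check against the Technology-domain numbers above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Technology domain: precision 0.71, recall 0.98 -> F1 ~ 0.82.
print(round(f1_score(0.71, 0.98), 2))
```

The harmonic mean penalizes imbalance, which is why a recall near 0.98 cannot fully offset the lower precision in the combined score.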

In long-context sentiment analysis, especially for lengthy, complex, and sentiment-fluctuating infrastructure project documents, Claude 3.5 Sonnet outperforms GPT-4o in both zero-shot and few-shot scenarios with higher accuracy on dynamic, contamination-free datasets (Shamshiri et al., 15 Oct 2024). Methodologically, evaluations involve majority-voting label assignment and agreement metrics (e.g., Fleiss’ Kappa), underscoring rigorous annotation and benchmarking.
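Fleiss' Kappa, the agreement metric mentioned above, can be computed from a matrix of per-item category counts (one row per item, one column per sentiment label, entries counting how many annotators chose each label). A self-contained sketch, assuming every item receives the same number of ratings:

```python
def fleiss_kappa(counts: list) -> float:
    """Fleiss' kappa for a list of per-item category-count rows,
    assuming the same number of raters rated every item."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Observed agreement, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Expected (chance) agreement from marginal category proportions.
    totals = [sum(col) for col in zip(*counts)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Three items, three raters, two labels; unanimous on every item -> kappa = 1.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))
```

Kappa of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate agreement below chance.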

Medical Reasoning in English and Multimodal Gastroenterology

Performance on English medical board-style MCQs is similarly strong. On a 300-question gastroenterology board exam, Claude 3.5 Sonnet achieves 74.0% accuracy—matching human test-taker averages and nearly identical to GPT-4o—especially when configured with best-practice prompt engineering: expert mimicry, chain-of-thought reasoning, and explicit confidence scoring (Safavi-Naini et al., 25 Aug 2024). Model accuracy is sensitive to prompt and output format; function call-based structured output improves extraction by up to 10%.
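A hedged sketch of the structured-output pattern described above; the system prompt, schema fields, and helper names here are illustrative placeholders, not the study's actual artifacts:

```python
import json

# Illustrative prompt pattern: expert mimicry, chain-of-thought reasoning,
# and explicit confidence, returned as structured JSON so that answer
# extraction does not depend on brittle free-text parsing.
SYSTEM_PROMPT = (
    "You are a board-certified gastroenterologist. "
    "Reason step by step, then answer the multiple-choice question."
)

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "answer": {"type": "string", "enum": ["A", "B", "C", "D"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["reasoning", "answer", "confidence"],
}

def extract_answer(raw_json: str):
    """Parse the model's structured reply; return None on malformed output
    so failures can be counted rather than silently mis-scored."""
    try:
        return json.loads(raw_json).get("answer")
    except json.JSONDecodeError:
        return None
```

Constraining the reply to a schema is what the function-call-style configuration buys: the grader reads one field instead of searching prose, which is consistent with the reported ~10% gain in extraction reliability.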

However, in vision-language settings for engineering drawing information extraction, zero-shot Claude 3.5 Sonnet shows lower precision (44.01%) but higher recall (37.27%) than GPT-4o, and performs well below fine-tuned smaller models (e.g., Florence-2), underscoring the ongoing need for domain-specific tuning in precision-critical workflows (Khan et al., 6 Nov 2024).

4. Robustness, Safety, and Bias

Claude 3.5 Sonnet exhibits notable safety characteristics in red-teaming and adversarial robustness evaluations:

  • Jailbreak Resistance: The model shows lower attack success rates and higher adversarial robustness (F1=33.3%, Attack Success Rate 77.8%) in multi-step jailbreak scenarios simulating verbal attacks, when compared to leading alternatives like GPT-4o and Grok-2 Beta (Wang, 23 Nov 2024). Layered prompts attempting to bypass guardrails are less likely to compromise Claude 3.5 Sonnet.
  • Cultural and Social Bias: In a comprehensive Arab-centric red-teaming study across eight domains (e.g., religion, terrorism, women's rights), Claude 3.5 Sonnet is the "safest" (lowest jailbreak rate) but still exhibits negative bias toward Arabs in 7/8 categories. This highlights the persistence of culturally targeted bias even under strong safety alignment (Saeed et al., 31 Oct 2024).
  • Ethical Dilemmas and Protected Attributes: Claude 3.5 Sonnet demonstrates more diverse attribute selections and less reinforcement of traditional power structures (e.g., higher frequencies for “Feminine” and “Androgynous” in gender, balanced sensitivity in intersectional cases), compared to GPT-3.5 Turbo. Sensitivity to linguistic referents (e.g., “Asian” vs. “Yellow”) highlights the granularity required for bias audit (Yan et al., 17 Jan 2025).

In causal reasoning and “illusion of causality” tasks (e.g., news headline generation), Claude 3.5 Sonnet is the most robust against both correlation-to-causation exaggeration and sycophantic prompt bias, aligning more closely with human press release benchmarks (≈22% causal illusion rate) than GPT-4o-Mini or Gemini-1.5-Pro (Carro et al., 15 Oct 2024).

5. Cybersecurity and Autonomous Agent Behavior

Claude 3.5 Sonnet is evaluated in complex real-world simulation frameworks that assess autonomous agents:

  • Exploitation and Penetration Testing: Within the Cybench and AI Cyber Risk Benchmark settings, Claude 3.5 Sonnet attains moderate success rates (unguided task success ≈17.5%, exploitation rates 11.76%–17.65%). The model is most effective when leveraging agent scaffolding (structured bash) and subtask guidance, able to automate penetration testing for moderately difficult vulnerabilities. Nonetheless, it underperforms top models like o1-preview (which reached 64.71%) due to near-miss output and inability on more complex exploitation chains, but remains cost-effective and fast (Zhang et al., 15 Aug 2024, Ristea et al., 29 Oct 2024).
  • Long-Term Coherent Reasoning: On the Vending-Bench test for multi-step, long-horizon economic tasks, Claude 3.5 Sonnet balances inventory and pricing to maintain positive net worth in most runs (mean ≈$2,217.93), but exhibits high performance variance, generally resulting from breakdowns in delivery-tracking logic or tangential error loops. The agent’s ability to autonomously acquire and manage capital is described as “necessary in many hypothetical dangerous AI scenarios”—underscoring both its utility and dual-use risk (Backlund et al., 20 Feb 2025).

6. Technical Innovations and Model Design

Advanced self-alignment and rule-based alignment protocols illustrate Claude 3.5 Sonnet’s capacity for robust, annotation-free compliance. The Iterative Graph Alignment (IGA) process combines Iterative Graph Prompting (IGP) as a teacher VLM (e.g., constructing logical graphs) and a Self-Aligned Incremental Learning (SAIL) student LLM with the assistance of helper models, yielding a 73.12% improvement in rule-based alignment versus naive prompting (Yu et al., 29 Aug 2024). This iterative supervised fine-tuning mechanism enables scalable patching of local “representation gaps” for rule-sensitive tasks.

In poetry analysis and explainability, Claude 3.5 Sonnet reaches F1 scores above 0.92 on fixed poetic form detection (especially sonnets), but the results reveal a reliance on pretraining data and a risk of memorization: accuracy drops are observed on manually digitized, web-rare sources. Similarly, for Farsi stance detection with explainable rationales, few-shot Claude 3.5 Sonnet leads in the human-judged Overall Explanation Score (≈87.8), confirming balanced generation of both correct classifications and complete, linguistically aligned explanations (Zarharan et al., 18 Dec 2024).

Technical investigations reveal that the model can assign hidden meanings to incomprehensible token sequences, likely due to spurious correlations in BPE tokenization, presenting a potential attack vector (ASR ≈0.09); though more robust than peers, this observation raises alignment and moderation concerns (Erziev, 28 Feb 2025).

7. Limitations and Future Directions

While Claude 3.5 Sonnet is a top performer, persistent issues include:

  • Drop-off in multimodal performance, both in accuracy and efficiency, particularly for image-based medical or technical queries.
  • Need for domain-specific fine-tuning for high-precision tasks; zero-shot performance is insufficient for applications like engineering drawing extraction without significant post-processing or human verification.
  • Residual cultural and social bias, even under strong safety and guardrail regimes.
  • Susceptibility to high-variance outcomes in long-term autonomous deployment, especially when exposed to ambiguous or underspecified operational contexts.

Future research directions highlighted include: enhanced retrieval-augmented generation, improved multimodal fusion methodologies, cross-lingual domain adaptation, iterative benchmark updating, more granular adversarial training, and deployment-oriented explainability protocols to ensure both accuracy and safety in clinical and high-stakes decision processes.


In summary, Claude 3.5 Sonnet represents a leading multimodal LLM with documented strengths in scientific reasoning, medical assessment, multilingual and multicultural tasks, ethical decision making, and robust safety design. Its ongoing evaluation across diverse academic and applied domains suggests areas of both immediate utility and required future research for safe, reliable, and equitable AI deployment.
