Claude 3.5/3.7 Sonnet: Benchmarking & Capabilities
- Claude 3.5/3.7 Sonnet are advanced proprietary LLMs by Anthropic, designed for general-purpose intelligence with strong reasoning, factual accuracy, and multimodal understanding.
- They achieved competitive results in benchmarks like OlympicArena and NPPC, excelling in physical sciences, engineering, and medical reasoning tasks.
- While demonstrating reliable tool orchestration, they exhibit limitations in deductive logic and code synthesis, motivating ongoing research into robust prompting strategies and agentic modularity.
Claude 3.5/3.7 Sonnet are proprietary LLMs produced by Anthropic, forming part of the Claude 3 series. These models are designed for general-purpose artificial intelligence tasks, with a particular focus on reasoning, factual accuracy, tool orchestration, and multimodal understanding. Both variants have been systematically evaluated across an array of academic and industrial benchmarks and have demonstrated competitive or state-of-the-art performance in scientific, engineering, automated-scoring, agentic, and ethical reasoning domains.
1. Benchmark Performance and Comparative Standing
Claude 3.5 Sonnet and Claude 3.7 Sonnet are highly ranked in large-scale benchmarking studies. On the OlympicArena benchmark—which tasks LLMs with a wide variety of multi-disciplinary and multi-modal challenges—Claude 3.5 Sonnet achieved an overall accuracy of 39.24%, closely matching GPT-4o (40.47%). Distinctively, Claude 3.5 Sonnet outperformed GPT-4o in Physical Sciences (Physics: 31.16% vs. 30.01%; Chemistry: 47.27% vs. 46.68%; Biology: 56.05% vs. 53.11%). However, GPT-4o was superior in mathematics (28.32% vs. 23.18%) and computer science (pass@1: 8.43% vs. 5.19%), indicating a comparative weakness in rigorous deductive logic and coding tasks (Huang et al., 24 Jun 2024).
On the NPPC "ever-scaling" reasoning benchmark of NP-complete problems, Claude 3.7 Sonnet consistently ranked among the top non-reasoning models. While reasoning-optimized models like DeepSeek-R1 sometimes surpassed it, Claude 3.7 Sonnet excelled when handling exceedingly long contexts (such as the Superstring problem, maintaining over 50% accuracy on the hardest levels), and demonstrated efficient token usage for large problem instances (Yang et al., 15 Apr 2025).
In geospatial analytics (GeoBenchX), Sonnet 3.5 led in workflow efficiency, matching GPT-4o in overall problem-solving ability, and was noted for balanced rejection of unsolvable tasks. Sonnet 3.7 produced longer, less concise solutions but still performed at a practical level (Krechetova et al., 23 Mar 2025).
2. Task-Specific Strengths and Limitations
Claude 3.5/3.7 Sonnet’s strengths are especially pronounced in tasks requiring broad knowledge integration:
- Scientific Reasoning: Excellent fusion of domain-specific background and inference, particularly in the physical sciences (Huang et al., 24 Jun 2024).
- Engineering Problem Solving: Strong accuracy (67.1%) and detailed, consistent reasoning on undergraduate- and graduate-level transportation systems engineering problems (TransportBench), outperforming GPT-4/4o especially on advanced topics (Syed et al., 15 Aug 2024).
- Medical Reasoning: Achieved 74.0% accuracy on medical board-style questions in gastroenterology, outperforming other Claude 3 models and matching GPT-4o when optimized with prompt engineering and parameter tuning (Safavi-Naini et al., 25 Aug 2024).
- Vision-Language CAD Feature Recognition: Using multi-view, chain-of-thought, and sequential prompting, Claude 3.5 Sonnet achieved the highest feature quantity (74%) and name-matching accuracy (75%) on manufacturing CAD models, as well as the lowest mean absolute error (MAE = 3.2) (Khan et al., 5 Nov 2024).
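The multi-view, chain-of-thought, and sequential prompting strategy used in the CAD study can be illustrated by how such a request is assembled. The sketch below builds a message payload in the Anthropic Messages API format; the prompt wording, the `feature_list_so_far` feedback mechanism, and the function name are illustrative assumptions, not the paper's exact protocol.

```python
import base64

def build_cad_query(view_images, feature_list_so_far=None):
    """Assemble a multi-view, chain-of-thought message payload
    (sketch; prompt text is hypothetical, not the study's exact wording)."""
    content = []
    # One image block per rendered orthographic view of the CAD model.
    for png_bytes in view_images:
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(png_bytes).decode("ascii"),
            },
        })
    # Chain-of-thought instruction: reason step by step before answering.
    prompt = (
        "These images show orthographic views of one machined part. "
        "Think step by step about the geometry visible in each view, then "
        "list every manufacturing feature with its name and count."
    )
    # Sequential prompting: feed earlier answers back into the next turn.
    if feature_list_so_far:
        prompt += " Features identified so far: " + ", ".join(feature_list_so_far)
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}
```

The returned dict is what would be placed in the `messages` list of a Messages API call; keeping payload construction separate from the API call makes the prompting strategy easy to inspect and test.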
Notable limitations include:
- Deductive and Coding Reasoning: Lower performance on tasks requiring highly structured logical deduction or code synthesis compared to GPT-4o (Huang et al., 24 Jun 2024).
- Self-Checking Consistency: In engineering domains, prompting Claude 3.5 Sonnet to double-check solutions sometimes introduced new errors or answer reversals, in contrast to GPT-4/4o, which improved or maintained answer quality under self-check prompts (Syed et al., 15 Aug 2024).
- Robustness to Visual Inputs: While proficient at text-based causal and compositional reasoning (75–81% accuracy), accuracy degraded (to 47%) when integrating image inputs for affordance reasoning, in contrast to GPT-4o, which maintained performance (Gjerde et al., 23 Feb 2025).
- Ethical Moderation: Claude 3.7 Sonnet enforces absolute refusals of sexually explicit or intimate content, returning identical, categorical rejections regardless of prompt context, reflecting a strict deontological ethics design (Lai, 5 Jun 2025).
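The self-checking inconsistency noted above can be probed with a simple wrapper that re-submits a model's answer for verification and flags reversals. This is a minimal sketch under stated assumptions: the prompt wording is hypothetical, and `model` is any callable returning text (here exercised with stubs rather than a real API).

```python
def self_check(model, question, first_answer):
    """Ask the model to double-check its own answer and report whether
    it reversed it. Prompt wording is illustrative, not a published protocol."""
    prompt = (
        f"Question: {question}\n"
        f"Your previous answer: {first_answer}\n"
        "Double-check this answer step by step. Reply with the final answer only."
    )
    revised = model(prompt)
    reversed_answer = revised.strip() != first_answer.strip()
    return revised, reversed_answer
```

Logging the `reversed_answer` flag across a benchmark is one way to quantify the answer-reversal behavior reported for Claude 3.5 Sonnet versus the answer-preserving behavior of GPT-4/4o.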
3. Reasoning, Causality, and Bias
Comprehensive experiments demonstrate Claude 3.5/3.7 Sonnet’s competitiveness in abstract reasoning and its resilience to cognitive pitfalls:
- Causal and Compositional Reasoning: Achieves near-human performance on standard (text) tasks involving functional object substitution, improves with chain-of-thought prompting, but suffers on image-based variants (Gjerde et al., 23 Feb 2025).
- Causal Illusion Avoidance: Claude 3.5 Sonnet is least prone among peers to infer false causality from spurious correlations in news summarization tasks and shows strong resistance to sycophantic mimicry, outperforming GPT-4o-Mini and Gemini 1.5 Pro (Carro et al., 15 Oct 2024).
- Ethical Decision-Making Bias: In simulated ethical dilemmas, Claude 3.5 Sonnet exhibited more diverse and balanced preferences over protected attributes (age, gender, race, etc.) than GPT-3.5 Turbo, with less alignment to traditional power structures, and more context-sensitive stability in complex, intersectional scenarios (Yan et al., 17 Jan 2025).
4. Practical Applications and Agentic Orchestration
Claude 3.5/3.7 Sonnet have been directly integrated into agentic frameworks across multiple domains:
- Cybersecurity: Using a structured agent scaffold (with Reflection, Plan, Thought, Action), Claude 3.5 Sonnet autonomously identified vulnerabilities and executed exploits on professional-level CTF tasks, matching or exceeding GPT-4o on unguided tasks solvable by humans in under 11 minutes (Zhang et al., 15 Aug 2024).
- Malware Analysis: As the core LLM in the r2ai plugin for Radare2, both Sonnet models rapidly interpret assembly/disassembly context and suggest tool invocations, shortening Linux/IoT malware reverse engineering by 1–3 days per sample compared to manual methods. Experienced analyst oversight remains essential, however, owing to occasional hallucinations and omissions (Apvrille et al., 10 Apr 2025).
- Automated Scoring: Claude 3.5 Sonnet delivers high scoring accuracy (Quadratic Weighted Kappa values up to 0.708), high intra-rater consistency (Cronbach α = 0.80–0.95), and only slight leniency in rater effect per the Many-Facet Rasch model. On some holistic essay tasks, its agreement with human raters exceeds that between human raters themselves, supporting deployment for automated low-stakes assessment (Jiao et al., 24 May 2025).
- Mobile GUI Agents: In interactive mobile GUI environments (AndroidWorld), Claude 3.7 Sonnet with chain-of-thought reasoning (“Thinking” variant) delivered state-of-the-art task completion rates (improving the baseline by 9.5 percentage points), although in static benchmarks the added reasoning sometimes degraded accuracy and incurred high token costs (up to 14× more output tokens) (Zhang et al., 21 Mar 2025).
- General-Purpose Computer Use Agents: As a data synthesizer in PC Agent-E’s Trajectory Boost step, Claude 3.7 Sonnet generated diverse alternative actions per trajectory step, creating branching “Traj Trees” that drove a 141% improvement over strong LLM baselines on WindowsAgentArena-V2 and considerable cross-platform generalization (He et al., 20 May 2025).
- Drug Discovery Agents: Both Claude-3.5-Sonnet and Claude-3.7-Sonnet excelled in orchestrating tool-calling and code-generating agents to answer industry-standard cheminformatics questions, matching GPT-4o in aggregated LLM-as-judge scores and revealing that prompt and agent architecture selection is critical for optimal performance (Weesep et al., 27 Jun 2025).
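The structured CTF scaffold (Reflection, Plan, Thought, Action) cited above can be sketched as a simple control loop. This is a hypothetical reconstruction: the field names, the `finish` sentinel, and the `llm`/`execute` callables are illustrative assumptions, not the paper's exact schema.

```python
def run_scaffold_agent(llm, execute, max_steps=15):
    """Minimal Reflection/Plan/Thought/Action loop (sketch).
    `llm` maps the history to a dict with the four scaffold fields plus an
    'action'; `execute` runs that action (e.g., a shell command) and returns
    its output. Both are caller-supplied stand-ins for real components."""
    history = []
    for _ in range(max_steps):
        step = llm(history)
        if step.get("action") == "finish":
            history.append(step)
            return step.get("answer"), history
        # Execute the proposed action and record the observation for the
        # next reflection step.
        step["observation"] = execute(step["action"])
        history.append(step)
    return None, history  # step budget exhausted without an answer
```

Separating the scaffold loop from the model and the tool executor makes the orchestration testable with stubs, which is also how agent frameworks are typically unit-tested before being pointed at a live model.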
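The Quadratic Weighted Kappa figure reported for automated scoring follows a standard definition: agreement between two raters is penalized by the squared distance between their scores, normalized against chance agreement. A minimal pure-Python implementation of that standard formula (not the study's own code) is:

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Standard QWK: 1 - (weighted observed disagreement) /
    (weighted disagreement expected by chance from the marginals)."""
    n_cat = max_rating - min_rating + 1
    n = len(rater_a)
    # Observed rating co-occurrence matrix.
    O = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating][b - min_rating] += 1
    row = [sum(O[i]) for i in range(n_cat)]
    col = [sum(O[i][j] for i in range(n_cat)) for j in range(n_cat)]
    num = den = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            w = (i - j) ** 2 / (n_cat - 1) ** 2  # quadratic penalty
            num += w * O[i][j]
            den += w * row[i] * col[j] / n       # chance-expected counts
    return 1.0 - num / den
```

Perfect agreement yields 1.0 and perfectly reversed ratings yield -1.0, so values around 0.7, as reported above, indicate substantial agreement with human raters.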
5. Model Robustness, Safety, and Security
Claude 3.5/3.7 Sonnet models possess advanced abilities in pattern recognition, but these also introduce new safety and interpretability challenges:
- Recognition of Hidden Meanings: Both models (especially Claude 3.7 Sonnet) can decode seemingly random Unicode or visually incomprehensible symbol sequences, a phenomenon tied to BPE tokenization artifacts. This confers capabilities for covert communication, but also exposes security vulnerabilities, as adversaries may use such encodings to bypass filters (albeit with relatively low attack success rates under default settings) (Erziev, 28 Feb 2025).
- Memorization vs. Understanding: In poetic form recognition tasks, Claude 3.5/3.7 Sonnet score extremely high for canonical forms such as sonnets (F₁ ≈ 0.95), but this performance is partially attributable to memorization from pretraining data; the models can falter on unconventional forms or less-documented inputs (Walsh et al., 27 Jun 2024).
- Token Economy and Efficiency: In long or multi-step tasks (e.g., NPPC, GeoBenchX), Claude 3.7 Sonnet demonstrated efficient token usage relative to competitor models (such as OpenAI's GPT-4o and o-series variants), a quality that benefits large-context and real-time applications but may come at the cost of less granular internal reasoning (Yang et al., 15 Apr 2025, Krechetova et al., 23 Mar 2025).
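The hidden-meaning phenomenon above can be made concrete with one well-known Unicode covert-text scheme: smuggling ASCII inside the invisible "tag" characters (U+E0020–U+E007E). This is an illustrative assumption — the cited work's exact encoding may differ — but it shows why such strings look like noise to content filters yet remain mechanically decodable.

```python
# Illustrative only: hide an ASCII message in Unicode tag characters.
# Tag characters render as invisible/blank in most contexts, so the carrier
# string appears empty or garbled to a human or a naive filter.
TAG_BASE = 0xE0000

def hide(msg: str) -> str:
    """Shift each printable-ASCII character into the tag-character block."""
    return "".join(chr(TAG_BASE + ord(c)) for c in msg)

def reveal(s: str) -> str:
    """Recover the hidden ASCII by shifting tag characters back down,
    ignoring any non-tag characters mixed into the carrier."""
    return "".join(
        chr(ord(c) - TAG_BASE)
        for c in s
        if TAG_BASE + 0x20 <= ord(c) <= TAG_BASE + 0x7E
    )
```

A model that has learned, via byte-level BPE fallback, how these code points decompose into bytes can in principle read `hide("...")` output directly, which is exactly the dual-use capability the limitation above describes.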
6. Nuanced Reasoning and Modularity in Agentic Systems
Empirical analysis of agent systems employing Claude 3.5/3.7 Sonnet highlights the importance—and complexity—of modular agent design:
- Substituting different LLMs or agent types (tool-calling vs. code-generating) within modular architecture is not trivial. Task performance is sensitive to the chosen prompt, agent structure, and domain specificity. Although Claude models performed strongly across configurations, agent and prompt type influenced outcomes in highly question-dependent ways (Weesep et al., 27 Jun 2025).
- Interchangeability of model components requires prompt re-engineering and careful orchestration to prevent degradation and to maximize reliability when deploying LLM-based agents in real-world scenarios.
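The tool-calling versus code-generating distinction can be sketched as two agent classes behind one shared interface, which is what makes component substitution possible at all — and also shows why it is not free, since each class embeds its own prompt. All names, prompts, and the stub tools below are hypothetical illustrations, not the evaluated systems.

```python
from typing import Callable, Dict, Protocol

class Agent(Protocol):
    """Shared interface that makes agents swappable in an orchestrator."""
    def answer(self, question: str) -> str: ...

class ToolCallingAgent:
    """Routes a question to one of a fixed set of named tool functions."""
    def __init__(self, llm: Callable[[str], str], tools: Dict[str, Callable[[str], str]]):
        self.llm, self.tools = llm, tools
    def answer(self, question: str) -> str:
        # The LLM picks a tool name; prompt wording is agent-specific.
        tool_name = self.llm(
            f"Pick one tool name from {sorted(self.tools)} to answer: {question}"
        )
        return self.tools[tool_name.strip()](question)

class CodeGeneratingAgent:
    """Asks the LLM for Python that assigns the answer to `result`."""
    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
    def answer(self, question: str) -> str:
        code = self.llm(
            f"Write Python assigning the answer to a variable `result`: {question}"
        )
        scope: dict = {}
        exec(code, scope)  # illustration only: real systems sandbox this
        return str(scope["result"])
```

Because both classes satisfy the same `Agent` protocol, an orchestrator can swap one for the other, but each swap changes the prompt the LLM sees, which is the prompt-re-engineering cost noted above.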
7. Implications and Outlook
Claude 3.5/3.7 Sonnet represent a class of high-performing, general-purpose LLMs whose main strengths lie in factual reasoning, scientific analysis, robust tool orchestration, and textual understanding across a broad spectrum of tasks. Their limitations—observed in multimodal integration, fine-grained code synthesis, deductive reasoning, and complex ethical or safety-sensitive contexts—are actively being investigated. Empirical results suggest that continued research is needed into refined prompting, more robust self-checking, secure and explainable model behavior, and agentic modularity in order to advance toward artificial general intelligence.
The systematic evaluation of these models, as expressed through rigorous academic benchmarking and domain-specific deployments, provides a critical reference for researchers developing or deploying LLM-based intelligent systems across science, engineering, the humanities, and computational agent frameworks.