News

Anthropic's Claude 3.5 Sonnet Benchmarks Beat GPT-4: What Changed

Claude 3.5 Sonnet outperforms GPT-4 on key benchmarks—technical analysis of improvements and implications for agent builders.

OpenHelm Team· Content

·Sep 14, 2024·5 min read

The News: Anthropic's Claude 3.5 Sonnet now outperforms GPT-4 on multiple benchmarks: MMLU (88.7% vs 86.4%), HumanEval coding (92% vs 67%), and instruction following.

Key Improvements:

1. Coding capability leap

HumanEval: 92% (Claude 3.5) vs 67% (GPT-4)

Agents writing code, API calls, or data transformations benefit significantly.

2. Instruction following

Better at following complex multi-step instructions without hallucinating steps.

3. Context coherence

200K token context window with better retention across full context vs GPT-4's 128K.

Implications for Agent Builders:

Switch to Claude 3.5 if:

Agent writes code or SQL queries
Long documents (>50K tokens)
Complex instruction chains
Cost-sensitive (3x cheaper than GPT-4)

Stick with GPT-4 if:

Heavily invested in OpenAI ecosystem
Using GPT-4V (vision) capabilities
Function calling maturity critical

Real-world test (support agent classification):

Claude 3.5: 91% accuracy, 1.6s latency, £14/1K queries
GPT-4 Turbo: 89% accuracy, 1.8s latency, £18/1K queries

Claude wins on all three metrics.

Sources:

Anthropic Claude 3.5 Announcement

Anthropic's Claude 3.5 Sonnet Benchmarks Beat GPT-4: What Changed

More from the blog

OpenHelm vs runCLAUDErun: Which Claude Code Scheduler Is Right for You?

Claude Code vs Cursor Pro: Real Developer Cost Comparison