News

Anthropic's Claude 3.5 Sonnet Benchmarks Beat GPT-4: What Changed

Claude 3.5 Sonnet outperforms GPT-4 on key benchmarks—technical analysis of improvements and implications for agent builders.

M
Max Beech· Founder
··5 min read

The News: Anthropic's Claude 3.5 Sonnet now outperforms GPT-4 on multiple benchmarks: MMLU (88.7% vs 86.4%), HumanEval coding (92% vs 67%), and instruction following.

Key Improvements:

1. Coding capability leap

HumanEval: 92% (Claude 3.5) vs 67% (GPT-4)

Agents writing code, API calls, or data transformations benefit significantly.

2. Instruction following

Better at following complex multi-step instructions without hallucinating steps.

3. Context coherence

200K token context window with better retention across full context vs GPT-4's 128K.

Implications for Agent Builders:

Switch to Claude 3.5 if:

  • Agent writes code or SQL queries
  • Long documents (>50K tokens)
  • Complex instruction chains
  • Cost-sensitive (3x cheaper than GPT-4)

Stick with GPT-4 if:

  • Heavily invested in OpenAI ecosystem
  • Using GPT-4V (vision) capabilities
  • Function calling maturity critical

Real-world test (support agent classification):

  • Claude 3.5: 91% accuracy, 1.6s latency, £14/1K queries
  • GPT-4 Turbo: 89% accuracy, 1.8s latency, £18/1K queries

Claude wins on all three metrics.

Sources:

  • Anthropic Claude 3.5 Announcement

More from the blog

Stop doing the work around the work

OpenHelm connects to your tools, reads the context, and does the steps, so you sign off on the result instead of producing it. See how it covers an entire role’s weekly workload, check the pricing, or run it yourself with the free local app.