Claude 2/3/3.5 Benchmarks & Reviews

This page compiles assessments and evaluations of Claude models. The benchmarks measure the models’ capabilities across diverse NLP tasks, including textual entailment, question answering, summarization, and dialogue. I hope this page gives you a comprehensive overview of the Claude models’ language proficiencies and how they compare to other state-of-the-art AI systems.

Claude in the history of Large Language Models

Benchmarks & Reviews

Here are the scores of Claude 3.5 Sonnet:

Here are the scores of Claude 3 Models:

Here are the scores of Claude 2 on popular tests:

76.5% (Claude 2 score on Bar exam multiple choice)
73.0% (Claude 1.3 score on Bar exam multiple choice)
90th percentile (Claude 2 GRE reading/writing score compared to grad school applicants)
median (Claude 2 GRE quantitative reasoning score compared to grad school applicants)
71.2% (Claude 2 score on Codex HumanEval)
56.0% (Previous Claude score on Codex HumanEval)
88.0% (Claude 2 score on GSM8k math problems)
85.2% (Previous Claude score on GSM8k math problems)
2x better (Claude 2 vs Claude 1.3 at giving harmless responses)
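
To make the Claude 2 gains easier to compare, here is a minimal Python sketch that computes the change for the three benchmarks above that list both a Claude 2 score and an earlier score. The numbers are copied from the list; the names `scores` and `improvements` are only illustrative, and the "earlier" model is Claude 1.3 for the Bar exam and the unspecified "previous Claude" for the other two.

```python
# Scores copied from the list above: (benchmark, earlier score, Claude 2 score), in percent.
# Assumption: "earlier" is Claude 1.3 for the Bar exam and "previous Claude" otherwise.
scores = [
    ("Bar exam multiple choice", 73.0, 76.5),
    ("Codex HumanEval", 56.0, 71.2),
    ("GSM8k math problems", 85.2, 88.0),
]


def improvements(rows):
    """Return (benchmark, percentage-point gain, relative gain in %) for each row."""
    result = []
    for name, before, after in rows:
        gain = after - before              # absolute gain in percentage points
        relative = gain / before * 100     # gain relative to the earlier score
        result.append((name, round(gain, 1), round(relative, 1)))
    return result


for name, gain, relative in improvements(scores):
    print(f"{name}: +{gain} points ({relative}% relative)")
```

By this measure, the largest jump is on Codex HumanEval (15.2 percentage points), while the Bar exam and GSM8k gains are a few points each.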

Reviews from various sources:

How Good is the Claude 2 AI at Working With PDFs? – Let’s Find Out – page
Model Card and Evaluations for Claude Models – PDF
Claude 3.5 Sonnet Model Card Addendum – PDF
Claude 3 Model Card – PDF
ARB: Advanced Reasoning Benchmark for Large Language Models – PDF
LLM hallucinations graded – Google Sheet
Llama 2 vs Claude 2 vs GPT-4 – video
After using Claude 2 by Anthropic for 12 hours straight, here’s what I found – Reddit Discussion
How strong is Claude 2? – video
What to Know About Claude 2, Anthropic’s Rival to ChatGPT – page

Got a question or a recommendation? Please send me a message at [email protected].
