51 models · 80 prompts · 8 categories · Judges: gpt-4.1, claude-sonnet-4.6, gemini-2.5-flash, qwen3-235b · Updated Jun 10, 2026 12:27
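| Category | Top model | Score |
|---|---|---|
| Behavioural | gpt-5.5 | 97 |
| Coding | gpt-5.4 | 99 |
| Instruction Following | gpt-oss-120b | 98 |
| Learning | claude-fable-5 | 100 |
| Meta | claude-opus-4.8 | 100 |
| Reasoning | claude-opus-4.8 | 98 |
| Research | claude-fable-5 | 98 |
| Writing | gemini-3-pro | 99 |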

Category Heatmap (Top 15)

Composite score (0 to 100) per category for the top 15 overall models. Full leaderboard is on the Generalist page.

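| Category | claude-opus-4.8 | gpt-5.3 | claude-fable-5 | claude-opus-4.7 | claude-sonnet-4.6 | gpt-5.5 | gpt-5.2 | gpt-5.4 | claude-opus-4.6 | gpt-5.1 | claude-opus-4.5 | claude-sonnet-4.5 | gemini-3-flash | glm-5 | kimi-k2.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Behavioural | 93 | 95 | 90 | 95 | 91 | 97 | 90 | 89 | 93 | 91 | 91 | 91 | 87 | 88 | 90 |
| Coding | 97 | 98 | 96 | 96 | 95 | 94 | 96 | 99 | 94 | 95 | 94 | 92 | 93 | 91 | 94 |
| Instruction Following | 90 | 97 | 93 | 89 | 91 | 93 | 97 | 96 | 86 | 90 | 85 | 90 | 92 | 97 | 81 |
| Learning | 100 | 99 | 100 | 100 | 100 | 99 | 100 | 99 | 100 | 100 | 99 | 98 | 99 | 98 | 100 |
| Meta | 100 | 85 | 100 | 98 | 98 | 82 | 86 | 86 | 93 | 94 | 97 | 93 | 89 | 79 | 87 |
| Reasoning | 98 | 95 | 96 | 96 | 95 | 95 | 94 | 90 | 96 | 92 | 96 | 95 | 96 | 97 | 96 |
| Research | 98 | 97 | 98 | 95 | 97 | 97 | 98 | 98 | 98 | 96 | 98 | 95 | 96 | 95 | 95 |
| Writing | 98 | 97 | 96 | 95 | 97 | 98 | 96 | 97 | 98 | 97 | 97 | 98 | 98 | 98 | 98 |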

Top 5 Across Categories

Composite score (0 to 100) per category for the top 5 overall models, drawn as one polygon per model. A larger polygon means higher scores; a more regular shape means more consistent scores across categories.
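
The consistency that the polygon shape is meant to convey can also be expressed as a number, by comparing each model's mean category score with its spread. The sketch below is only an illustration, not the leaderboard's own method. The scores are read from the heatmap rows above for two of the listed models; the `summarize` helper and the choice of population standard deviation as the spread measure are assumptions.

```python
from statistics import mean, pstdev

# Category order matches the heatmap rows.
CATEGORIES = [
    "behavioural", "coding", "instruction following", "learning",
    "meta", "reasoning", "research", "writing",
]

# Per-category composite scores copied from the heatmap for two models.
SCORES = {
    "claude-opus-4.8": [93, 97, 90, 100, 100, 98, 98, 98],
    "gpt-5.3": [95, 98, 97, 99, 85, 95, 97, 97],
}


def summarize(scores):
    """Return mean, spread (population std dev), and the weakest category.

    Hypothetical helper: a lower spread means a more even polygon.
    """
    weakest_index = scores.index(min(scores))
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),
        "weakest": CATEGORIES[weakest_index],
        "min": min(scores),
        "max": max(scores),
    }


for model, scores in SCORES.items():
    s = summarize(scores)
    print(
        f"{model}: mean={s['mean']:.2f} spread={s['spread']:.2f} "
        f"weakest={s['weakest']} ({s['min']})"
    )
```

On these two rows, gpt-5.3 has the larger spread, driven by its Meta score of 85, which would show up on the chart as a dent in an otherwise even polygon.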

Behavioural

Sycophancy resistance, calibration under social pressure, pushback against confident-but-wrong claims.

Coding

Code review, debugging, implementation. Tests pattern recognition, language-specific knowledge, and ability to spot subtle bugs.

Instruction Following

Strict format and constraint adherence: exact list lengths, ordered steps, banned words, structural rules.

Learning

Explanatory writing on technical topics. Tests how well the model teaches a concept to a target audience.

Meta

Calibration and self-awareness: recognising false premises, hedging appropriately, knowing when to refuse.

Reasoning

Multi-step quantitative reasoning, Fermi estimation, logical deduction, statistical analysis.

Research

Open-ended research and synthesis: comparisons, tradeoff analysis, design recommendations.

Writing

Production writing (docs, summaries, explanations) with constraints on length, audience, and format.