BenchPress - By Category

Category Heatmap (Top 15)

Composite scores (0 to 100) per category for the top 15 models overall. The full leaderboard is on the Generalist page.

Category	claude-opus-4.8	gpt-5.3	claude-fable-5	claude-opus-4.7	claude-sonnet-4.6	gpt-5.5	gpt-5.2	gpt-5.4	claude-opus-4.6	gpt-5.1	claude-opus-4.5	claude-sonnet-4.5	gemini-3-flash	glm-5	kimi-k2.5
behavioural	93	95	90	95	91	97	90	89	93	91	91	91	87	88	90
coding	97	98	96	96	95	94	96	99	94	95	94	92	93	91	94
instruction following	90	97	93	89	91	93	97	96	86	90	85	90	92	97	81
learning	100	99	100	100	100	99	100	99	100	100	99	98	99	98	100
meta	100	85	100	98	98	82	86	86	93	94	97	93	89	79	87
reasoning	98	95	96	96	95	95	94	90	96	92	96	95	96	97	96
research	98	97	98	95	97	97	98	98	98	96	98	95	96	95	95
writing	98	97	96	95	97	98	96	97	98	97	97	98	98	98	98
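
As a rough illustration of how the heatmap can be read programmatically, the sketch below parses a tab-separated excerpt of the table (the first three model columns and three categories) and reports the top-scoring model or models in each category. The names `HEATMAP_TSV` and `best_per_category` are hypothetical and not part of BenchPress; ties are kept rather than broken.

```python
import csv
import io

# Excerpt copied from the heatmap: first three model columns, three categories.
HEATMAP_TSV = (
    "Category\tclaude-opus-4.8\tgpt-5.3\tclaude-fable-5\n"
    "behavioural\t93\t95\t90\n"
    "coding\t97\t98\t96\n"
    "meta\t100\t85\t100\n"
)


def best_per_category(tsv_text):
    """Map each category to (top score, list of models that reach it)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        category = row.pop("Category")
        # Remaining columns are model names mapped to composite scores.
        scores = {model: int(value) for model, value in row.items()}
        top = max(scores.values())
        # Keep every model that ties for the top score, in column order.
        result[category] = (top, [m for m, s in scores.items() if s == top])
    return result


if __name__ == "__main__":
    for category, (top, models) in best_per_category(HEATMAP_TSV).items():
        print(f"{category}: {top} ({', '.join(models)})")
```

The same function should work on the full table if it is saved as tab-separated text with the header row intact.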

behavioural: Sycophancy resistance, calibration under social pressure, pushback against confident-but-wrong claims.

coding: Code review, debugging, implementation. Tests pattern recognition, language-specific knowledge, and the ability to spot subtle bugs.

instruction following: Strict format and constraint adherence: exact list lengths, ordered steps, banned words, structural rules.

learning: Explanatory writing on technical topics. Tests how well the model teaches a concept to a target audience.

reasoning: Multi-step quantitative reasoning, Fermi estimation, logical deduction, statistical analysis.

research: Open-ended research and synthesis: comparisons, tradeoff analysis, design recommendations.

writing: Production writing (docs, summaries, explanations) with constraints on length, audience, and format.