HealthBench
An OpenAI benchmark of 5,000 realistic health conversations graded against physician-written rubrics. LLMReference tracks the length-adjusted score when available to reduce verbosity bias. A high benchmark score alone doesn't make a model the right pick; weigh it against pricing, API availability, and release date.
Models ranked: 3 (tracked on this benchmark)
Score band: 57.0 – 55.8 (best → lowest tracked)
Snapshot trend: not available (need ≥2 snapshots)
Leaderboard
Tracked models ranked by length-adjusted score (higher is better).
How to read this benchmark
On this benchmark, higher scores are better. Use the scores for directional filtering and shortlisting, not as a universal quality ranking. Prefer benchmarks closest to your workload, then check the linked model pages for pricing, context window, and provider availability.
Trust this score when
- There is a fresh timestamped snapshot (or multiple snapshots) for this benchmark.
- The model list covers the same version family you can actually deploy today.
- Top candidates meet your routing and feature requirements.
Be cautious when
- There is only one benchmark snapshot or the dataset appears stale.
- The benchmark's metric direction is the opposite of your decision objective.
- The score difference between options is narrow and likely within implementation variance.
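The last caution can be checked mechanically. The sketch below is illustrative only: the helper name `gap_is_narrow` and the 2.0-point tolerance are assumptions made for this example, not values published by HealthBench or LLMReference. It uses the score band shown on this page (57.0 best, 55.8 lowest tracked).

```python
def gap_is_narrow(score_a: float, score_b: float, tolerance: float = 2.0) -> bool:
    """Return True when two scores differ by no more than `tolerance` points.

    `tolerance` is a hypothetical threshold; pick one that reflects the
    run-to-run variance you actually observe for your setup.
    """
    return abs(score_a - score_b) <= tolerance


# Score band from this page: best and lowest tracked scores.
best, lowest = 57.0, 55.8

# Rounded to one decimal to avoid floating-point noise in the difference.
spread = round(best - lowest, 1)

print(f"spread: {spread} points")
print(f"within assumed tolerance: {gap_is_narrow(best, lowest)}")
```

When the gap falls inside whatever tolerance you choose, treat the models as tied on this benchmark and let pricing, context window, and provider availability decide.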
FAQ
What does the HealthBench benchmark measure?
HealthBench is an OpenAI benchmark of 5,000 realistic health conversations graded against physician-written rubrics. LLMReference tracks the length-adjusted score when available to reduce verbosity bias. This page ranks 3 tracked models, and higher is better.
Is a higher HealthBench score always better?
For this benchmark, higher is better. A high score helps you shortlist, but confirm pricing, context window, and provider availability on each model page before committing — the top scorer is not always the right pick for your workload or budget.
How current is this HealthBench data?
This benchmark was last reviewed on Jun 26, 2026. Re-check the linked model pages for the freshest provider and pricing detail.
Last reviewed: Jun 26, 2026