Kimi K2.6: GPQA Diamond 90.5%; output from $3.50 / 1M tokens
Last refreshed 2026-05-18. Next refresh: weekly.
Top open-weight language models you can run locally or self-host, ranked by sourced capability signals, parameter scale, and release freshness.
An opinionated shortlist for this category. The full leaderboard with pricing follows below.
Open-weight boards emphasize GPQA Diamond (harder to game than broad MMLU), then MMLU, then recency.
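One way to read the ordering rule above is as a lexicographic sort: GPQA Diamond first, MMLU as the tie-breaker, release date last. The sketch below assumes that reading. The `Entry` fields, the handling of missing scores, and the sample rows (`model-a` and so on) are hypothetical placeholders rather than leaderboard data, and the site's actual scoring may weight these signals differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Entry:
    name: str
    gpqa_diamond: Optional[float]  # percent; None if no sourced score
    mmlu: Optional[float]          # percent; None if no sourced score
    released: date


def rank_key(entry: Entry) -> tuple:
    # Assumption: a missing score sorts below any real score.
    gpqa = entry.gpqa_diamond if entry.gpqa_diamond is not None else -1.0
    mmlu = entry.mmlu if entry.mmlu is not None else -1.0
    return (gpqa, mmlu, entry.released)


def rank(entries: list[Entry]) -> list[Entry]:
    # reverse=True puts higher scores and newer releases first.
    return sorted(entries, key=rank_key, reverse=True)


# Hypothetical rows for illustration only (not real benchmark results).
sample = [
    Entry("model-a", 85.7, 88.0, date(2026, 1, 10)),
    Entry("model-b", 85.7, 88.0, date(2026, 4, 2)),
    Entry("model-c", 90.5, None, date(2025, 11, 5)),
    Entry("model-d", None, 91.0, date(2026, 5, 1)),
]

ranked_names = [e.name for e in rank(sample)]
```

Here `model-a` and `model-b` tie on both scores, so the newer release (`model-b`) ranks higher, while `model-d` falls to the bottom because it has no GPQA Diamond score.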
| # | Model | Capabilities | GPQA Diamond | Input $/1M tokens | Output $/1M tokens |
|---|---|---|---|---|---|
| 1 | Kimi K2.6 | Reasoning, Vision, Tools | 90.5% | $0.75 | $3.50 |
| 2 | DeepSeek V4 Pro | Reasoning, Tools | 90.1% | $0.43 | $0.87 |
| 3 | Qwen3.5-397B-A17B | Reasoning, Tools | 89.3% | $0.39 | $2.34 |
| 4 | Trinity-Large-Thinking | Reasoning, Tools | 89.2% | $0.22 | $0.85 |
| 5 | DeepSeek V4 Flash | Reasoning, Tools | 88.1% | $0.14 | $0.28 |
| 6 | Kimi K2.5 | Tools | 87.9% | $0.44 | $2.00 |
| 7 | Qwen3.6-27B | Reasoning, Vision, Tools | 87.8% | $0.32 | $3.20 |
| 8 | GLM-5.1 | Reasoning, Tools | 86.2% | $1.05 | $3.50 |
| 9 | Qwen3-235B-A22B | | 86.1% | $0.40 | $1.20 |
| 10 | Qwen3.6-35B-A3B | Tools | 86% | $0.15 | $1.00 |
| 11 | Qwen3.5-27B | Reasoning, Vision, Tools | 85.8% | $0.20 | $1.56 |
| 12 | Gemma 4 31B | Tools | 85.7% (tied within margin) | — | — |
| 13 | Qwen3.5-122B-A10B | Reasoning, Vision, Tools | 85.7% | $0.26 | $2.08 |
| 14 | Qwen3.5-35B-A3B | Reasoning, Tools | 84.5% | $0.16 | $1.30 |
| 15 | DeepSeek V3.2 | | 84% | $0.25 | $0.38 |
| 16 | Qwen3.5-9B | Vision, Tools | 81.7% | $0.10 | $0.15 |
| 17 | DeepSeek R1 0528 | Reasoning | 81% | $0.10 | $0.30 |
| 18 | Gemma 4 26B A4B IT | Tools | 79.2% | $0.06 | $0.33 |
| 19 | K-EXAONE 236B-A23B | | 78.3% | — | — |
| 20 | gpt-oss-120b | Tools | 78.2% | $0.04 | $0.18 |
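The per-million-token prices in the table translate into a per-request cost with simple arithmetic. The following is a minimal sketch, assuming a hypothetical workload of 200,000 input and 50,000 output tokens. The function name and workload are illustrative, the prices are the listed figures for two rows of the table, and a provider's actual bill may differ (for example, through cache discounts).

```python
def request_cost_usd(input_tokens, output_tokens, input_per_million, output_per_million):
    """Estimate cost in USD from token counts and $/1M-token prices."""
    return (
        input_tokens * input_per_million + output_tokens * output_per_million
    ) / 1_000_000


# (input $/1M, output $/1M) as listed in the table above.
PRICES = {
    "Kimi K2.6": (0.75, 3.50),
    "DeepSeek V4 Flash": (0.14, 0.28),
}

# Hypothetical workload, chosen only for illustration.
INPUT_TOKENS = 200_000
OUTPUT_TOKENS = 50_000

costs = {
    model: request_cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, in_price, out_price)
    for model, (in_price, out_price) in PRICES.items()
}

for model, cost in costs.items():
    print(f"{model}: ${cost:.3f}")
```

Because output tokens are priced well above input tokens on most rows, output-heavy workloads widen the gap between models more than the input column alone suggests.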
The next entries in this ranking. The text below comes from each model's stored description in LLMReference seed data; spot-check the model page before relying on a capability claim.
Trinity-Large-Thinking (GPQA Diamond: 89.2%)
Arcee AI's flagship 400B sparse MoE reasoning model with 13B active parameters per token. Trained on 20T tokens with a STEM-focused curriculum. Designed for agentic workflows, chain-of-thought reasoning, and long-context tasks up to 256K tokens (BF16 API). Open-source under Apache 2.0. Available via the Arcee AI API.
DeepSeek V4 Flash (GPQA Diamond: 88.1%)
DeepSeek V4 Flash is a 284B-parameter (13B activated) Mixture-of-Experts language model with a 1M-token context. It features a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for efficient long-context inference, and supports thinking and non-thinking modes. The legacy API aliases deepseek-chat and deepseek-reasoner map to this model's non-thinking and thinking modes respectively. Pricing: $0.14/1M input, $0.28/1M output (cache hit: $0.0028/1M input). MIT licensed.
Kimi K2.5 (GPQA Diamond: 87.9%)
Moonshot Kimi K2.5, available on AWS Bedrock.