Top pick: MiniMax M3 (GPQA Diamond: 92.9%; output from $1.20 / 1M tokens)
Last refreshed 2026-07-01. Next refresh: weekly.
The best open-weight LLMs in 2026, ranked by benchmark scores. Run locally, self-host, or deploy on your own infra — no API key required.
New open-weight throughput family to watch: Nemotron-Labs-Diffusion from NVIDIA, including 3B, 8B, and 14B diffusion language model variants.
Verdict
GLM-5.2 is the runner-up, 1.7 points back on GPQA Diamond (91.2% versus 92.9%).
Open-weight boards emphasize GPQA Diamond (harder to game than broad MMLU), then MMLU, then recency.
| # | Model | Tags | GPQA Diamond | Input $/1M | Output $/1M |
|---|---|---|---|---|---|
| 1 | MiniMax M3 | Reasoning, Vision, Tools | 92.9% | $0.30 | $1.20 |
| 2 | Qwen3.6-Max | Vision | 91.8% | — | — |
| 3 | GLM-5.2 | Reasoning, Tools | 91.2% | $1.40 | $4.40 |
| 4 | Kimi K2.6 | Reasoning, Vision, Tools | 90.5% | $0.73 | $3.40 |
| 5 | Qwen3.6-Plus | Vision, Tools | 90.4% | $0.33 | $1.95 |
| 6 | DeepSeek V4 Pro | Reasoning, Tools | 90.1% | $0.43 | $0.87 |
| 7 | Qwen3.5-397B-A17B | Reasoning, Vision, Tools | 89.3% | $0.39 | $2.34 |
| 8 | Trinity-Large-Thinking | Reasoning, Tools | 89.2% | $0.22 | $0.85 |
| 9 | Qwen3.5-Plus | Vision | 88.4% | $0.30 | $1.80 |
| 10 | Ring-2.6-1T | Reasoning, Tools | 88.27% | $0.07 | $0.63 |
| 11 | DeepSeek V4 Flash | Reasoning, Tools | 88.1% | $0.10 | $0.20 |
| 12 | Qwen3.6-27B | Reasoning, Vision, Tools | 87.8% | $0.32 | $3.20 |
| 13 | DeepSeek V3 0324 | | 87.6% | $0.27 | $1.12 |
| 14 | MiniMax M2.7 | Reasoning, Tools | 87.4% (tied within margin) | $0.28 | $1.20 |
| 15 | Hunyuan Hy3 Preview | Preview, Reasoning, Tools | 87.2% | $0.07 | $0.26 |
| 16 | GLM-5.1 | Reasoning, Tools | 86.2% | $1.05 | $3.50 |
| 17 | Qwen3-235B-A22B | | 86.1% | $0.09 | $0.58 |
| 18 | Qwen3.6 Max Preview | Preview, Reasoning, Vision, Tools | 86% | $1.04 | $6.24 |
| 19 | Qwen3.6-35B-A3B | Vision, Tools | 86% | $0.15 | $1.00 |
| 20 | GLM-5 | Reasoning, Tools | 86% | $0.60 | $2.08 |
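To turn the per-1M-token prices in the table into a budget figure, a minimal sketch follows. The prices are copied from three rows of the table; the function name `estimate_cost` and the workload size (2M input tokens, 500K output tokens) are illustrative assumptions, not part of LLMReference.

```python
from decimal import Decimal

# Per-1M-token prices in USD, copied from the table: (input, output).
PRICES = {
    "MiniMax M3": (Decimal("0.30"), Decimal("1.20")),
    "GLM-5.2": (Decimal("1.40"), Decimal("4.40")),
    "DeepSeek V4 Flash": (Decimal("0.10"), Decimal("0.20")),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost of one workload at the listed prices."""
    input_price, output_price = PRICES[model]
    million = Decimal(1_000_000)
    return (input_price * input_tokens + output_price * output_tokens) / million


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 500K output tokens.
    for name in PRICES:
        print(f"{name}: ${estimate_cost(name, 2_000_000, 500_000):.2f}")
```

For that assumed workload the listed prices work out to $1.20 for MiniMax M3, $5.00 for GLM-5.2, and $0.30 for DeepSeek V4 Flash. These are list-price estimates only; self-hosting costs depend on your own hardware.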
Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data—spot-check the model page before relying on a capability claim.
Qwen3.6-Plus is Alibaba Cloud's GA Qwen3.6 flagship for long-context reasoning, coding, tool use, and multimodal workflows. DashScope lists it with a 1M-token context window, structured output support, and standard public token pricing.
GPQA Diamond: 90.4%
DeepSeek V4 Pro is DeepSeek's flagship open-weights model, released April 24, 2026 under the MIT license. Architecture: 1.6T total / 49B active parameters, MoE with a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) that requires only 27% of the inference FLOPs of standard 1M-context transformers, plus Manifold-Constrained Hyper-Connections (mHC) and the Muon optimizer. Context window: 1,000,000 tokens; max output: 384,000 tokens (Think Max mode requires >=384K context). Text-only (no vision/image input). Supports three reasoning modes: Non-Think, Think High, Think Max. Function calling, tool use, and structured outputs are supported. Key benchmarks: SWE-bench Verified 80.6%, SWE-bench Pro 55.4%, LiveCodeBench 93.5%, GPQA Diamond 90.1%, MMLU-Pro 87.5%, Terminal-Bench 2.0 59.1% on BenchLM's independent June 2026 harness, and Chatbot Arena 1456 (2026-06-16). Current API pricing: $0.435/$0.87 per 1M input/output tokens; DeepSeek made the former 75% promotional rate permanent in May 2026.
GPQA Diamond: 90.1%
Qwen3.5-397B-A17B is Alibaba's largest Qwen3.5 model, a Mixture-of-Experts architecture with 397B total parameters and 17B active per token (512 total experts, with 10 routed + 1 shared active). It supports 201 languages with a native 262K-token context window, extensible to 1M tokens via YaRN. It includes a thinking/reasoning mode, tool calling with MCP integration, and unified vision-language capabilities through early fusion training.
GPQA Diamond: 89.3%
Side-by-side comparison of the top picks by price, benchmark, and API access.
MiniMax M3 is the current LLMReference top pick for open-source and self-hosted use. The verdict uses the stored category signal GPQA Diamond: 92.9%. Output pricing starts at $1.20 per 1M tokens. Review the linked model and provider pages before production use because availability and pricing can change.
MiniMax M3 leads GLM-5.2 in the visible shortlist on GPQA Diamond: 92.9% versus 91.2%. Per the pricing cards, output pricing starts at $1.20 per 1M tokens for MiniMax M3 and $4.40 per 1M tokens for GLM-5.2.
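The two comparisons in that answer reduce to a subtraction and a division. A small sketch, using only the figures quoted above (the variable names are illustrative):

```python
from decimal import Decimal

# Figures quoted in the comparison above.
M3_GPQA, GLM_GPQA = Decimal("92.9"), Decimal("91.2")
M3_OUTPUT_PRICE, GLM_OUTPUT_PRICE = Decimal("1.20"), Decimal("4.40")

# Lead on GPQA Diamond, in percentage points.
score_gap = M3_GPQA - GLM_GPQA

# How many times more GLM-5.2 output tokens cost, rounded to two decimals.
price_ratio = (GLM_OUTPUT_PRICE / M3_OUTPUT_PRICE).quantize(Decimal("0.01"))

print(f"GPQA Diamond lead: {score_gap} points")
print(f"Output price ratio (GLM-5.2 / MiniMax M3): {price_ratio}x")
```

On these numbers the lead is 1.7 points and GLM-5.2 output tokens cost about 3.67 times as much as MiniMax M3 output tokens.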
LLMReference ranks LLMs for open-source and self-hosted use from stored model, benchmark, freshness, and pricing data. The current methodology summary is: Open-weight boards emphasize GPQA Diamond (harder to game than broad MMLU), then MMLU, then recency.
The LLM rankings on this page are updated daily as new benchmark scores, provider availability, and pricing data are tracked. The "as of" date at the top of the page shows the most recent refresh.
The podium picks are driven by the primary benchmark signal for this category (shown in the Methodology section), filtered to non-deprecated models with confirmed API availability. In ties, we prefer the more recently released model.
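The selection rule described above (filter, rank by the primary signal, break ties by recency) might be sketched as follows. This is an assumption-laden illustration, not LLMReference's actual code: the `Candidate` fields, the `podium` function, the three-seat default, and all sample records are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float         # primary benchmark signal, in percent
    released: date       # release date, used only to break ties
    deprecated: bool
    api_available: bool  # confirmed API availability


def podium(candidates: list[Candidate], seats: int = 3) -> list[Candidate]:
    """Filter to eligible models, then rank by score with newer models winning ties."""
    eligible = [c for c in candidates if not c.deprecated and c.api_available]
    # reverse=True sorts both score and release date in descending order.
    ranked = sorted(eligible, key=lambda c: (c.score, c.released), reverse=True)
    return ranked[:seats]


# Placeholder records, not real models or real scores.
SAMPLE = [
    Candidate("example-a", 90.0, date(2026, 1, 10), False, True),
    Candidate("example-b", 91.0, date(2026, 2, 1), False, False),   # no confirmed API
    Candidate("example-c", 88.0, date(2025, 11, 5), False, True),
    Candidate("example-d", 88.0, date(2026, 3, 20), False, True),   # ties c, but newer
    Candidate("example-e", 95.0, date(2024, 6, 1), True, True),     # deprecated
]

if __name__ == "__main__":
    for seat, model in enumerate(podium(SAMPLE), start=1):
        print(seat, model.name, model.score)
```

With the placeholder data, `example-e` and `example-b` are filtered out despite their higher scores, and `example-d` is seated ahead of `example-c` because it is the more recent of the two tied models.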
Preview models appear in the "Watch list" section but are not in the main ranked podium unless the category explicitly allows it (e.g., /best/coding and /best/agents, where preview models often lead benchmarks).
To compare models side by side, use the Compare tool at llmreference.com/compare, which breaks down context window, pricing, benchmarks, and provider availability.
Pricing is tracked from provider documentation and updated regularly. It reflects the best available public data, not live API quotes — always verify before billing.