LLM Reference

Best Open Source LLMs (2026)

Last refreshed 2026-07-01. Next refresh: weekly.

The best open-weight LLMs in 2026, ranked by benchmark scores. Run locally, self-host, or deploy on your own infra — no API key required.

New throughput-focused open-weight family to watch: Nemotron-Labs-Diffusion from NVIDIA, including 3B, 8B, and 14B diffusion language model variants.

Verdict

Use MiniMax M3 for self-hosted open-weight use today.

GLM-5.2 is the runner-up, 1.7 points back on GPQA Diamond (91.2% versus 92.9%).


How we rank

Open-weight boards emphasize GPQA Diamond (harder to game than broad MMLU), then MMLU, then recency.

  1. Eligibility: Models marked with supported open licenses/flags in seed data.
  2. Primary ranking: GPQA Diamond, then MMLU, then newer release.
  3. Variant collapse: We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
  4. Pricing: Hosted open-weight pricing still varies widely, so use the rate card columns for an apples-to-apples comparison.
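
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not LLMReference's actual pipeline: the `Model` fields, the `open_weight` flag, and the helper names are hypothetical, and the sketch assumes that only siblings within 0.5 pt of their family's best GPQA Diamond score compete for the canonical slot.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

TIE_MARGIN = 0.5  # GPQA Diamond points, taken from the variant-collapse rule


@dataclass
class Model:
    # Hypothetical field names; the real seed-data schema is not shown on this page.
    name: str
    family_slug: str
    tier: str
    gpqa: float
    mmlu: Optional[float]
    release: date
    input_price: Optional[float]
    is_ga: bool
    open_weight: bool


def canonical_key(m):
    # Lowest tracked input price, then GA over preview, then newest release.
    price = m.input_price if m.input_price is not None else float("inf")
    return (price, not m.is_ga, -m.release.toordinal())


def collapse_variants(models):
    # Keep one row per (family, parameter tier).
    groups = {}
    for m in models:
        groups.setdefault((m.family_slug, m.tier), []).append(m)
    picked = []
    for siblings in groups.values():
        best = max(s.gpqa for s in siblings)
        # Assumption: only siblings within the margin of the best score are candidates.
        tied = [s for s in siblings if best - s.gpqa <= TIE_MARGIN]
        picked.append(min(tied, key=canonical_key))
    return picked


def rank(models):
    # Eligibility filter, variant collapse, then GPQA Diamond > MMLU > recency.
    eligible = [m for m in models if m.open_weight]

    def sort_key(m):
        mmlu = m.mmlu if m.mmlu is not None else float("-inf")
        return (-m.gpqa, -mmlu, -m.release.toordinal())

    return sorted(collapse_variants(eligible), key=sort_key)
```

Sorting on a tuple keeps the tie-break order explicit: a later element is only consulted when every earlier element is equal, which matches the "then ... then ..." wording of the methodology.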
| # | Model | Tags | GPQA Diamond | Input $/1M | Output $/1M |
|---|-------|------|--------------|------------|-------------|
| 1 | MiniMax M3 | Reasoning, Vision, Tools | 92.9% | $0.30 | $1.20 |
| 2 | Qwen3.6-Max | Vision | 91.8% | | |
| 3 | GLM-5.2 | Reasoning, Tools | 91.2% | $1.40 | $4.40 |
| 4 | Kimi K2.6 | Reasoning, Vision, Tools | 90.5% | $0.73 | $3.40 |
| 5 | Qwen3.6-Plus | Vision, Tools | 90.4% | $0.33 | $1.95 |
| 6 | DeepSeek V4 Pro | Reasoning, Tools | 90.1% | $0.43 | $0.87 |
| 7 | Qwen3.5-397B-A17B | Reasoning, Vision, Tools | 89.3% | $0.39 | $2.34 |
| 8 | Trinity-Large-Thinking | Reasoning, Tools | 89.2% | $0.22 | $0.85 |
| 9 | Qwen3.5-Plus | Vision | 88.4% | $0.30 | $1.80 |
| 10 | Ring-2.6-1T | Reasoning, Tools | 88.27% | $0.07 | $0.63 |
| 11 | DeepSeek V4 Flash | Reasoning, Tools | 88.1% | $0.10 | $0.20 |
| 12 | Qwen3.6-27B | Reasoning, Vision, Tools | 87.8% | $0.32 | $3.20 |
| 13 | DeepSeek V3 0324 | | 87.6% | $0.27 | $1.12 |
| 14 | MiniMax M2.7 | Reasoning, Tools | 87.4% (Tied within margin) | $0.28 | $1.20 |
| 15 | Hunyuan Hy3 Preview | Preview, Reasoning, Tools | 87.2% | $0.07 | $0.26 |
| 16 | GLM-5.1 | Reasoning, Tools | 86.2% | $1.05 | $3.50 |
| 17 | Qwen3-235B-A22B | | 86.1% | $0.09 | $0.58 |
| 18 | Qwen3.6 Max Preview | Preview, Reasoning, Vision, Tools | 86% | $1.04 | $6.24 |
| 19 | Qwen3.6-35B-A3B | Vision, Tools | 86% | $0.15 | $1.00 |
| 20 | GLM-5 | Reasoning, Tools | 86% | $0.60 | $2.08 |
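
To compare rows on cost, the two rate-card columns can be combined into one figure for a given workload. The sketch below is only an illustration: the function name and the token volumes (2M input, 0.5M output) are assumptions, while the per-1M rates are the listed prices for MiniMax M3 and GLM-5.2.

```python
def workload_cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """Estimated USD cost given per-1M-token input and output rates."""
    return (input_tokens / 1_000_000) * input_per_m + (
        output_tokens / 1_000_000
    ) * output_per_m


# Hypothetical workload: 2M input tokens and 0.5M output tokens.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 500_000

# Rates copied from the Input $/1M and Output $/1M columns.
minimax_m3 = workload_cost(INPUT_TOKENS, OUTPUT_TOKENS, 0.30, 1.20)
glm_5_2 = workload_cost(INPUT_TOKENS, OUTPUT_TOKENS, 1.40, 4.40)

print(f"MiniMax M3: ${minimax_m3:.2f}")
print(f"GLM-5.2:    ${glm_5_2:.2f}")
```

The result depends heavily on the input-to-output ratio, so it is worth rerunning the estimate with token counts measured from your own traffic rather than comparing a single price column.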

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data; spot-check the model page before relying on a capability claim.

  • Qwen3.6-Plus is Alibaba Cloud's GA Qwen3.6 flagship for long-context reasoning, coding, tool use, and multimodal workflows. DashScope lists it with a 1M-token context window, structured output support, and standard public token pricing.

    GPQA Diamond: 90.4%

  • DeepSeek V4 Pro is DeepSeek's flagship open-weights model, released April 24 2026 under the MIT license. Architecture: 1.6T total / 49B active parameters, MoE with Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) hybrid — requiring only 27% of inference FLOPs vs standard 1M-context transformers — plus Manifold-Constrained Hyper-Connections (mHC) and Muon Optimizer. Context window: 1,000,000 tokens; max output: 384,000 tokens (Think Max mode requires >=384K context). Text-only (no vision/image input). Supports three reasoning modes: Non-Think, Think High, Think Max. Function calling, tool use, and structured outputs supported. Key benchmarks: SWE-bench Verified 80.6%, SWE-bench Pro 55.4%, LiveCodeBench 93.5%, GPQA Diamond 90.1%, MMLU-Pro 87.5%, Terminal-Bench 2.0 59.1% on BenchLM's independent June 2026 harness, and Chatbot Arena 1456 (2026-06-16). Current API pricing: $0.435/$0.87 per 1M input/output tokens; DeepSeek made the former 75% promotional rate permanent in May 2026.

    GPQA Diamond: 90.1%

  • Qwen3.5-397B-A17B is Alibaba's largest Qwen3.5 model, featuring a Mixture-of-Experts architecture with 397B total parameters and 17B active per token (using 512 total experts with 10 routed + 1 shared active). Supports 201 languages with a native 262K token context window extensible to 1M tokens via YaRN. Includes a thinking/reasoning mode, tool calling with MCP integration, and unified vision-language capabilities through early fusion training.

    GPQA Diamond: 89.3%

Compare Top Picks

Side-by-side comparison of the top picks by price, benchmark, and API access.

Frequently asked questions

Which LLM is best for open-source and self-hosted use?

MiniMax M3 is the current LLMReference top pick for open-source and self-hosted use. The verdict uses the stored category signal GPQA Diamond: 92.9%. Output pricing starts at $1.20 per 1M tokens. Review the linked model and provider pages before production use because availability and pricing can change.

How does MiniMax M3 compare to GLM-5.2 for open-source and self-hosted use?

MiniMax M3 leads GLM-5.2 in the visible shortlist on GPQA Diamond, 92.9% versus 91.2%. On the pricing cards, MiniMax M3 output pricing starts at $1.20 per 1M tokens and GLM-5.2 output pricing starts at $4.40 per 1M tokens.

How does LLMReference rank LLMs for open-source and self-hosted use?

LLMReference ranks LLMs for open-source and self-hosted use from stored model, benchmark, freshness, and pricing data. The current methodology summary is: Open-weight boards emphasize GPQA Diamond (harder to game than broad MMLU), then MMLU, then recency.

How often is this list updated?

The LLM rankings on this page are updated daily as new benchmark scores, provider availability, and pricing data are tracked. The "as of" date at the top of the page shows the most recent refresh.

How do you decide which models appear in the top 3?

The podium picks are driven by the primary benchmark signal for this category (shown in the Methodology section), filtered to non-deprecated models with confirmed API availability. In ties, we prefer the more recently released model.

Are preview or beta models included?

Preview models appear in the "Watch list" section but are not in the main ranked podium unless the category explicitly allows it (e.g., /best/coding and /best/agents, where preview models often lead benchmarks).

Can I compare two specific models head-to-head?

Yes — use the Compare tool at llmreference.com/compare for a side-by-side breakdown of context window, pricing, benchmarks, and provider availability.

Is the pricing data real-time?

Pricing is tracked from provider documentation and updated regularly. It reflects the best available public data, not live API quotes — always verify before billing.