LLM Reference

Best Open Source LLMs (2026)

Last refreshed 2026-05-18. Refreshed weekly.

Top open-weight language models you can run locally or self-host, ranked by sourced capability signals, parameter scale, and release freshness.

Top three picks

An opinionated short list for this category. The full leaderboard with pricing follows below.

How we rank

Open-weight boards emphasize GPQA Diamond (harder to game than broad MMLU), then MMLU, then recency.
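The ordering described above can be sketched as a three-part sort key. This is a minimal illustration, not the site's actual implementation: the `Model` class, its field names, and the `rank_key` helper are hypothetical, the example scores and dates are made up, and sorting missing scores last is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Model:
    # Hypothetical record shape; field names are assumptions for illustration.
    name: str
    gpqa_diamond: Optional[float]  # percent, None if not reported
    mmlu: Optional[float]          # percent, None if not reported
    release: date


def rank_key(m: Model):
    # Higher scores first, so negate them. A missing score becomes -inf,
    # which negates to +inf and therefore sorts last (an assumption).
    gpqa = m.gpqa_diamond if m.gpqa_diamond is not None else float("-inf")
    mmlu = m.mmlu if m.mmlu is not None else float("-inf")
    # Newer releases first: negate the ordinal day number.
    return (-gpqa, -mmlu, -m.release.toordinal())


def rank(models):
    return sorted(models, key=rank_key)


# Illustrative data only; these are not leaderboard entries.
example = [
    Model("a", 85.7, 88.0, date(2026, 1, 1)),
    Model("b", 85.7, 90.0, date(2025, 6, 1)),
    Model("c", 90.5, None, date(2026, 4, 1)),
    Model("d", 85.7, 90.0, date(2026, 2, 1)),
]
print([m.name for m in rank(example)])  # ['c', 'd', 'b', 'a']
```

GPQA Diamond decides first, MMLU only breaks exact GPQA ties, and release date only breaks exact ties on both.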

  1. Eligibility: Models marked with supported open licenses/flags in seed data.
  2. Primary ranking: GPQA Diamond, then MMLU, then newer release.
  3. Variant collapse: We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
  4. Pricing: Hosted open-weight pricing still varies widely; use the rate card columns for an apples-to-apples comparison.
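The variant-collapse step can be sketched as follows. This is one possible reading, under stated assumptions: the `Variant` class and its fields are hypothetical, "tie" is taken to mean "within 0.5 pt of the best score in the family", untracked prices sort last, and only percentage scores are handled (the ±10 Elo band is not modelled).

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby
from typing import Optional

TIE_BAND_PT = 0.5  # noise band for percentage benchmarks, from the rule above


@dataclass
class Variant:
    # Hypothetical record shape; field names are assumptions for illustration.
    sku: str
    family_slug: str
    param_tier: str
    score: float                   # headline benchmark score, in percent
    input_price: Optional[float]   # $/1M input tokens, None if untracked
    is_ga: bool                    # True for GA, False for preview/limited
    release: date


def pick_canonical(variants):
    # Variants within the noise band of the best score count as tied.
    best = max(v.score for v in variants)
    tied = [v for v in variants if best - v.score <= TIE_BAND_PT]

    def tiebreak(v):
        price = v.input_price if v.input_price is not None else float("inf")
        # Lowest input price, then GA before non-GA, then newest release.
        return (price, not v.is_ga, -v.release.toordinal())

    return min(tied, key=tiebreak)


def collapse(variants):
    # One row per (familySlug, parameter tier) pair.
    def family(v):
        return (v.family_slug, v.param_tier)

    ordered = sorted(variants, key=family)
    return [pick_canonical(list(group)) for _, group in groupby(ordered, key=family)]
```

Note that `groupby` only merges adjacent items, which is why the input is sorted by the same family key first.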

| # | Model | Tags | GPQA Diamond | Input $/1M | Output $/1M |
|---|-------|------|--------------|------------|-------------|
| 1 | Kimi K2.6 | Reasoning, Vision, Tools | 90.5% | $0.75 | $3.50 |
| 2 | DeepSeek V4 Pro | Reasoning, Tools | 90.1% | $0.43 | $0.87 |
| 3 | Qwen3.5-397B-A17B | Reasoning, Tools | 89.3% | $0.39 | $2.34 |
| 4 | Trinity-Large-Thinking | Reasoning, Tools | 89.2% | $0.22 | $0.85 |
| 5 | DeepSeek V4 Flash | Reasoning, Tools | 88.1% | $0.14 | $0.28 |
| 6 | Kimi K2.5 | Tools | 87.9% | $0.44 | $2.00 |
| 7 | Qwen3.6-27B | Reasoning, Vision, Tools | 87.8% | $0.32 | $3.20 |
| 8 | GLM-5.1 | Reasoning, Tools | 86.2% | $1.05 | $3.50 |
| 9 | Qwen3-235B-A22B | | 86.1% | $0.40 | $1.20 |
| 10 | Qwen3.6-35B-A3B | Tools | 86% | $0.15 | $1.00 |
| 11 | Qwen3.5-27B | Reasoning, Vision, Tools | 85.8% | $0.20 | $1.56 |
| 12 | Gemma 4 31B | Tools | 85.7% (Tied within margin) | not listed | not listed |
| 13 | Qwen3.5-122B-A10B | Reasoning, Vision, Tools | 85.7% | $0.26 | $2.08 |
| 14 | Qwen3.5-35B-A3B | Reasoning, Tools | 84.5% | $0.16 | $1.30 |
| 15 | DeepSeek V3.2 | | 84% | $0.25 | $0.38 |
| 16 | Qwen3.5-9B | Vision, Tools | 81.7% | $0.10 | $0.15 |
| 17 | DeepSeek R1 0528 | Reasoning | 81% | $0.10 | $0.30 |
| 18 | Gemma 4 26B A4B IT | Tools | 79.2% | $0.06 | $0.33 |
| 19 | K-EXAONE 236B-A23B | | 78.3% | not listed | not listed |
| 20 | gpt-oss-120b | Tools | 78.2% | $0.04 | $0.18 |
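
To compare rate cards on equal footing, it can help to price one fixed workload against each model's input and output rates. The sketch below does that for two rows of the leaderboard above. The `workload_cost` helper and the workload size (2M input tokens, 0.5M output tokens) are hypothetical; the per-1M rates are copied from the leaderboard, and cache discounts or provider-specific surcharges are not modelled.

```python
def workload_cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """Dollar cost of a workload given $/1M-token input and output rates."""
    return (input_tokens / 1_000_000) * input_per_m + (
        output_tokens / 1_000_000
    ) * output_per_m


# (input $/1M, output $/1M), taken from the leaderboard rows.
RATE_CARD = {
    "Kimi K2.6": (0.75, 3.50),
    "DeepSeek V4 Flash": (0.14, 0.28),
}

# Hypothetical workload: 2M tokens in, 0.5M tokens out.
for name, (inp, out) in RATE_CARD.items():
    cost = workload_cost(2_000_000, 500_000, inp, out)
    print(f"{name}: ${cost:.2f}")
# Kimi K2.6: $3.25
# DeepSeek V4 Flash: $0.42
```

Because output rates are often several times the input rate, the ratio of input to output tokens in the workload can change which model comes out cheaper.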

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLM Reference seed data; spot-check the model page before relying on a capability claim.

  • Trinity-Large-Thinking: Arcee AI's flagship 400B sparse MoE reasoning model with 13B active parameters per token. Trained on 20T tokens with a STEM-focused curriculum. Designed for agentic workflows, chain-of-thought reasoning, and long-context tasks up to 256K tokens (BF16 API). Open-source under Apache 2.0. Available via Arcee AI API.

    GPQA Diamond: 89.2%

  • DeepSeek V4 Flash is a 284B parameter (13B activated) Mixture-of-Experts language model with 1M-token context. Features a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for efficient long-context inference. Supports thinking and non-thinking modes. Legacy API aliases deepseek-chat and deepseek-reasoner map to this model's non-thinking and thinking modes respectively. Pricing: $0.14/1M input, $0.28/1M output (cache hit: $0.0028/1M input). MIT licensed.

    GPQA Diamond: 88.1%

  • Moonshot Kimi K2.5 available on AWS Bedrock

    GPQA Diamond: 87.9%