LLM Reference

Best Small Language Models Under 10B Parameters (2026)

Last refreshed 2026-07-01. Next refresh: weekly.

The best small LLMs under 10B parameters in 2026 — fast, cheap, and deployable on-device or at the edge with strong benchmark scores.

Verdict

Use MiniMax M2.7 for small-model deployments today.

Phi-4 Mini is the runner-up, 28 points back on MMLU-Pro.

Researched 31d ago

Single-source result: MiniMax M2.7 scored 80.4% on MMLU-Pro, more than five points above the next GA score (56.0%). We dropped it one GA rank until another source corroborates the result.
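The demotion rule above can be sketched in a few lines. This is a minimal illustration, not the site's implementation: the `Entry` type, the `sources` field, and the `demote_single_source` function are hypothetical names, the five-point threshold is read off the note above, and the source counts for the other models are placeholders chosen only to make the example run.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    mmlu_pro: float
    sources: int  # assumed field: number of independent sources reporting the score


def demote_single_source(ranked, gap_threshold=5.0):
    """Move a single-source leader down one place when its lead exceeds the threshold.

    `ranked` must already be sorted best-first. Only the top entry is checked,
    which is a simplification of whatever the real pipeline does.
    """
    out = list(ranked)
    if len(out) < 2:
        return out
    leader, runner_up = out[0], out[1]
    if leader.sources == 1 and leader.mmlu_pro - runner_up.mmlu_pro > gap_threshold:
        out[0], out[1] = runner_up, leader
    return out


# Scores come from the table on this page; source counts are placeholders.
by_score = [
    Entry("MiniMax M2.7", 80.43, sources=1),
    Entry("Granite 4.1 8B", 55.99, sources=2),
    Entry("Phi-4 Mini", 52.8, sources=2),
]
adjusted = [e.name for e in demote_single_source(by_score)]
print(adjusted)
```

With these inputs the single-source leader swaps with the next entry, which matches the table order where Granite 4.1 8B sits at #1 and MiniMax M2.7 at #2.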

How we rank

Small models (≤10B active parameters) rank on MMLU-Pro, then GPQA Diamond, MMLU, and HellaSwag.

  1. Eligibility: Non-deprecated models with ≤10B parameters (billions-only parser).
  2. Primary ranking: MMLU-Pro, then GPQA Diamond, then MMLU, then HellaSwag, then newer release.
  3. Podium freshness: Shortlist cards require `lastResearched` within 60 days and a tracked public output price. Stale or unpriced SKUs stay in the table with a “Verify pricing” badge once research is past 45 days.
  4. Variant collapse: We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
  5. Pricing: SLMs often win on unit economics, so compare the provider ladder before picking.
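The eligibility filter and primary ranking order from the list above can be expressed as a filter followed by a multi-key sort. The sketch below is an assumption-laden illustration: the `Model` type, its field names, and the `rank` function are hypothetical, missing scores are assumed to sort below any real score, release dates are placeholders, and parameter counts are approximations. The two `hypothetical-*` rows exist only to show the filter.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Model:
    name: str
    params_b: float  # parameter count in billions (approximate)
    deprecated: bool
    release: date  # placeholder dates below, not real release dates
    mmlu_pro: Optional[float] = None
    gpqa_diamond: Optional[float] = None
    mmlu: Optional[float] = None
    hellaswag: Optional[float] = None


def _score(value: Optional[float]) -> float:
    # Assumption: a missing benchmark score ranks below any reported score.
    return float("-inf") if value is None else value


def rank(models, max_params_b=10.0):
    """Filter to eligible models, then sort by the benchmark cascade, best first."""
    eligible = [m for m in models if not m.deprecated and m.params_b <= max_params_b]
    return sorted(
        eligible,
        key=lambda m: (
            _score(m.mmlu_pro),
            _score(m.gpqa_diamond),
            _score(m.mmlu),
            _score(m.hellaswag),
            m.release,  # final tie-break: newer release wins
        ),
        reverse=True,
    )


models = [
    Model("Gemma 2 9B", 9.0, False, date(2000, 1, 3), mmlu_pro=52.08),
    Model("Qwen3-8B", 8.0, False, date(2000, 1, 4)),  # no MMLU-Pro listed
    Model("Granite 4.1 8B", 8.0, False, date(2000, 1, 2), mmlu_pro=55.99),
    Model("Phi-4 Mini", 3.8, False, date(2000, 1, 1), mmlu_pro=52.8),
    Model("hypothetical-70b", 70.0, False, date(2000, 1, 5), mmlu_pro=70.0),
    Model("hypothetical-retired-7b", 7.0, True, date(2000, 1, 6), mmlu_pro=60.0),
]
ranked = [m.name for m in rank(models)]
print(ranked)
```

Sorting on a tuple keeps the cascade explicit: a later benchmark only matters when every earlier one ties. Freshness, variant collapse, and the single-source adjustment are separate steps and are not modeled here.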

| # | Model | Tags | MMLU-Pro | Input $/1M | Output $/1M |
|---|-------|------|----------|------------|-------------|
| 1 | Granite 4.1 8B | | 55.99% | $0.05 | $0.10 |
| 2 | MiniMax M2.7 | Reasoning, Tools | 80.43% | $0.28 | $1.20 |
| 3 | Phi-4 Mini | | 52.8% | $0.90 | $0.90 |
| 4 | Gemma 2 9B | | 52.08% | $0.06 | $0.18 |
| 5 | LFM2.5 8B A1B | Reasoning, Tools | 50.5% | | |
| 6 | Granite 4.1 3B | | 49.83% | | |
| 7 | Phi-3 Mini 4k | | 45.66% | $0.05 | $0.25 |
| 8 | LFM2.5 1.2B Instruct | Tools | 44.35% | | |
| 9 | Llama 3.1 8B Instruct | | 44.25% | $0.02 | $0.05 |
| 10 | Llama 3 8B Instruct | | 40.5% | $0.02 | $0.04 |
| 11 | Llama 3.2 3B Instruct | | 34.7% | $0.03 | $0.05 |
| 12 | Llama 3.2 1B Instruct | | 20% | $0.03 | $0.10 |
| 13 | Qwen3-8B | | | $0.04 | $0.14 |
| 14 | Qwen2-7B | | | $0.05 | $0.15 |
| 15 | Gemma 7B Instruct | | | $0.05 | $0.07 |
| 16 | OpenChat 3.5 (0106) | | | $0.07 | $0.07 |
| 17 | Starling LM 7B Beta | | | | |
| 18 | Zephyr 7B Beta | | | $0.05 | $0.20 |
| 19 | Qwen2.5-7B-Instruct | | | $0.03 | $0.03 |
| 20 | Aya 23 8B | | | | |

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data—spot-check the model page before relying on a capability claim.

  • LFM2.5-8B-A1B is Liquid AI's latest on-device mixture-of-experts model, succeeding LFM2-8B-A1B. It has 8.3B total parameters with approximately 1.5B active per token (the A1B label uses a rounded ~1B figure). The architecture combines 18 double-gated LIV convolutional layers with 6 GQA attention layers, trained on 38 trillion tokens. The context window expands to 128K tokens (up from 32K in the predecessor). It is a reasoning model that generates explicit chain-of-thought steps before producing its final answer, and the MoE design keeps those reasoning tokens cheap. Strong tool-calling, function-calling, and instruction-following capabilities make it well-suited for agentic workflows on edge hardware. Weights are openly available on Hugging Face under the lfm1.0 license.

    MMLU-Pro: 50.5%

  • #5 Granite 4.1 3B

    IBM Granite 4.1 3B is a dense decoder-only transformer instruct model with 131K token context. Supports multilingual dialog (12 languages), code (FIM), tool-calling, and RAG. Trained with SFT and RL alignment on an NVIDIA GB200 NVL72 cluster. Apache 2.0.

    MMLU-Pro: 49.83%

  • #6 Phi-3 Mini 4k

    The Phi-3 Mini-4K-Instruct model by Microsoft is an advanced, lightweight language model boasting 3.8 billion parameters, optimized for environments with limited computational resources. It excels in various natural language processing tasks, especially in reasoning, text generation, and maintaining multi-turn conversations. Trained on a mix of synthetic and high-quality data, the model is tailored for effective instruction-following. Despite its capabilities, it has limitations in factual knowledge and multilingual support, often requiring external resources to enhance accuracy. The model is ideal for commercial and research applications that demand efficient processing, such as mobile apps and real-time systems.

    MMLU-Pro: 45.66%

Frequently asked questions

Which LLM is best for small-model deployment?

MiniMax M2.7 is the current LLMReference top pick for small-model deployment. The verdict uses the stored category signal MMLU-Pro: 80.43%. Output pricing starts at $1.20 per 1M tokens. Review the linked model and provider pages before production use because availability and pricing can change.

How does MiniMax M2.7 compare to Phi-4 Mini for small-model deployment?

MiniMax M2.7 leads Phi-4 Mini in the visible shortlist on MMLU-Pro: 80.43% versus 52.8%. On the pricing cards, output pricing starts at $1.20 per 1M tokens for MiniMax M2.7 and $0.90 per 1M tokens for Phi-4 Mini.
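Output price alone can mislead, because the ranking table also lists different input prices for the two models ($0.28 versus $0.90 per 1M tokens). The sketch below estimates total cost for a workload using the table's prices. The `workload_cost` function and the 10M-input, 2M-output workload are hypothetical, chosen only to illustrate the arithmetic.

```python
def workload_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return the dollar cost of a workload given per-1M-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# (input $/1M, output $/1M) as listed in the ranking table on this page.
PRICES = {
    "MiniMax M2.7": (0.28, 1.20),
    "Phi-4 Mini": (0.90, 0.90),
}

# Hypothetical workload: 10M input tokens, 2M output tokens.
costs = {
    name: workload_cost(10_000_000, 2_000_000, inp, out)
    for name, (inp, out) in PRICES.items()
}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```

For this input-heavy workload MiniMax M2.7 comes to about $5.20 and Phi-4 Mini to about $10.80. At these listed prices, Phi-4 Mini only becomes cheaper when output tokens exceed roughly twice the input tokens (0.62 / 0.30 ≈ 2.07).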

How does LLMReference rank LLMs for small-model deployment?

LLMReference ranks LLMs for small-model deployment from stored model, benchmark, freshness, and pricing data. The current methodology summary is: Small models (≤10B active parameters) rank on MMLU-Pro, then GPQA Diamond, MMLU, and HellaSwag.
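The freshness part of that data can be illustrated with the thresholds given under "How we rank": research within 60 days plus a tracked output price for the shortlist, and a "Verify pricing" badge once research is past 45 days. The sketch below is one possible reading of that rule, not the site's code; the `freshness` function, its return keys, and the choice to key the badge on age alone are assumptions.

```python
from datetime import date, timedelta
from typing import Optional

SHORTLIST_MAX_AGE_DAYS = 60  # from the podium freshness rule
VERIFY_BADGE_AGE_DAYS = 45   # from the podium freshness rule


def freshness(last_researched: date, output_price: Optional[float], today: date) -> dict:
    """Classify one SKU by research age and whether an output price is tracked."""
    age = (today - last_researched).days
    return {
        "age_days": age,
        # Shortlist needs recent research and a tracked public output price.
        "shortlist_eligible": age <= SHORTLIST_MAX_AGE_DAYS and output_price is not None,
        # Assumed reading: the badge depends only on research age.
        "verify_pricing_badge": age > VERIFY_BADGE_AGE_DAYS,
    }


today = date(2026, 7, 1)  # the page's "Last refreshed" date
top_pick = freshness(today - timedelta(days=31), 1.20, today)   # researched 31d ago
aging = freshness(today - timedelta(days=50), 0.10, today)      # hypothetical SKU
unpriced = freshness(today - timedelta(days=31), None, today)   # hypothetical SKU
print(top_pick, aging, unpriced)
```

Under this reading, a model researched 31 days ago with a tracked price is shortlist-eligible and carries no badge, a 50-day-old entry stays eligible but shows the badge, and an unpriced entry drops off the shortlist regardless of age.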

How often is this list updated?

The LLM rankings on this page are updated daily as new benchmark scores, provider availability, and pricing data are tracked. The "Last refreshed" date at the top of the page shows the most recent refresh.

How do you decide which models appear in the top 3?

The podium picks are driven by the primary benchmark signal for this category (shown in the Methodology section), filtered to non-deprecated models with confirmed API availability. In ties, we prefer the more recently released model.

Are preview or beta models included?

Preview models appear in the "Watch list" section but are not in the main ranked podium unless the category explicitly allows it (e.g., /best/coding and /best/agents, where preview models often lead benchmarks).

Can I compare two specific models head-to-head?

Yes — use the Compare tool at llmreference.com/compare for a side-by-side breakdown of context window, pricing, benchmarks, and provider availability.

Is the pricing data real-time?

Pricing is tracked from provider documentation and updated regularly. It reflects the best available public data, not live API quotes — always verify before billing.