Single-source result

MiniMax M2.7 scored 80.4% on MMLU-Pro, more than five points above the next GA score (56.0%). We dropped it one GA rank until another source corroborates the result.
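
The demotion described in this note can be pictured as a small post-sort adjustment. The sketch below is illustrative only: the `sources` field, the function name, and the source counts for Granite 4.1 8B and Phi-4 Mini are assumptions, and the 5-point gap is read off the wording of the note rather than taken from any published rule.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    score: float   # MMLU-Pro, in percent
    sources: int   # hypothetical count of independent sources for the score


def apply_single_source_rule(entries, gap=5.0):
    """Sort by score, then drop an uncorroborated runaway leader one rank."""
    ranked = sorted(entries, key=lambda e: e.score, reverse=True)
    if len(ranked) >= 2:
        top, runner_up = ranked[0], ranked[1]
        # Assumed trigger: a single source and a lead of more than `gap` points.
        if top.sources < 2 and top.score - runner_up.score > gap:
            ranked[0], ranked[1] = runner_up, top
    return ranked


rows = [
    Entry("Granite 4.1 8B", 55.99, sources=2),  # source count is a placeholder
    Entry("MiniMax M2.7", 80.43, sources=1),
    Entry("Phi-4 Mini", 52.8, sources=2),       # source count is a placeholder
]
order = [e.name for e in apply_single_source_rule(rows)]
```

With these inputs the leader and runner-up swap places, which matches the order shown in the table on this page. The sketch only inspects the top entry; a fuller version would presumably apply the same check at every rank.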

How we rank

Small models (≤10B active parameters) rank on MMLU-Pro, then GPQA Diamond, MMLU, and HellaSwag.

Eligibility — Non-deprecated models with ≤10B parameters (billions-only parser).
Primary ranking — MMLU-Pro, then GPQA Diamond, then MMLU, then HellaSwag, then newer release.
Podium freshness — Shortlist cards require `lastResearched` within 60 days and a tracked public output price. Stale or unpriced SKUs stay in the table with a “Verify pricing” badge once research is past 45 days.
Variant collapse — We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
Pricing — SLMs often win on unit economics — compare the provider ladder before picking.
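
The eligibility filter, tie-break order, and canonical-SKU pick listed above could be expressed as sort keys. This is a minimal sketch under stated assumptions, not LLMReference's actual code: the `Model` fields, the function names, and the choice to sort missing scores and untracked prices last are all hypothetical, and the sample models are placeholders rather than real SKUs.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Model:
    name: str
    params_b: float            # billions (the summary line says active parameters)
    release: date
    deprecated: bool = False
    access: str = "ga"         # "ga", "preview", or "limited"
    input_price: Optional[float] = None   # USD per 1M input tokens, if tracked
    mmlu_pro: Optional[float] = None
    gpqa_diamond: Optional[float] = None
    mmlu: Optional[float] = None
    hellaswag: Optional[float] = None


def is_eligible(m: Model, max_params_b: float = 10.0) -> bool:
    """Non-deprecated and at or under the parameter cap."""
    return not m.deprecated and m.params_b <= max_params_b


def _desc(score: Optional[float]) -> tuple:
    # Higher scores first; a missing score sorts after every real score (assumption).
    return (score is None, -(score or 0.0))


def rank_key(m: Model) -> tuple:
    """MMLU-Pro, then GPQA Diamond, MMLU, HellaSwag, then newer release."""
    return (_desc(m.mmlu_pro), _desc(m.gpqa_diamond), _desc(m.mmlu),
            _desc(m.hellaswag), -m.release.toordinal())


def rank(models: list[Model]) -> list[Model]:
    return sorted((m for m in models if is_eligible(m)), key=rank_key)


def canonical_key(m: Model) -> tuple:
    """Lowest tracked input price, then GA, then newest release."""
    return (m.input_price is None, m.input_price or 0.0,
            m.access != "ga", -m.release.toordinal())


def pick_canonical(siblings: list[Model]) -> Model:
    # Assumes the caller already grouped SKUs whose scores tie within the margin.
    return min(siblings, key=canonical_key)


models = [
    Model("a-8b", 8.0, date(2025, 1, 1), mmlu_pro=55.0),
    Model("b-3b", 3.0, date(2025, 6, 1), mmlu_pro=55.0, gpqa_diamond=30.0),
    Model("c-7b", 7.0, date(2024, 1, 1)),                       # no scores yet
    Model("d-70b", 70.0, date(2025, 1, 1), mmlu_pro=90.0),      # over the cap
    Model("e-8b", 8.0, date(2025, 1, 1), deprecated=True, mmlu_pro=60.0),
]
ranked_names = [m.name for m in rank(models)]

siblings = [
    Model("x-preview", 8.0, date(2025, 6, 1), access="preview", input_price=0.05),
    Model("x-ga", 8.0, date(2025, 1, 1), access="ga", input_price=0.05),
    Model("x-pricey", 8.0, date(2025, 7, 1), input_price=0.10),
]
canonical = pick_canonical(siblings).name
```

Here `b-3b` outranks `a-8b` because the two tie on MMLU-Pro and only `b-3b` has a GPQA Diamond score, while `x-ga` wins the family row because it matches the lowest price and is GA. The freshness check and the ±0.5 pt grouping step are left out to keep the sketch short.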

MMLU-Pro

#	Model	MMLU-Pro	Context	Input $/1M	Output $/1M
1	Granite 4.1 8B	55.99%	131k	$0.05	$0.10
2	MiniMax M2.7 (Reasoning, Tools)	80.43%	205k	$0.28	$1.20
3	Phi-4 Mini	52.8%	128k	$0.90	$0.90
4	Gemma 2 9B	52.08%	8k	$0.06	$0.18
5	LFM2.5 8B A1B (Reasoning, Tools)	50.5%	128k	—	—
6	Granite 4.1 3B	49.83%	131k	—	—
7	Phi-3 Mini 4k	45.66%	4k	$0.05	$0.25
8	LFM2.5 1.2B Instruct (Tools)	44.35%	32k	—	—
9	Llama 3.1 8B Instruct	44.25%	128k	$0.02	$0.05
10	Llama 3 8B Instruct	40.5%	8k	$0.02	$0.04
11	Llama 3.2 3B Instruct	34.7%	128k	$0.03	$0.05
12	Llama 3.2 1B Instruct	20%	128k	$0.03	$0.10
13	Qwen3-8B	—	128k	$0.04	$0.14
14	Qwen2-7B	—	128k	$0.05	$0.15
15	Gemma 7B Instruct	—	8k	$0.05	$0.07
16	OpenChat 3.5 (0106)	—	8k	$0.07	$0.07
17	Starling LM 7B Beta	—	8k	—	—
18	Zephyr 7B Beta	—	—	$0.05	$0.20
19	Qwen2.5-7B-Instruct	—	128k	$0.03	$0.03
20	Aya 23 8B	—	—	—	—
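
To make the unit-economics point concrete, the listed prices can be turned into a cost for a fixed workload. The sketch below copies the input and output prices of the top four rows from the table; the workload size (10M input tokens, 2M output tokens) and the function name are hypothetical choices for illustration.

```python
# (input $/1M, output $/1M) copied from the table above
PRICES = {
    "Granite 4.1 8B": (0.05, 0.10),
    "MiniMax M2.7": (0.28, 1.20),
    "Phi-4 Mini": (0.90, 0.90),
    "Gemma 2 9B": (0.06, 0.18),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a workload at the listed per-1M-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 10M input tokens and 2M output tokens.
costs = {name: workload_cost(name, 10_000_000, 2_000_000) for name in PRICES}

for name, usd in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${usd:.2f}")
```

For this workload the listed prices work out to roughly $0.70 for Granite 4.1 8B, $0.96 for Gemma 2 9B, $5.20 for MiniMax M2.7, and $10.80 for Phi-4 Mini. Listed prices are the lowest tracked figures, so per-provider rates may differ.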

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data—spot-check the model page before relying on a capability claim.

#4 LFM2.5 8B A1B
LFM2.5-8B-A1B is Liquid AI's latest on-device mixture-of-experts model, succeeding LFM2-8B-A1B. It has 8.3B total parameters with approximately 1.5B active per token (the A1B label uses a rounded ~1B figure). The architecture combines 18 double-gated LIV convolutional layers with 6 GQA attention layers, trained on 38 trillion tokens. The context window expands to 128K tokens (up from 32K in the predecessor). It is a reasoning model that generates explicit chain-of-thought steps before producing its final answer; the MoE design keeps those reasoning tokens cheap. Strong tool-calling, function-calling, and instruction-following capabilities make it well-suited for agentic workflows on edge hardware. Weights are openly available on Hugging Face under the lfm1.0 license.
MMLU-Pro: 50.5%
#5 Granite 4.1 3B
IBM Granite 4.1 3B is a dense decoder-only transformer instruct model with 131K token context. Supports multilingual dialog (12 languages), code (FIM), tool-calling, and RAG. Trained with SFT and RL alignment on an NVIDIA GB200 NVL72 cluster. Apache 2.0.
MMLU-Pro: 49.83%
#6 Phi-3 Mini 4k
The Phi-3 Mini-4K-Instruct model by Microsoft is an advanced, lightweight language model boasting 3.8 billion parameters, optimized for environments with limited computational resources. It excels in various natural language processing tasks, especially in reasoning, text generation, and maintaining multi-turn conversations. Trained on a mix of synthetic and high-quality data, the model is tailored for effective instruction-following. Despite its capabilities, it has limitations in factual knowledge and multilingual support, often requiring external resources to enhance accuracy. The model is ideal for commercial and research applications that demand efficient processing, such as mobile apps and real-time systems.
MMLU-Pro: 45.66%

Compare Top Picks

Side-by-side comparison of the top picks by price, benchmark, and API access.

Granite 4.1 8B vs MiniMax M2.7
Granite 4.1 8B vs Phi-4 Mini
Granite 4.1 8B vs Gemma 2 9B
Granite 4.1 8B vs LFM2.5 8B A1B
MiniMax M2.7 vs Phi-4 Mini
MiniMax M2.7 vs Gemma 2 9B

Frequently asked questions

Which LLM is best for small-model deployment?

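Granite 4.1 8B holds the #1 seat in the current LLMReference ranking for small-model deployment, with MMLU-Pro at 55.99% and output pricing starting at $0.10 per 1M tokens. MiniMax M2.7 has the highest stored MMLU-Pro score (80.43%, output pricing starting at $1.20 per 1M tokens) but sits at #2 until a second source corroborates the result. Review the linked model and provider pages before production use because availability and pricing can change.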

How does MiniMax M2.7 compare to Phi-4 Mini for small-model deployment?

MiniMax M2.7 leads Phi-4 Mini in the visible shortlist on MMLU-Pro: 80.43% versus 52.8%. Output pricing starts at $1.20 per 1M tokens for MiniMax M2.7 and $0.90 per 1M tokens for Phi-4 Mini.

How does LLMReference rank LLMs for small-model deployment?

LLMReference ranks LLMs for small-model deployment from stored model, benchmark, freshness, and pricing data. The current methodology summary is: Small models (≤10B active parameters) rank on MMLU-Pro, then GPQA Diamond, MMLU, and HellaSwag.

How often is this list updated?

The LLM rankings on this page are updated daily as new benchmark scores, provider availability, and pricing data are tracked. The "as of" date at the top of the page shows the most recent refresh.

How do you decide which models appear in the top 3?

The podium picks are driven by the primary benchmark signal for this category (shown in the "How we rank" section), filtered to non-deprecated models with confirmed API availability. In ties, we prefer the more recently released model.

Are preview or beta models included?

Preview models appear in the "Watch list" section but are not in the main ranked podium unless the category explicitly allows it (e.g., /best/coding and /best/agents, where preview models often lead benchmarks).

Can I compare two specific models head-to-head?

Yes — use the Compare tool at llmreference.com/compare for a side-by-side breakdown of context window, pricing, benchmarks, and provider availability.

Is the pricing data real-time?

Pricing is tracked from provider documentation and updated regularly. It reflects the best available public data, not live API quotes — always verify before billing.