MiniMax M2.7
- MMLU-Pro: 80.43%
- Output (from): $1.20 / 1M
Last refreshed 2026-07-01. Next refresh: weekly.
The best small LLMs under 10B parameters in 2026 — fast, cheap, and deployable on-device or at the edge with strong benchmark scores.
Verdict
Phi-4 Mini is the runner-up, about 28 points behind MiniMax M2.7 on MMLU-Pro (52.8% versus 80.43%).
Single-source result: MiniMax M2.7 scored 80.4% on MMLU-Pro, more than five points above the next GA (generally available) score (56.0%). We dropped it one GA rank until another source corroborates the result.
Small models (≤10B active parameters) rank on MMLU-Pro, then GPQA Diamond, MMLU, and HellaSwag.
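The ordering rule and the single-source demotion described above can be sketched as follows. This is a minimal illustration, not LLMReference's actual implementation: the `Model` class, the `sources` field, the choice to sort missing scores last, and the reading of "dropped one rank" as a swap with the next model are all assumptions. The sketch also assumes every model passed in is GA.

```python
from dataclasses import dataclass

# Benchmark priority taken from the methodology line above.
BENCHMARK_ORDER = ["MMLU-Pro", "GPQA Diamond", "MMLU", "HellaSwag"]
OUTLIER_MARGIN = 5.0  # points; from the "more than five points" callout


@dataclass
class Model:  # hypothetical record type
    name: str
    scores: dict       # benchmark name -> percent; absent if unscored
    sources: int = 1   # assumed count of sources for the primary score


def sort_key(m):
    # Negate so higher scores sort first; a missing score becomes +inf
    # and therefore sorts after every real score.
    return tuple(-m.scores.get(b, float("-inf")) for b in BENCHMARK_ORDER)


def rank(models):
    ranked = sorted(models, key=sort_key)
    if len(ranked) >= 2:
        top, nxt = ranked[0], ranked[1]
        a = top.scores.get("MMLU-Pro")
        b = nxt.scores.get("MMLU-Pro")
        # Assumed rule: an uncorroborated leader that beats the next score
        # by more than the margin is moved down one place.
        if top.sources == 1 and a is not None and b is not None and a - b > OUTLIER_MARGIN:
            ranked[0], ranked[1] = nxt, top
    return ranked


models = [
    Model("MiniMax M2.7", {"MMLU-Pro": 80.43}),
    Model("Phi-4 Mini", {"MMLU-Pro": 52.8}),
    Model("Granite 4.1 8B", {"MMLU-Pro": 55.99}),
    Model("Qwen3-8B", {}),
]
for position, m in enumerate(rank(models), start=1):
    print(position, m.name)
```

Under these assumptions the four sample rows come out in the same order as in the table: Granite 4.1 8B, MiniMax M2.7, Phi-4 Mini, then the unscored Qwen3-8B.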
| # | Model | Tags | MMLU-Pro | Input $/1M | Output $/1M |
|---|---|---|---|---|---|
| 1 | Granite 4.1 8B | | 55.99% | $0.05 | $0.10 |
| 2 | MiniMax M2.7 | Reasoning, Tools | 80.43% | $0.28 | $1.20 |
| 3 | Phi-4 Mini | | 52.8% | $0.90 | $0.90 |
| 4 | Gemma 2 9B | | 52.08% | $0.06 | $0.18 |
| 5 | LFM2.5 8B A1B | Reasoning, Tools | 50.5% | — | — |
| 6 | Granite 4.1 3B | | 49.83% | — | — |
| 7 | Phi-3 Mini 4k | | 45.66% | $0.05 | $0.25 |
| 8 | LFM2.5 1.2B Instruct | Tools | 44.35% | — | — |
| 9 | Llama 3.1 8B Instruct | | 44.25% | $0.02 | $0.05 |
| 10 | Llama 3 8B Instruct | | 40.5% | $0.02 | $0.04 |
| 11 | Llama 3.2 3B Instruct | | 34.7% | $0.03 | $0.05 |
| 12 | Llama 3.2 1B Instruct | | 20% | $0.03 | $0.10 |
| 13 | Qwen3-8B | | — | $0.04 | $0.14 |
| 14 | Qwen2-7B | | — | $0.05 | $0.15 |
| 15 | Gemma 7B Instruct | | — | $0.05 | $0.07 |
| 16 | OpenChat 3.5 (0106) | | — | $0.07 | $0.07 |
| 17 | Starling LM 7B Beta | | — | — | — |
| 18 | Zephyr 7B Beta | | — | $0.05 | $0.20 |
| 19 | Qwen2.5-7B-Instruct | | — | $0.03 | $0.03 |
| 20 | Aya 23 8B | | — | — | — |
Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data—spot-check the model page before relying on a capability claim.
LFM2.5-8B-A1B is Liquid AI's latest on-device mixture-of-experts model, succeeding LFM2-8B-A1B. It has 8.3B total parameters with approximately 1.5B active per token (the A1B label uses a rounded ~1B figure). The architecture combines 18 double-gated LIV convolutional layers with 6 GQA attention layers, trained on 38 trillion tokens. The context window expands to 128K tokens (up from 32K in the predecessor). It is a reasoning model that generates explicit chain-of-thought steps before producing its final answer; the MoE design keeps those reasoning tokens cheap. Strong tool-calling, function-calling, and instruction-following capabilities make it well-suited for agentic workflows on edge hardware. Weights are openly available on Hugging Face under the lfm1.0 license.
MMLU-Pro: 50.5%
IBM Granite 4.1 3B is a dense decoder-only transformer instruct model with a 131K-token context window. It supports multilingual dialog (12 languages), code (FIM), tool-calling, and RAG. Trained with SFT and RL alignment on an NVIDIA GB200 NVL72 cluster. Licensed under Apache 2.0.
MMLU-Pro: 49.83%
The Phi-3 Mini-4K-Instruct model by Microsoft is an advanced, lightweight language model boasting 3.8 billion parameters, optimized for environments with limited computational resources. It excels in various natural language processing tasks, especially in reasoning, text generation, and maintaining multi-turn conversations. Trained on a mix of synthetic and high-quality data, the model is tailored for effective instruction-following. Despite its capabilities, it has limitations in factual knowledge and multilingual support, often requiring external resources to enhance accuracy. The model is ideal for commercial and research applications that demand efficient processing, such as mobile apps and real-time systems.
MMLU-Pro: 45.66%
Side-by-side comparison of the top picks by price, benchmark, and API access.
MiniMax M2.7 is the current LLMReference top pick for small-model deployment. The verdict uses the stored category signal MMLU-Pro: 80.43%. Output pricing starts at $1.20 per 1M tokens. Review the linked model and provider pages before production use because availability and pricing can change.
MiniMax M2.7 leads Phi-4 Mini in the visible shortlist on MMLU-Pro: 80.43% versus 52.8%. The pricing cards show output pricing starting at $1.20 per 1M tokens for MiniMax M2.7 and $0.90 per 1M tokens for Phi-4 Mini.
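Per-1M-token prices are easier to compare once applied to a concrete workload. The sketch below uses the input and output prices listed in the table for these two models; the `cost_usd` helper and the workload size (2M input tokens, 1M output tokens) are hypothetical and chosen only for illustration.

```python
def cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
    """Estimate cost in USD from token counts and per-1M-token prices."""
    return (input_tokens / 1_000_000) * input_per_m + (output_tokens / 1_000_000) * output_per_m


# (input $/1M, output $/1M) as listed in the ranking table.
PRICES = {
    "MiniMax M2.7": (0.28, 1.20),
    "Phi-4 Mini": (0.90, 0.90),
}

# Hypothetical workload: 2M input tokens, 1M output tokens.
for name, (inp, out) in PRICES.items():
    print(name, round(cost_usd(2_000_000, 1_000_000, inp, out), 2))
```

For this input-heavy workload MiniMax M2.7 comes out cheaper despite its higher output price, because its input price is lower. A workload dominated by output tokens would shift the comparison the other way.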
LLMReference ranks LLMs for small-model deployment from stored model, benchmark, freshness, and pricing data. The current methodology summary is: Small models (≤10B active parameters) rank on MMLU-Pro, then GPQA Diamond, MMLU, and HellaSwag.
The LLM rankings on this page are updated daily as new benchmark scores, provider availability, and pricing data are tracked. The "as of" date at the top of the page shows the most recent refresh.
The podium picks are driven by the primary benchmark signal for this category (shown in the Methodology section), filtered to non-deprecated models with confirmed API availability. In ties, we prefer the more recently released model.
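The filter and tie-break described above might look like the following sketch. The `Candidate` fields, the podium size of three, and the sample data are assumptions made for illustration; none of them come from LLMReference's stored data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:  # hypothetical record type
    name: str
    primary_score: float      # primary benchmark signal for the category
    released: date
    deprecated: bool = False
    api_available: bool = True


def podium(candidates, size=3):
    # Keep only non-deprecated models with confirmed API availability.
    eligible = [c for c in candidates if not c.deprecated and c.api_available]
    # Highest score first; on equal scores, the later release date wins.
    eligible.sort(key=lambda c: (c.primary_score, c.released), reverse=True)
    return eligible[:size]


# Placeholder models with made-up scores and dates.
sample = [
    Candidate("A", 50.0, date(2025, 1, 1)),
    Candidate("B", 50.0, date(2025, 6, 1)),
    Candidate("C", 60.0, date(2025, 3, 1), deprecated=True),
    Candidate("D", 40.0, date(2024, 9, 1)),
    Candidate("E", 55.0, date(2025, 2, 1), api_available=False),
]
print([c.name for c in podium(sample)])
```

Here C and E are filtered out before ranking, and B is placed ahead of A because the two tie on score and B was released later.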
Preview models appear in the "Watch list" section but are not in the main ranked podium unless the category explicitly allows it (e.g., /best/coding and /best/agents, where preview models often lead benchmarks).
To compare models side by side, use the Compare tool at llmreference.com/compare for a breakdown of context window, pricing, benchmarks, and provider availability.
Pricing is tracked from provider documentation and updated regularly. It reflects the best available public data, not live API quotes — always verify before billing.