LLM Reference

Best LLMs for Classification (2026)

Last refreshed 2026-06-27. Next refresh: weekly.

Compare models for routing, moderation, extraction, safety labels, and structured classification by sourced benchmark coverage and pricing.

Verdict

Use Gemma 2 27B for classification today.

GPT-5.5 is the runner-up. The two picks are scored on different benchmarks (Gemma 2 27B on HellaSwag, GPT-5.5 on MMLU), so compare them with that in mind.

Researched 39 days ago.

How we rank

Classification picks are ranked by each model's strongest score across MMLU-Pro, MMLU, and lighter classification benchmarks, with ties broken by recency.
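The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the `Model` class, its field names, and the sample models and dates are all hypothetical, and the benchmark list is taken from the ranking steps on this page.

```python
from dataclasses import dataclass
from datetime import date

# Benchmarks named in the ranking rule on this page.
BENCHMARKS = ("MMLU-Pro", "MMLU", "HellaSwag", "BoolQ", "ANLI", "ToxiGen")


@dataclass
class Model:
    # Hypothetical record shape, for illustration only.
    name: str
    scores: dict[str, float]  # benchmark name -> score in percent
    release: date


def best_signal(model: Model) -> tuple[str, float]:
    """Return (benchmark, score) for the model's strongest tracked benchmark.

    Assumes the model has at least one tracked score (the eligibility step).
    """
    tracked = {b: s for b, s in model.scores.items() if b in BENCHMARKS}
    bench = max(tracked, key=tracked.get)
    return bench, tracked[bench]


def rank(models: list[Model]) -> list[Model]:
    """Sort by strongest score (highest first), then by newer release."""
    return sorted(models, key=lambda m: (best_signal(m)[1], m.release), reverse=True)


# Made-up models and placeholder dates, not real leaderboard entries.
models = [
    Model("model-a", {"HellaSwag": 92.4}, date(2024, 1, 1)),
    Model("model-b", {"MMLU": 92.4, "MMLU-Pro": 80.0}, date(2026, 1, 1)),
    Model("model-c", {"HellaSwag": 92.6}, date(2024, 6, 1)),
]

for m in rank(models):
    print(m.name, *best_signal(m))
```

Note that the signal can come from a different benchmark for each model, which is why the leaderboard shows a "Signal used" line per row.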

  1. Eligibility: Models tagged for the classification decision task (routing, moderation, extraction, or classification benchmarks).
  2. Primary ranking: Maximum score among MMLU-Pro, MMLU, HellaSwag, BoolQ, ANLI, ToxiGen coverage, then newer release.
  3. Podium freshness: Shortlist cards require `lastResearched` within 60 days and a tracked public output price. Older or unpriced models remain in the full leaderboard with a visible “Verify pricing” reminder when research is older than 45 days.
  4. Variant collapse: We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
  5. Pricing: Lowest tracked commercial token pricing.
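The podium freshness and variant collapse steps can be sketched as follows. This is a rough illustration under stated assumptions: the `Sku` class and every field name other than the concepts named above (`lastResearched`, `familySlug`, `release`) are hypothetical, ties are measured against the top score in each family, and a sibling outside the tie band is assumed to lose to the higher score.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

TIE_MARGIN_PT = 0.5             # headline scores within this band count as tied
PODIUM_MAX_AGE_DAYS = 60        # shortlist cards need research this recent
VERIFY_PRICING_AFTER_DAYS = 45  # older research gets a "Verify pricing" reminder


@dataclass
class Sku:
    # Hypothetical record shape, for illustration only.
    name: str
    family_slug: str
    param_tier: str
    score: float
    input_price: Optional[float]   # $ per 1M input tokens, None if untracked
    output_price: Optional[float]  # $ per 1M output tokens, None if untracked
    is_ga: bool                    # False for preview or limited access
    release: date
    last_researched: date


def podium_eligible(sku: Sku, today: date) -> bool:
    """Shortlist rule: fresh research and a tracked output price."""
    age_days = (today - sku.last_researched).days
    return age_days <= PODIUM_MAX_AGE_DAYS and sku.output_price is not None


def needs_verify_pricing(sku: Sku, today: date) -> bool:
    return (today - sku.last_researched).days > VERIFY_PRICING_AFTER_DAYS


def canonical_sku(siblings: list[Sku]) -> Sku:
    """Pick one SKU: lowest input price, then GA, then newest release."""
    top = max(s.score for s in siblings)
    tied = [s for s in siblings if top - s.score <= TIE_MARGIN_PT]
    return min(
        tied,
        key=lambda s: (
            s.input_price if s.input_price is not None else float("inf"),
            not s.is_ga,             # False (GA) sorts before True (preview)
            -s.release.toordinal(),  # newer release sorts first
        ),
    )


def collapse(skus: list[Sku]) -> list[Sku]:
    """Keep one row per (family, parameter tier)."""
    groups: dict[tuple[str, str], list[Sku]] = {}
    for s in skus:
        groups.setdefault((s.family_slug, s.param_tier), []).append(s)
    return [canonical_sku(g) for g in groups.values()]
```

Because input price is checked before GA status, a cheaper preview SKU would be preferred over a pricier GA sibling when their scores tie, which matches the order of tie-breakers in the list above.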
| # | Model | Tags | Signal used | Input $/1M | Output $/1M |
|---|-------|------|-------------|------------|-------------|
| 1 | Gemini 3.1 Pro Preview | Preview, Vision, Tools | MMLU 98% | $2.00 | $12.00 |
| 2 | Llama 3.1 405B | | HellaSwag 95.8% | not listed | not listed |
| 3 | DeepSeek V3 | Tools | HellaSwag 95.7% | $0.10 | $0.28 |
| 4 | Qwen2.5-72B-Instruct | | HellaSwag 95.6% | $0.18 | $0.28 |
| 5 | Llama 3.1 70B Instruct | | HellaSwag 94.2% | $0.40 | $0.40 |
| 6 | Mistral Large 2 | Vision, Tools | HellaSwag 93.8% | $0.48 | $2.40 |
| 7 | Mixtral 8x22B v0.1 | | HellaSwag 93.8% | $0.65 | $0.65 |
| 8 | Falcon 180B | | HellaSwag 92.7% | not listed | not listed |
| 9 | Gemma 2 27B | | HellaSwag 92.6% | $0.08 | $0.24 |
| 10 | GPT-5.5 | Reasoning, Vision, Tools | MMLU 92.4% | $5.00 | $30.00 |
| 11 | Llama 3 70B | | HellaSwag 92.4% | $0.65 | $2.75 |
| 12 | Qwen2-7B | | HellaSwag 92% | $0.05 | $0.15 |
| 13 | Gemini 3 Pro | Vision, Tools | MMLU-Pro 91.8% | $1.25 | $5.00 |
| 14 | Mistral NeMo Instruct (2407) | | HellaSwag 91.8% | $0.02 | $0.04 |
| 15 | DeepSeek Coder V2 Lite | | HellaSwag 91.4% | $0.50 | $0.50 |
| 16 | Claude Opus 4.6 | Reasoning, Vision, Tools | MMLU 91.1% | $5.00 | $25.00 |
| 17 | Llama 3 8B Instruct | | HellaSwag 91.1% | $0.02 | $0.04 |
| 18 | Mixtral 8x7B | | HellaSwag 90.9% | $0.15 | $0.20 |
| 19 | Phi-3 Small 128K | | HellaSwag 90.8% | $0.35 | $1.05 |
| 20 | Mistral 7B Instruct v0.3 | Tools | HellaSwag 90.2% | $0.20 | $0.20 |
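To turn the per-million-token prices into a budget figure, a small calculation helps. The sketch below is illustrative only: the workload (1,000,000 requests, 500 input tokens and 5 output tokens each) is an assumption chosen to resemble a short-label classification job, and the function name is hypothetical. The prices are the Gemma 2 27B and GPT-5.5 rows from the leaderboard.

```python
def workload_cost_usd(
    requests: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Total cost in USD given prices quoted per 1M tokens."""
    input_millions = requests * input_tokens_per_request / 1_000_000
    output_millions = requests * output_tokens_per_request / 1_000_000
    return input_millions * input_price_per_m + output_millions * output_price_per_m


# Assumed workload: 1M requests, long prompt, very short label as output.
REQUESTS, IN_TOK, OUT_TOK = 1_000_000, 500, 5

gemma = workload_cost_usd(REQUESTS, IN_TOK, OUT_TOK, 0.08, 0.24)
gpt55 = workload_cost_usd(REQUESTS, IN_TOK, OUT_TOK, 5.00, 30.00)

print(f"Gemma 2 27B: ${gemma:,.2f}")
print(f"GPT-5.5:     ${gpt55:,.2f}")
```

Under these assumptions the input side dominates the bill (500 million input tokens against 5 million output tokens), so for classification the input price usually matters more than the output price.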

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLM Reference seed data; spot-check the model page before relying on a capability claim.

  • #4 Gemini 3 Pro

    Google DeepMind's most advanced reasoning Gemini model. Part of the Gemini 3 series with frontier-class intelligence, multimodal understanding, and a 1M-token context window.

    MMLU-Pro: 91.8%

  • #5 Mistral NeMo Instruct (2407)

    Mistral NeMo Instruct (2407) is Mistral AI's Mistral NeMo model. It offers a 128K-token context window and scores 57.1 on GPQA.

    HellaSwag: 91.8%

  • #6 DeepSeek Coder V2 Lite

    DeepSeek Coder V2 Lite is an open-source Mixture-of-Experts (MoE) language model tailored for efficient, low-cost coding tasks. It has 15.7B total parameters, of which only 2.4B are active at any given time, and its description positions it as comparable to GPT-4 Turbo for code-centric applications. It supports 338 programming languages and a 128K-token context window, which helps with complex codebases and lengthy prompts. Its features cover code generation, completion, understanding, and mathematical reasoning. It is available on Hugging Face, Ollama, and other platforms, with performance described as rivaling or surpassing some closed-source models.

    HellaSwag: 91.4%