Which LLM is best for classification?

Gemma 2 27B is the current LLMReference top pick for classification. The verdict uses the stored category signal HellaSwag: 92.6%. Output pricing starts at $0.24 per 1M tokens. Review the linked model and provider pages before production use because availability and pricing can change.

How does Gemma 2 27B compare to GPT-5.5 for classification?

Gemma 2 27B is the top visible pick with HellaSwag: 92.6%; GPT-5.5 is the runner-up with MMLU: 92.4%. The two scores come from different benchmarks, so they are not directly comparable. Output pricing starts at $0.24 per 1M tokens for Gemma 2 27B and $30.00 per 1M tokens for GPT-5.5.

How does LLMReference rank LLMs for classification?

LLMReference ranks LLMs for classification from stored model, benchmark, freshness, and pricing data. The current methodology summary is: Classification picks take the strongest score across MMLU-Pro, MMLU, and lighter classification benchmarks, then recency.

Best LLMs for Classification (2026)

Last refreshed 2026-06-27. Next refresh: weekly.

Compare models for routing, moderation, extraction, safety labels, and structured classification by sourced benchmark coverage and pricing.

Verdict

Use Gemma 2 27B for classification today.

GPT-5.5 is the runner-up. Its signal is MMLU (92.4%) while Gemma 2 27B's is HellaSwag (92.6%), so weigh the two benchmarks rather than the raw scores.

1st: Top pick

Researched 39d ago

Gemma 2 27B

HellaSwag: 92.6%
Output (from): $0.240 / 1M

2nd: Shortlist

Researched 14d ago

GPT-5.5

MMLU: 92.4%
Output (from): $30.00 / 1M

3rd: Shortlist

Researched 39d ago

Qwen2-7B

HellaSwag: 92%
Output (from): $0.150 / 1M

How we rank

Classification picks take the strongest score across MMLU-Pro, MMLU, and lighter classification benchmarks, then recency.
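The rule above reads like a two-key sort: best eligible benchmark score first, newer release as the tie-breaker. Below is a minimal Python sketch under that reading. The `Model` class, the benchmark tuple, the model names, and the release dates are hypothetical placeholders, not LLMReference's actual schema or data, and treating "ToxiGen coverage" as an ordinary score is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date

# Benchmarks named in the methodology. Treating them all as comparable
# percentage scores is an assumption of this sketch.
CLASSIFICATION_BENCHMARKS = ("MMLU-Pro", "MMLU", "HellaSwag", "BoolQ", "ANLI", "ToxiGen")


@dataclass
class Model:
    name: str                      # hypothetical model name
    release: date                  # placeholder release date
    scores: dict = field(default_factory=dict)  # benchmark name -> score in percent


def best_signal(model):
    """Return (benchmark, score) for the model's strongest eligible score, or None."""
    eligible = {b: s for b, s in model.scores.items() if b in CLASSIFICATION_BENCHMARKS}
    if not eligible:
        return None
    benchmark = max(eligible, key=eligible.get)
    return benchmark, eligible[benchmark]


def rank(models):
    """Order models by strongest signal (descending), then by newer release."""
    scored = [(m, best_signal(m)) for m in models]
    scored = [(m, sig) for m, sig in scored if sig is not None]
    # reverse=True sorts higher scores first and, on equal scores, later dates first.
    scored.sort(key=lambda pair: (pair[1][1], pair[0].release), reverse=True)
    return scored


if __name__ == "__main__":
    models = [
        Model("model-a", date(2024, 1, 1), {"MMLU": 90.0, "HellaSwag": 93.8}),
        Model("model-b", date(2024, 6, 1), {"HellaSwag": 93.8}),
        Model("model-c", date(2025, 1, 1), {"MMLU": 98.0}),
    ]
    for position, (model, (benchmark, score)) in enumerate(rank(models), start=1):
        print(position, model.name, benchmark, score)
```

Because each model contributes only its single best score, two adjacent rows can rest on different benchmarks, which is why the "Signal used" column matters when reading the leaderboard.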

Eligibility — Models tagged for the classification decision task (routing, moderation, extraction, or classification benchmarks).
Primary ranking — Maximum score among MMLU-Pro, MMLU, HellaSwag, BoolQ, ANLI, ToxiGen coverage, then newer release.
Podium freshness — Shortlist cards require `lastResearched` within 60 days and a tracked public output price. Older or unpriced models remain in the full leaderboard with a visible “Verify pricing” reminder when research is older than 45 days.
Variant collapse — We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
Pricing — Lowest tracked commercial token pricing.
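The podium freshness and variant collapse rules above can be sketched as a pair of predicates and a tie-break key. This is one possible reading, not LLMReference's implementation: the `Sku` class and its field names are hypothetical (snake_case stand-ins for `lastResearched`, `familySlug`, and `release`), "tied" is interpreted as "within 0.5 pt of the best sibling", and an untracked input price is assumed to sort last.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

PODIUM_MAX_AGE_DAYS = 60        # shortlist cards need research within 60 days
VERIFY_PRICING_AFTER_DAYS = 45  # reminder shown when research is older than this
TIE_MARGIN_PTS = 0.5            # headline-score noise band


@dataclass
class Sku:
    name: str
    family_slug: str
    tier: str
    score: float
    input_price: Optional[float]   # USD per 1M tokens, None if untracked
    output_price: Optional[float]  # USD per 1M tokens, None if untracked
    is_ga: bool                    # True for GA, False for preview or limited access
    release: date
    last_researched: date


def podium_eligible(sku, today):
    """Fresh research and a tracked public output price are both required."""
    age_days = (today - sku.last_researched).days
    return age_days <= PODIUM_MAX_AGE_DAYS and sku.output_price is not None


def needs_verify_pricing(sku, today):
    """True when the research is old enough to warrant the reminder."""
    return (today - sku.last_researched).days > VERIFY_PRICING_AFTER_DAYS


def canonical_sku(siblings):
    """Pick one SKU from a group sharing family_slug and tier."""
    top = max(s.score for s in siblings)
    tied = [s for s in siblings if top - s.score <= TIE_MARGIN_PTS]

    def tie_break(s):
        # Assumption: an untracked input price sorts after any tracked price.
        price = s.input_price if s.input_price is not None else float("inf")
        # Lowest input price, then GA before preview, then newest release.
        return (price, not s.is_ga, -s.release.toordinal())

    return min(tied, key=tie_break)
```

With a reference date of 2026-06-27, a SKU last researched on 2026-05-19 is 39 days old, so it passes the 60-day podium gate and does not yet trigger the 45-day reminder.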

#	Model	Signal used	Context (tokens)	Input $/1M	Output $/1M
1	Gemini 3.1 Pro Preview (Preview, Vision, Tools)	MMLU 98%	1M	$2.00	$12.00
2	Llama 3.1 405B	HellaSwag 95.8%	128K	—	—
3	DeepSeek V3 (Tools)	HellaSwag 95.7%	64K	$0.10	$0.28
4	Qwen2.5-72B-Instruct	HellaSwag 95.6%	128K	$0.18	$0.28
5	Llama 3.1 70B Instruct	HellaSwag 94.2%	128K	$0.40	$0.40
6	Mistral Large 2 (Vision, Tools)	HellaSwag 93.8%	128K	$0.48	$2.40
7	Mixtral 8x22B v0.1	HellaSwag 93.8%	64K	$0.65	$0.65
8	Falcon 180B	HellaSwag 92.7%	—	—	—
9	Gemma 2 27B	HellaSwag 92.6%	8K	$0.08	$0.24
10	GPT-5.5 (Reasoning, Vision, Tools)	MMLU 92.4%	1.05M	$5.00	$30.00
11	Llama 3 70B	HellaSwag 92.4%	8K	$0.65	$2.75
12	Qwen2-7B	HellaSwag 92%	128K	$0.05	$0.15
13	Gemini 3 Pro (Vision, Tools)	MMLU-Pro 91.8%	1M	$1.25	$5.00
14	Mistral NeMo Instruct (2407)	HellaSwag 91.8%	128K	$0.02	$0.04
15	DeepSeek Coder V2 Lite	HellaSwag 91.4%	128K	$0.50	$0.50
16	Claude Opus 4.6 (Reasoning, Vision, Tools)	MMLU 91.1%	1M	$5.00	$25.00
17	Llama 3 8B Instruct	HellaSwag 91.1%	8K	$0.02	$0.04
18	Mixtral 8x7B	HellaSwag 90.9%	32K	$0.15	$0.20
19	Phi-3 Small 128K	HellaSwag 90.8%	128K	$0.35	$1.05
20	Mistral 7B Instruct v0.3 (Tools)	HellaSwag 90.2%	32K	$0.20	$0.20
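The input and output prices in the table can be combined into a rough workload estimate. The sketch below is illustrative only: the function names are hypothetical, and the request count and per-request token counts (500 input tokens, 5 output tokens) are assumptions chosen to resemble a short-label classification call, not measurements. The prices are the Gemma 2 27B and GPT-5.5 rows from the table.

```python
def workload_cost_usd(requests, input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Estimate total cost in USD from per-1M-token prices.

    input_tokens and output_tokens are per-request averages.
    """
    total_input = requests * input_tokens
    total_output = requests * output_tokens
    return (total_input * input_price_per_m
            + total_output * output_price_per_m) / 1_000_000


if __name__ == "__main__":
    # Assumed workload: 1M requests, 500 input and 5 output tokens each.
    requests, tokens_in, tokens_out = 1_000_000, 500, 5

    gemma = workload_cost_usd(requests, tokens_in, tokens_out, 0.08, 0.24)
    gpt = workload_cost_usd(requests, tokens_in, tokens_out, 5.00, 30.00)

    print(f"Gemma 2 27B: ${gemma:.2f}")  # Gemma 2 27B: $41.20
    print(f"GPT-5.5: ${gpt:.2f}")        # GPT-5.5: $2650.00
```

Under these assumptions almost all of the spend comes from input tokens, so for classification workloads with short outputs the input price column may matter more than the headline output price.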

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data—spot-check the model page before relying on a capability claim.

#4 Gemini 3 Pro (MMLU-Pro: 91.8%)
Google DeepMind's most advanced reasoning Gemini model. Part of the Gemini 3 series with frontier-class intelligence, multimodal understanding, and 1M token context window.

#5 Mistral NeMo Instruct (2407) (HellaSwag: 91.8%)
Mistral NeMo Instruct (2407) is MistralAI's Mistral NeMo model. It offers a 128K-token context window and scores 57.1 on GPQA.

#6 DeepSeek Coder V2 Lite (HellaSwag: 91.4%)
DeepSeek Coder V2 Lite is an open-source Mixture-of-Experts (MoE) language model specifically tailored for efficiency and cost-effectiveness in coding tasks. It operates with a 15.7B parameter count, but only 2.4B are active at any given time, making it comparable to GPT4-Turbo for code-centric applications. This model supports 338 programming languages and has an extended context length of 128K tokens, facilitating the handling of complex codebases and lengthy prompts. Its features encompass code generation, completion, understanding, and mathematical reasoning, making it versatile for diverse coding applications. Available on Hugging Face, Ollama, and other platforms, DeepSeek Coder V2 Lite offers accessible solutions for developers and researchers, with performance that rivals or surpasses some closed-source models.

Compare Top Picks

Side-by-side comparison of the top picks by price, benchmark, and API access.

Gemini 3.1 Pro Preview vs Llama 3.1 405B
Gemini 3.1 Pro Preview vs DeepSeek V3
Gemini 3.1 Pro Preview vs Qwen2.5-72B-Instruct
Gemini 3.1 Pro Preview vs Llama 3.1 70B Instruct
Llama 3.1 405B vs DeepSeek V3
Llama 3.1 405B vs Qwen2.5-72B-Instruct