Gemma 2 27B: HellaSwag 92.6%, output from $0.240 / 1M tokens
Last refreshed 2026-06-27. Next refresh: weekly.
Compare models for routing, moderation, extraction, safety labels, and structured classification by sourced benchmark coverage and pricing.
Verdict
GPT-5.5 is the runner-up. When comparing picks, check which benchmark supplied each score: some models are ranked on HellaSwag and others on MMLU.
Classification picks take the strongest score across MMLU-Pro, MMLU, and lighter classification benchmarks, then recency.
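The ranking rule above (strongest score across benchmarks, then recency) can be sketched roughly as follows. This is a minimal illustration and not the site's actual implementation: the `Candidate` type, the `released` field, the model labels, and the dates are all hypothetical placeholders, and scores from different benchmarks are compared directly only because the stated rule does so.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    name: str                 # hypothetical model label
    scores: Dict[str, float]  # benchmark name -> score in percent
    released: date            # placeholder release date, used only for tie-breaks


def signal_used(c: Candidate) -> Tuple[str, float]:
    """Return the benchmark with the strongest score, and that score."""
    bench = max(c.scores, key=lambda b: c.scores[b])
    return bench, c.scores[bench]


def rank(cands: List[Candidate]) -> List[Candidate]:
    """Sort by strongest score (descending), then by release date (newest first)."""
    return sorted(cands, key=lambda c: (signal_used(c)[1], c.released), reverse=True)


# Illustrative data only: two candidates tie at 91.8%, so recency decides.
candidates = [
    Candidate("model-a", {"HellaSwag": 91.8}, date(2024, 1, 1)),
    Candidate("model-b", {"MMLU-Pro": 91.8, "MMLU": 90.0}, date(2025, 1, 1)),
    Candidate("model-c", {"HellaSwag": 95.8}, date(2024, 6, 1)),
]

ranked = rank(candidates)
for position, c in enumerate(ranked, start=1):
    bench, score = signal_used(c)
    print(position, c.name, f"Signal used: {bench} {score}%")
```

In this sketch `model-c` ranks first on score, and `model-b` is placed ahead of `model-a` because the tie at 91.8% is broken by the later placeholder date.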
| # | Model | Tags | Signal used | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | Preview, Vision, Tools | MMLU 98% | $2.00 | $12.00 |
| 2 | Llama 3.1 405B | | HellaSwag 95.8% | n/a | n/a |
| 3 | DeepSeek V3 | Tools | HellaSwag 95.7% | $0.10 | $0.28 |
| 4 | Qwen2.5-72B-Instruct | | HellaSwag 95.6% | $0.18 | $0.28 |
| 5 | Llama 3.1 70B Instruct | | HellaSwag 94.2% | $0.40 | $0.40 |
| 6 | Mistral Large 2 | Vision, Tools | HellaSwag 93.8% | $0.48 | $2.40 |
| 7 | Mixtral 8x22B v0.1 | | HellaSwag 93.8% | $0.65 | $0.65 |
| 8 | Falcon 180B | | HellaSwag 92.7% | n/a | n/a |
| 9 | Gemma 2 27B | | HellaSwag 92.6% | $0.08 | $0.24 |
| 10 | GPT-5.5 | Reasoning, Vision, Tools | MMLU 92.4% | $5.00 | $30.00 |
| 11 | Llama 3 70B | | HellaSwag 92.4% | $0.65 | $2.75 |
| 12 | Qwen2-7B | | HellaSwag 92% | $0.05 | $0.15 |
| 13 | Gemini 3 Pro | Vision, Tools | MMLU-Pro 91.8% | $1.25 | $5.00 |
| 14 | Mistral NeMo Instruct (2407) | | HellaSwag 91.8% | $0.02 | $0.04 |
| 15 | DeepSeek Coder V2 Lite | | HellaSwag 91.4% | $0.50 | $0.50 |
| 16 | Claude Opus 4.6 | Reasoning, Vision, Tools | MMLU 91.1% | $5.00 | $25.00 |
| 17 | Llama 3 8B Instruct | | HellaSwag 91.1% | $0.02 | $0.04 |
| 18 | Mixtral 8x7B | | HellaSwag 90.9% | $0.15 | $0.20 |
| 19 | Phi-3 Small 128K | | HellaSwag 90.8% | $0.35 | $1.05 |
| 20 | Mistral 7B Instruct v0.3 | Tools | HellaSwag 90.2% | $0.20 | $0.20 |
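Because the table prices input and output tokens separately per 1M tokens, the cost of a classification workload depends on its token mix. The sketch below shows the arithmetic for two rows of the table (Gemma 2 27B and GPT-5.5). The workload figures (one million requests, 500 input tokens and 5 output tokens each) and the function name `workload_cost_usd` are assumptions chosen for illustration, not measurements.

```python
def workload_cost_usd(
    requests: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate total cost in USD from per-1M-token input and output prices."""
    input_millions = requests * input_tokens_per_request / 1_000_000
    output_millions = requests * output_tokens_per_request / 1_000_000
    return input_millions * input_price_per_m + output_millions * output_price_per_m


# Hypothetical workload: short label outputs, moderately long inputs.
REQUESTS = 1_000_000
INPUT_TOKENS = 500
OUTPUT_TOKENS = 5

# Prices taken from the table above (USD per 1M tokens).
gemma_cost = workload_cost_usd(REQUESTS, INPUT_TOKENS, OUTPUT_TOKENS, 0.08, 0.24)
gpt_cost = workload_cost_usd(REQUESTS, INPUT_TOKENS, OUTPUT_TOKENS, 5.00, 30.00)

print(f"Gemma 2 27B: ${gemma_cost:,.2f}")  # about $41.20
print(f"GPT-5.5:     ${gpt_cost:,.2f}")    # about $2,650.00
```

For label-style outputs like this, input pricing dominates the total, so the input column is usually the one to compare first.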
Next seats in this ranking. The descriptions below come from each model's stored description in LLMReference seed data; spot-check the model page before relying on a capability claim.
Gemini 3 Pro (MMLU-Pro 91.8%)
Google DeepMind's most advanced reasoning Gemini model. Part of the Gemini 3 series, with frontier-class intelligence, multimodal understanding, and a 1M-token context window.

Mistral NeMo Instruct (2407) (HellaSwag 91.8%)
Mistral NeMo Instruct (2407) is MistralAI's Mistral NeMo model. It offers a 128K-token context window and scores 57.1 on GPQA.

DeepSeek Coder V2 Lite (HellaSwag 91.4%)
DeepSeek Coder V2 Lite is an open-source Mixture-of-Experts (MoE) language model tailored for efficiency and cost-effectiveness in coding tasks. It has 15.7B parameters, of which only 2.4B are active at any given time, and it is described as comparable to GPT4-Turbo for code-centric applications. The model supports 338 programming languages and a 128K-token context length, which helps with complex codebases and lengthy prompts. Its features include code generation, completion, understanding, and mathematical reasoning. It is available on Hugging Face, Ollama, and other platforms, with performance that rivals or surpasses some closed-source models.
Side-by-side comparison of the top picks by price, benchmark, and API access.