Question 1

What does the SWE-rebench benchmark measure?

Accepted Answer

SWE-rebench evaluates LLM coding agents on real-world GitHub issues sourced after each model's training cutoff, which guards against benchmark contamination. It uses standardized ReAct scaffolding with a 128K-token context; each model is run five times per problem, and the best Pass@1 resolved rate is reported. This page ranks 14 tracked models, and higher scores are better.
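As a rough illustration of the scoring rule described above, here is a minimal Python sketch. It assumes each run is recorded as a mapping from problem ID to a resolved/unresolved flag; the data layout, the function names, and the sample results are all hypothetical and are not taken from the SWE-rebench tooling.

```python
from typing import Dict, List


def resolved_rate(run: Dict[str, bool]) -> float:
    """Pass@1 for one run: the fraction of problems resolved in that run."""
    return sum(run.values()) / len(run)


def best_resolved_rate(runs: List[Dict[str, bool]]) -> float:
    """Best Pass@1 resolved rate across all runs of the same model."""
    return max(resolved_rate(run) for run in runs)


# Hypothetical results: five runs of one model over four made-up problems.
runs = [
    {"issue-1": True,  "issue-2": False, "issue-3": False, "issue-4": True},
    {"issue-1": True,  "issue-2": False, "issue-3": False, "issue-4": False},
    {"issue-1": True,  "issue-2": True,  "issue-3": False, "issue-4": True},
    {"issue-1": False, "issue-2": True,  "issue-3": False, "issue-4": True},
    {"issue-1": True,  "issue-2": False, "issue-3": True,  "issue-4": False},
]

per_run = [resolved_rate(run) for run in runs]
best = best_resolved_rate(runs)
print(per_run)
print(best)
```

Each run gets exactly one attempt per problem, so the per-run figure is a Pass@1 rate; the reported score in this sketch is simply the maximum over the five runs.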

Question 2

Is a higher SWE-rebench score always better?

Accepted Answer

On this benchmark, a higher score is better, but the top scorer is not always the right pick for your workload or budget. Use the score to build a shortlist, then confirm pricing, context window, and provider availability on each model page before committing.

Question 3

How current is this SWE-rebench data?

Accepted Answer

This benchmark was last reviewed on May 28, 2026. The average tracked score rose by 21.96 points between the two most recent snapshots.
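To show how a snapshot-to-snapshot figure like this might be derived, here is a minimal Python sketch. It assumes the figure is the difference between the plain mean of tracked scores in each snapshot; the page does not state the exact method, and the scores below are made up for illustration.

```python
from statistics import mean
from typing import List


def average_change(previous: List[float], latest: List[float]) -> float:
    """Difference between the mean scores of two snapshots, rounded to 2 places."""
    return round(mean(latest) - mean(previous), 2)


# Hypothetical snapshots (not real SWE-rebench data).
previous_snapshot = [30.0, 40.0, 50.0]
latest_snapshot = [35.0, 45.0, 55.0]

change = average_change(previous_snapshot, latest_snapshot)
print(f"{change:+.2f} points")
```

Under this assumption, a large swing can come from a change in which models are tracked as well as from changes in individual scores.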
