SWE-bench Verified

About

A benchmark of real-world GitHub issue resolution, consisting of 500 human-verified tasks. It measures end-to-end software engineering ability, reported as the percentage of issues that an AI agent resolves correctly.
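
The score is a simple ratio, so it can be sketched in a few lines. The following is a minimal illustration, not the benchmark's own harness: the function name `resolved_rate`, the task IDs, and the outcomes are all hypothetical and chosen only to show the arithmetic.

```python
def resolved_rate(outcomes: dict[str, bool]) -> float:
    """Return the percentage of tasks whose outcome is marked resolved."""
    if not outcomes:
        raise ValueError("no task outcomes provided")
    # True counts as 1 and False as 0, so sum() gives the number resolved.
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Hypothetical outcomes for four tasks (illustrative IDs, not benchmark data).
example = {"task-1": True, "task-2": False, "task-3": True, "task-4": True}
print(f"{resolved_rate(example):.1f}% resolved")  # prints "75.0% resolved"
```

On the full benchmark the denominator would be the 500 tasks, so each resolved issue contributes 0.2 percentage points to the score.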