HumanEval benchmark


OpenAI code generation benchmark — pass@1 on 164 Python problems, 0-100%

What is the HumanEval benchmark?

HumanEval tests code generation — 164 Python programming problems written by OpenAI researchers. A model passes a problem if its generated code passes all unit tests.
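To make the setup concrete, here is a minimal sketch of what a HumanEval-style task and its grading look like. The function name, docstring, and tests below are illustrative, not quoted from the dataset: the model is shown the signature and docstring, generates the body, and passes only if every unit test succeeds.

```python
# Illustrative HumanEval-style problem (assumed example, not a verbatim task).
# The model sees the signature and docstring and must complete the body.
def incr_list(l):
    """Return a list with all elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    """
    return [x + 1 for x in l]  # a model-generated completion


# Grading harness: the problem's unit tests must all pass for credit.
def check(candidate):
    assert candidate([1, 2, 3]) == [2, 3, 4]
    assert candidate([]) == []
    assert candidate([-1, 0]) == [0, 1]


check(incr_list)  # raises AssertionError if any test fails
```

Grading is purely functional: any body that satisfies the tests counts as a pass, regardless of style.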

Why it matters

HumanEval is the standard benchmark for code generation capability. If you are building a coding assistant, code review tool, or any application that generates code, HumanEval scores give you a baseline for comparing models. Higher scores correlate with better real-world code generation.

Where models stand

  1. o1 — 92.4%
  2. 92%
  3. 91.6%
  4. 90.2%
  5. 89%

Data available for 15 of 30 tracked models.

How sourc.dev tracks this

sourc.dev verifies HumanEval benchmark data manually against official provider documentation, API responses, and published specifications. Every data point includes a source URL and verification date. When a value changes, the old value is preserved in the history table and the new value is recorded alongside it. Nothing is overwritten — the full timeline is always available.


Frequently asked questions

What programming language does HumanEval use?

Python. All 164 problems are Python functions with docstrings, and the model must generate correct Python code that passes the provided unit tests. Some extended versions (HumanEval+, MultiPL-E) test other languages.

What does pass@1 mean?

pass@1 is the probability that a single generated solution passes all tests. It is the strictest measure — one attempt, pass or fail. Some benchmarks also report pass@10 or pass@100, which allow multiple attempts.
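Because a model's outputs are sampled, pass@1 is usually estimated rather than measured from a single try: generate n samples per problem, count the c that pass, and apply the unbiased pass@k estimator from the original HumanEval paper, 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of samples that passed all unit tests
    k: attempts allowed
    """
    if n - c < k:
        # Fewer failures than attempts: at least one success is guaranteed.
        return 1.0
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k).
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


# Example: 5 samples, 2 passing -> pass@1 = 2/5, pass@2 = 0.7
print(pass_at_k(5, 2, 1))  # 0.4
print(pass_at_k(5, 2, 2))  # 0.7
```

The product form avoids computing large binomial coefficients directly; for k=1 it reduces to the simple pass rate c/n.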

Is HumanEval representative of real coding tasks?

HumanEval covers algorithmic and utility functions. It does not test full-application development, debugging, refactoring, or working with large codebases. It is a useful signal for code generation ability but not a complete picture of coding capability.
