Live Results

Qiskit-HumanEval-Hard
Leaderboard

151 challenging quantum programming tasks testing code generation, correctness, and execution. Models are evaluated on their ability to generate valid, runnable Qiskit code that passes simulation verification.

Models Tested: 5
Total Tasks: 151
Version: Qiskit 2.0
Temperature: 0
LEADERBOARD

Model Rankings

#1  GrayGate (our system, wrapped around gemini-3-flash): 131 / 151, 86.8% pass rate
#2  gemini-3-pro-preview: 90 / 151, 59.6% pass rate
#3  gemini-3-flash-preview: 80 / 151, 53.0% pass rate
#4  gpt-5.2: 66 / 151, 43.7% pass rate
#5  deepseek-reasoner: 55 / 151, 36.4% pass rate
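
The pass rates in the rankings are the passed-task count divided by the 151 total tasks. A minimal Python sketch of that arithmetic follows, with the counts copied from the leaderboard; the `pass_rate` helper is illustrative and not part of GrayBench.

```python
# Pass counts copied from the leaderboard; the helper name is hypothetical.
TOTAL_TASKS = 151

passed = {
    "GrayGate": 131,
    "gemini-3-pro-preview": 90,
    "gemini-3-flash-preview": 80,
    "gpt-5.2": 66,
    "deepseek-reasoner": 55,
}


def pass_rate(n_passed: int, total: int = TOTAL_TASKS) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100 * n_passed / total, 1)


rates = {model: pass_rate(n) for model, n in passed.items()}
for model, rate in rates.items():
    print(f"{model}: {rate:.1f}%")
```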
ABOUT GRAYBENCH

Open-Source Evaluation Framework

GrayBench is a reproducible benchmarking suite developed by GrayArea Labs. Every benchmark run executes LLM-generated code in an isolated environment and validates it against real test cases. No self-reported scores—every result comes from actually running the code.

Verified Execution

Every submission runs in an isolated virtual environment with Qiskit 2.0. A task only passes if the code executes without errors and all assertions succeed.
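
The pass criterion described here (the code runs without errors and every assertion holds) can be approximated with a subprocess check. The sketch below is an assumption about how such a check might look, not GrayBench's actual evaluation logic: `task_passes` is a hypothetical helper, it uses the current interpreter rather than a dedicated Qiskit environment, and the toy task avoids Qiskit so it runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def task_passes(solution_code: str, test_code: str, timeout_s: float = 60.0) -> bool:
    """Hypothetical check: run solution + assertions in a fresh interpreter.

    The task counts as passed only if the process exits with code 0,
    i.e. no exception was raised and no assertion failed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],  # assumption: a real run would use the env's interpreter
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hung submission is treated as a failure
        return result.returncode == 0


# Toy task: one correct and one incorrect submission against the same assertion.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
check = "assert add(2, 3) == 5"

good_result = task_passes(good, check)
bad_result = task_passes(bad, check)
print(good_result, bad_result)
```

Running each submission in its own process keeps a crash or failed assertion in generated code from affecting the harness itself.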

Open Source

The code, datasets, and evaluation logic are all open source. Anyone can verify published scores, reproduce results, or contribute new benchmarks.

Isolated Environments

Each benchmark runs in its own virtual environment with pinned dependencies. Qiskit 2.0.0, PyTorch, and all required packages are isolated from the host system.
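
As a rough illustration of per-benchmark isolation, the sketch below creates a virtual environment with Python's standard `venv` module and builds the pip command that would install a pinned Qiskit into it. The helper names and the `bench-env` directory are hypothetical, only the `qiskit==2.0.0` pin comes from the text above, and the install command is constructed but not executed so the example needs no network access.

```python
import os
import tempfile
import venv
from pathlib import Path

# Only the Qiskit pin is taken from the text; other packages are omitted.
PINNED = ["qiskit==2.0.0"]


def create_env(env_dir: Path, with_pip: bool = True) -> Path:
    """Create a virtual environment and return the path to its interpreter."""
    venv.EnvBuilder(with_pip=with_pip, clear=True).create(env_dir)
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return env_dir / bin_dir / exe


def install_command(python: Path, requirements: list[str]) -> list[str]:
    """Build the pip command that would install pinned packages into the env."""
    return [str(python), "-m", "pip", "install", *requirements]


with tempfile.TemporaryDirectory() as tmp:
    # with_pip=False keeps the demo fast; a real setup would need pip available.
    env_python = create_env(Path(tmp) / "bench-env", with_pip=False)
    env_created = env_python.exists()
    cmd = install_command(env_python, PINNED)

print(env_created, cmd[1:])
```

Invoking pip through the environment's own interpreter (`python -m pip`) is what keeps the installed packages separate from the host system.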

BENCHMARKS

Supported Evaluations

GrayBench supports multiple benchmarks covering quantum computing and physics. Each benchmark uses real datasets hosted on Hugging Face, with verified test cases.

Benchmark | Dataset | Tasks | Focus Area
qiskitbench-hard (Current Leaderboard) | - | 151 | Complex quantum algorithms, transpilation, error correction
qiskitbench (Standard) | - | 99 | Basic quantum computing with Qiskit 2.0
critpt (Coming Soon) | CritPt-Benchmark/CritPt | 190 | Research-level physics (local execution only)
GET STARTED

Run Your Own Benchmarks

Install GrayBench and start evaluating LLMs on quantum code generation. The entire process, from setup to results, takes less than 10 minutes.

terminal

$ git clone https://github.com/GrayArea-Labs/GrayBench.git

$ cd GrayBench && pip install -e .

$ graybench env setup qiskitbench

$ graybench keys set google

$ graybench run qiskitbench-hard -m google/gemini-2.5-flash