Qiskit-HumanEval-Hard
Leaderboard
151 challenging quantum programming tasks that test code generation, correctness, and execution. Models are evaluated on their ability to generate valid, runnable Qiskit code that passes simulation-based verification.
Model Rankings
Open-Source Evaluation Framework
GrayBench is a reproducible benchmarking suite developed by GrayArea Labs. Every benchmark run executes LLM-generated code in an isolated environment and validates it against real test cases. No self-reported scores—every result comes from actually running the code.
Verified Execution
Every submission runs in an isolated virtual environment with Qiskit 2.0.0. A task passes only if the code executes without errors and every assertion succeeds.
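The pass/fail logic can be sketched as a small harness: run the generated code plus the task's test code in a subprocess, and count the task as passed only when the process exits cleanly. This is an illustrative sketch, not GrayBench's actual implementation; the function name `run_task` and the timeout value are made up for the example.

```python
# Hypothetical sketch of execution-based verification: a task passes only
# if the generated code runs without errors AND all test assertions hold.
import subprocess
import sys
import tempfile

def run_task(generated_code: str, test_code: str, timeout: int = 60) -> bool:
    """Return True only if the code and its assertions execute cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n" + test_code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path],  # in practice: the benchmark venv's python
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode == 0  # non-zero exit = crash or failed assertion

# A correct solution passes; a buggy one fails, even though both "run":
passed = run_task("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")
failed = run_task("def add(a, b):\n    return a - b", "assert add(2, 3) == 5")
```

Running in a subprocess rather than `exec()` means a crash, hang, or failed assertion in generated code can never corrupt the evaluator itself.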
Open Source
The code, datasets, and evaluation logic are all open source. Anyone can verify published scores, reproduce results, or contribute new benchmarks.
Isolated Environments
Each benchmark runs in its own virtual environment with pinned dependencies. Qiskit 2.0.0, PyTorch, and all required packages are isolated from the host system.
Supported Evaluations
GrayBench supports multiple benchmarks covering quantum computing and physics. Each benchmark uses real datasets from HuggingFace with verified test cases.
Run Your Own Benchmarks
Install GrayBench and start evaluating LLMs on quantum code generation. The entire process—from setup to results—takes less than 10 minutes.
$ git clone https://github.com/GrayArea-Labs/GrayBench.git
$ cd GrayBench && pip install -e .
$ graybench env setup qiskitbench
$ graybench keys set google
$ graybench run qiskitbench-hard -m google/gemini-2.5-flash