QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
2026-04-14
Summary
This paper investigates how well large language models, already strong at writing classical code, can generate code for quantum computers. It asks whether these models actually *understand* quantum computing concepts or are just memorizing how to write code for specific quantum programming tools.
What's the problem?
Currently, evaluating a language model's ability to do quantum code generation is tricky because most tests are designed for only one specific quantum programming framework, like Qiskit. This makes it hard to tell if the model is succeeding because it understands quantum mechanics, or because it's simply familiar with the way that particular framework works. It's like testing a student on a specific textbook instead of the underlying subject matter.
What's the solution?
The researchers created a new, more comprehensive testing suite called QuanBench+. It contains 42 aligned tasks that can be solved in three popular quantum programming frameworks: Qiskit, PennyLane, and Cirq. They gave language models these problems and checked whether the generated code actually *worked* using executable functional tests, measuring how often the models succeeded (Pass@1 and Pass@5) and whether they could repair their answers when given feedback after a runtime error or wrong result. For answers that are inherently probabilistic, such as quantum measurement outcomes, they accepted an answer if its output distribution was close enough to the reference distribution under KL divergence.
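The idea behind a KL-divergence acceptance check can be sketched in a few lines. The paper does not publish its exact implementation, so the smoothing constant, the divergence direction, and the threshold below are illustrative assumptions, not the benchmark's actual settings:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) over two outcome histograms given as dicts
    mapping bitstrings to probabilities. `eps` floors
    zero-probability outcomes in Q so the log stays finite."""
    total = 0.0
    for outcome in set(p) | set(q):
        pi = p.get(outcome, 0.0)
        if pi > 0.0:
            total += pi * math.log(pi / max(q.get(outcome, 0.0), eps))
    return total

def accept(measured, reference, threshold=0.05):
    """Accept a probabilistic answer if the reference distribution is
    close to the measured one. Comparing KL(reference || measured)
    means spurious low-probability outcomes in the sample do not
    blow up the divergence, but a missing reference outcome does."""
    return kl_divergence(reference, measured) <= threshold

# Example: a sampled Bell-state histogram vs. the ideal 50/50 split
ideal = {"00": 0.5, "11": 0.5}
sampled = {"00": 0.52, "11": 0.47, "01": 0.01}
print(accept(sampled, ideal))  # small divergence, so accepted
```

A grader like this is tolerant of shot noise (the 0.52/0.47 split above) but rejects a circuit that collapses onto a single outcome, since the missing `"11"` term contributes a large floored-log penalty.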
Why does it matter?
This work is important because it shows that while language models are getting better at quantum code generation, they still heavily rely on knowing the specifics of each programming framework. It highlights that true quantum reasoning isn't fully solved yet. Developing models that genuinely understand quantum computing, rather than just mimicking code, is crucial for making quantum computers more accessible and useful.
Abstract
Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
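The Pass@1 and Pass@5 numbers in the abstract are commonly computed with the standard unbiased Pass@k estimator from the Codex evaluation literature; the sketch below shows that estimator under the assumption that the benchmark follows this convention (the sample counts are illustrative, not the paper's):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate: the probability that at least one of
    k samples drawn without replacement from n generations, of which
    c pass the functional tests, is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations for one task, 3 of which pass the tests
print(pass_at_k(10, 3, 1))  # 0.3
print(pass_at_k(10, 3, 5))  # ~0.917
```

Averaging this quantity over all 42 tasks (per framework) yields the benchmark-level scores the abstract reports; Pass@k after feedback-based repair simply recounts `c` after models are allowed one revision.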