2605.25572v1 May 25, 2026 cs.CL

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

Minghao Shao
Minghao Shao
Citations: 454
h-index: 11
Muhammad Kashif
Muhammad Kashif
New York University Abu Dhabi
Citations: 281
h-index: 11
Nouhaila Innan
Nouhaila Innan
Citations: 25
h-index: 3
Alberto Marchisio
Alberto Marchisio
Technische Universität Wien (TU Wien)
Citations: 1,877
h-index: 24
Hariharan Janardhanan
Hariharan Janardhanan
Citations: 0
h-index: 0
Muhammad Shafique
Muhammad Shafique
Citations: 223
h-index: 10

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based code assistants: general-purpose models hallucinate PennyLane-specific gate names, misplace device configurations, and produce structurally invalid circuits when faced with specialized quantum coding challenges. We present PennySynth, a retrieval-augmented generation framework that addresses this gap by conditioning LLM inference on a curated knowledge base of 13,389 PennyLane instruction-code pairs, built via a three-stage extraction, verification, and deduplication pipeline over official PennyLane repositories, community GitHub sources, and QHack competition archives. PennySynth introduces a code-aware embedding strategy using st-codesearch-distilroberta-base, trained for natural-language-to-code retrieval, increasing average retrieval cosine similarity from 0.45 to 0.726 compared to a general-purpose baseline. Evaluated across 74 challenges spanning three years of the QHack competition (2022, 2023, 2024), PennySynth achieves 64%, 68%, and 52% pass@5 on QHack 2022, 2023, and 2024, respectively, improving over Claude Sonnet 4.6 without retrieval by +28, +25, and +28 percentage points. We further introduce a quantum-adapted CodeBLEU metric that upweights qml.* token patterns and show that structural code similarity and functional correctness capture distinct aspects of quantum code quality. Controlled ablations reveal that code-aware embeddings are the primary driver of retrieval performance, while dataset expansion and source composition provide additional gains when retrieval quality is sufficiently precise.

0 Citations
0 Influential
12 Altmetric
60.0 Score
Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

Log in to request an AI analysis.

댓글

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!