2601.21448v1 Jan 29, 2026 cs.AI

ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design

Zhongkai Yu (citations: 18, h-index: 1)
Yufei Ding (citations: 19, h-index: 1)
Chenyang Zhou (citations: 2, h-index: 1)
Hejia Zhang (citations: 133, h-index: 4)
Haotian Ye (citations: 3, h-index: 1)
Jishen Zhao (citations: 13, h-index: 2)
Junxia Cui (citations: 13, h-index: 2)
Zaifeng Pan (citations: 67, h-index: 4)
Yichen Lin (citations: 3, h-index: 1)

Abstract

While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs' performance in real industrial workflows. To address this gap, we propose a comprehensive benchmark for AI-aided chip design that rigorously evaluates LLMs across three critical tasks: Verilog generation, debugging, and reference model generation. Our benchmark features 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Evaluation results reveal substantial performance gaps, with state-of-the-art Claude-4.5-opus achieving only 30.74% on Verilog generation and 13.33% on Python reference model generation, demonstrating significant challenges compared to existing saturated benchmarks where SOTA models achieve over 95% pass rates. Additionally, to help enhance LLM reference model generation, we provide an automated toolbox for high-quality training data generation, facilitating future research in this underexplored domain. Our code is available at https://github.com/zhongkaiyu/ChipBench.git.

Citations: 1 · Influential citations: 0 · Altmetric: 22 · Score: 111.0
