2601.21448v2 Jan 29, 2026 cs.AI

ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design

Zhongkai Yu (Citations: 26, h-index: 3)
Yufei Ding (Citations: 28, h-index: 3)
Chenyang Zhou (Citations: 7, h-index: 2)
Hejia Zhang (Citations: 167, h-index: 5)
Haotian Ye (Citations: 6, h-index: 1)
Jishen Zhao (Citations: 37, h-index: 3)
Junxia Cui (Citations: 19, h-index: 3)
Zaifeng Pan (Citations: 103, h-index: 5)
Yichen Lin (Citations: 7, h-index: 2)

Abstract

While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, and so fail to reflect LLMs' performance in real industrial workflows. To address this gap, we propose a comprehensive benchmark for AI-aided chip design that rigorously evaluates LLMs across three critical tasks: Verilog generation, debugging, and reference model generation. Our benchmark features 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Evaluation results reveal substantial performance gaps: the state-of-the-art Claude-4.5-opus achieves only 30.74% on Verilog generation and 13.33% on Python reference model generation, which shows how much harder these tasks are than existing saturated benchmarks, where SOTA models achieve pass rates above 95%. Additionally, to help improve LLM reference model generation, we provide an automated toolbox for generating high-quality training data, facilitating future research in this underexplored domain. Our code is available at https://github.com/zhongkaiyu/ChipBench.git.

