2604.02729v1 Apr 03, 2026 cs.SE

IndustryCode: A Benchmark for Industry Code Generation

Shaobo Wang (Citations: 402, h-index: 9)
Linfeng Zhang (Citations: 56, h-index: 4)
Bing Zhao (Citations: 26, h-index: 3)
Zhaoxia Wang (Citations: 58, h-index: 4)
Pu Zeng (Citations: 14, h-index: 1)
Zhixu Duan (Citations: 0, h-index: 0)
Liang Feng (Citations: 16, h-index: 3)
Cunxiang Wang (Citations: 34, h-index: 2)
Jinghang Wang (Citations: 139, h-index: 6)
Hu Wei (Citations: 1, h-index: 1)

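Code generation and comprehension by large language models (LLMs) have emerged as a core driver of industrial intelligence and decision optimization, and are widely used in fields such as finance, automation, and aerospace. Recent advances in LLMs have shown remarkable potential in general code generation, but existing benchmarks are mostly confined to a single domain and language. As a result, they are limited in how well they can evaluate the generalization ability needed for real industrial applications or the coding proficiency that complex industrial scenarios demand. To close this gap, we introduce IndustryCode, the first comprehensive benchmark spanning multiple industrial domains and programming languages. IndustryCode consists of 579 sub-problems derived from 125 main industrial problems, each accompanied by a rigorous description and test cases. The benchmark covers a broad range of fields, including finance, automation, aerospace, and remote sensing, and supports programming languages such as MATLAB, Python, C++, and Stata. In the evaluation, the best-performing model, Claude 4.5 Opus, achieved 68.1% accuracy on sub-problems and 42.5% on main problems. The benchmark dataset and automated evaluation code will be released once the paper is accepted.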

Original Abstract

Code generation and comprehension by Large Language Models (LLMs) have emerged as core drivers of industrial intelligence and decision optimization, finding widespread application in fields such as finance, automation, and aerospace. Although recent advancements have demonstrated the remarkable potential of LLMs in general code generation, existing benchmarks are mainly confined to single domains and languages. Consequently, they fail to effectively evaluate the generalization capabilities required for real-world industrial applications or to reflect the coding proficiency demanded by complex industrial scenarios. To bridge this gap, we introduce IndustryCode, the first comprehensive benchmark designed to span multiple industrial domains and programming languages. IndustryCode comprises 579 sub-problems derived from 125 primary industrial challenges, accompanied by rigorous problem descriptions and test cases. It covers a wide range of fields, including finance, automation, aerospace, and remote sensing, and incorporates diverse programming languages such as MATLAB, Python, C++, and Stata. In our evaluation, the top-performing model, Claude 4.5 Opus, achieved an overall accuracy of 68.1% on sub-problems and 42.5% on main problems. The benchmark dataset and automated evaluation code will be made publicly available upon acceptance.
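
The abstract reports accuracy at two levels (sub-problems and main problems) but does not say how a main problem is scored. The sketch below shows one plausible way the two numbers could relate, under the assumption that a main problem counts as solved only when every one of its sub-problems passes. That rule, the function name `score`, and the toy data are all hypothetical illustrations, not the authors' evaluation code.

```python
from collections import defaultdict


def score(results):
    """Compute (sub-problem accuracy, main-problem accuracy).

    `results` is an iterable of (main_id, sub_id, passed) tuples.
    Assumption (not stated in the abstract): a main problem is solved
    only if all of its sub-problems pass.
    """
    by_main = defaultdict(list)
    for main_id, _sub_id, passed in results:
        by_main[main_id].append(bool(passed))

    total_subs = sum(len(flags) for flags in by_main.values())
    if total_subs == 0:
        raise ValueError("no results to score")

    # Fraction of individual sub-problems that passed.
    sub_acc = sum(sum(flags) for flags in by_main.values()) / total_subs
    # Fraction of main problems whose sub-problems all passed.
    main_acc = sum(all(flags) for flags in by_main.values()) / len(by_main)
    return sub_acc, main_acc


# Toy data, invented for illustration: two main problems, five sub-problems.
toy_results = [
    ("P1", "a", True), ("P1", "b", True),
    ("P2", "a", True), ("P2", "b", False), ("P2", "c", True),
]
sub_acc, main_acc = score(toy_results)
print(f"sub-problem accuracy: {sub_acc:.1%}, main-problem accuracy: {main_acc:.1%}")
```

Under this all-or-nothing assumption, main-problem accuracy can never exceed sub-problem accuracy by construction of the toy case shown: a single failed sub-problem sinks its whole main problem, which is consistent with the reported gap between 68.1% and 42.5%.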

Citations: 1 · Influential citations: 0 · Altmetric: 4.5 · Score: 23.5
