2601.21070v1 Jan 28, 2026 cs.SE

Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering

Xiaochang Li

Citations: 44

h-index: 3

Huajie Shao

Citations: 64

h-index: 5

Dipin Khati

Citations: 39

h-index: 3

Daniel Rodríguez-Cárdenas

Citations: 89

h-index: 6

Denys Poshyvanyk

Citations: 182

h-index: 9

Marcos Macedo

Citations: 110

h-index: 4

A. Mastropaolo

Citations: 1,396

h-index: 14

Yuan Tian

Citations: 74

h-index: 3

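Large language models (LLMs) for code generation and analysis are advancing rapidly, but our ability to evaluate them still falls short. Current benchmarks focus on narrow tasks and single metrics, which hides critical gaps in robustness, interpretability, fairness, efficiency, and real-world usability. They are further hampered by inconsistent data-handling practices, limited software engineering context, and widespread data contamination. To understand these problems and point the way to improvement, we combined an in-depth survey of existing benchmarks with insights from a dedicated community workshop. We identified three core barriers to reliable evaluation: the absence of datasets rich in software engineering knowledge, overreliance on machine-learning-centric metrics, and the lack of standardized, reproducible data pipelines. Building on these findings, we introduce BEHELM, a holistic benchmarking infrastructure that integrates software-scenario specification with multi-metric evaluation. BEHELM provides a systematic way to evaluate models across tasks, languages, input and output granularities, and key quality dimensions. Our goal is to reduce the effort needed to build benchmarks while enabling fair, realistic, and future-proof evaluation of LLMs in software engineering.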

Original Abstract

Large language models for code are advancing fast, yet our ability to evaluate them lags behind. Current benchmarks focus on narrow tasks and single metrics, which hide critical gaps in robustness, interpretability, fairness, efficiency, and real-world usability. They also suffer from inconsistent data engineering practices, limited software engineering context, and widespread contamination issues. To understand these problems and chart a path forward, we combined an in-depth survey of existing benchmarks with insights gathered from a dedicated community workshop. We identified three core barriers to reliable evaluation: the absence of software-engineering-rich datasets, overreliance on ML-centric metrics, and the lack of standardized, reproducible data pipelines. Building on these findings, we introduce BEHELM, a holistic benchmarking infrastructure that unifies software-scenario specification with multi-metric evaluation. BEHELM provides a structured way to assess models across tasks, languages, input and output granularities, and key quality dimensions. Our goal is to reduce the overhead currently required to construct benchmarks while enabling a fair, realistic, and future-proof assessment of LLMs in software engineering.
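To make the idea of pairing a scenario specification with multi-metric evaluation concrete, here is a minimal Python sketch. It is not BEHELM's actual API, which the abstract does not describe. The `Scenario` class, the helper functions, and both metrics (`exact_match`, `conciseness`) are hypothetical placeholders chosen only to illustrate the general pattern: enumerate scenarios along several axes, then score each output with more than one metric.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List

# Hypothetical scenario record; the four axes mirror the abstract
# (task, language, input granularity, output granularity).
@dataclass(frozen=True)
class Scenario:
    task: str
    language: str
    input_granularity: str
    output_granularity: str


def enumerate_scenarios(
    tasks: List[str],
    languages: List[str],
    input_granularities: List[str],
    output_granularities: List[str],
) -> List[Scenario]:
    """Build every combination of the four axes."""
    combos = product(tasks, languages, input_granularities, output_granularities)
    return [Scenario(*combo) for combo in combos]


# A metric maps (prediction, reference) to a score in [0, 1].
Metric = Callable[[str, str], float]


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the texts match after trimming whitespace, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def conciseness(prediction: str, reference: str) -> float:
    """Illustrative placeholder: penalize outputs longer than the reference."""
    if not prediction:
        return 0.0
    return min(len(reference) / len(prediction), 1.0)


def evaluate(prediction: str, reference: str, metrics: Dict[str, Metric]) -> Dict[str, float]:
    """Score one prediction under every metric instead of a single one."""
    return {name: fn(prediction, reference) for name, fn in metrics.items()}


METRICS: Dict[str, Metric] = {"exact_match": exact_match, "conciseness": conciseness}
```

Calling `enumerate_scenarios` with two tasks, two languages, one input granularity, and two output granularities yields eight scenarios, and `evaluate` returns one score per metric for each model output, so a weakness on one quality dimension is not hidden behind a single aggregate number.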

0 Citations

0 Influential

7 Altmetric

35.0 Score
