2601.02430v1 Jan 05, 2026 cs.SE

WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

Chenxu Liu

Citations: 49

h-index: 3

Yingjie Fu

Citations: 22

h-index: 3

Wei Yang

Citations: 97

h-index: 6

Ying Zhang

Citations: 18

h-index: 3

Tao Xie

Citations: 9

h-index: 2

Original Abstract

Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging because it requires real-world user requirements, generalizable evaluation metrics that do not rely on ground-truth implementations or test cases, and interpretable evaluation results. To address these challenges, we introduce WebCoderBench, the first real-world-collected, generalizable, and interpretable benchmark for web app generation. WebCoderBench comprises 1,572 real user requirements covering diverse modalities and expression styles that reflect realistic user intentions. It provides 24 fine-grained evaluation metrics across 9 perspectives, combining rule-based methods with the LLM-as-a-judge paradigm for fully automated, objective, and general evaluation. Moreover, WebCoderBench applies human-preference-aligned weights to these metrics to yield interpretable overall scores. Experiments on 12 representative LLMs and 2 LLM-based agents show that no single model dominates across all evaluation metrics, which gives LLM developers an opportunity to improve their models in a targeted manner.
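The abstract does not spell out how the preference-aligned weights turn per-metric results into an overall score. One plausible reading is a normalized weighted mean, sketched below in Python. The metric names, score range, and weight values are hypothetical placeholders chosen for illustration, not values from the paper, and the actual aggregation in WebCoderBench may differ.

```python
from typing import Dict, Mapping


def overall_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted mean of per-metric scores (assumed aggregation, not the paper's)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for metrics: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the weight total normalizes the weights so they sum to 1.
    return sum(scores[name] * w for name, w in weights.items()) / total


def contributions(scores: Mapping[str, float], weights: Mapping[str, float]) -> Dict[str, float]:
    """Per-metric share of the overall score, which keeps the total interpretable."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: scores[name] * w / total for name, w in weights.items()}


# Hypothetical metric names, scores in [0, 1], and weights.
example_scores = {"functionality": 0.8, "visual_quality": 0.6, "accessibility": 0.5}
example_weights = {"functionality": 3.0, "visual_quality": 2.0, "accessibility": 1.0}

if __name__ == "__main__":
    print(overall_score(example_scores, example_weights))
    print(contributions(example_scores, example_weights))
```

Because the per-metric contributions sum to the overall score, a low total can be traced back to the specific metrics responsible, which is the kind of interpretability the abstract describes.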

3 Citations

0 Influential

3 Altmetric

18.0 Score
