2604.27637v1 Apr 30, 2026 cs.AI

Optimization before Evaluation: Evaluation with Unoptimised Prompts Can be Misleading

Yifan Mai

Stanford University

Daniel Dahlmeier

Nicholas Sadjoli

Tim Siefken

Atin Ghosh

Abstract

Current Large Language Model (LLM) evaluation frameworks apply the same static prompt template to every model under evaluation. This differs from common industry practice, where prompt optimization (PO) techniques tailor the prompt to each model to maximize application performance. In this paper, we investigate the effect of PO on LLM evaluations. Our results on public academic and internal industry benchmarks show that PO greatly affects the final ranking of models. This highlights the importance of practitioners performing PO per model when conducting evaluations to choose the best model for a given task.
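The ranking effect described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the model names, the prompt names, and the scores are invented and are not results from the paper, and picking the best prompt from a small fixed candidate set is only a stand-in for a real PO technique.

```python
from typing import Dict, List

# Hypothetical accuracy scores, invented purely for illustration.
# SCORES[model][prompt] is the accuracy of `model` when evaluated with `prompt`.
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"static": 0.70, "variant_1": 0.72, "variant_2": 0.71},
    "model_b": {"static": 0.65, "variant_1": 0.80, "variant_2": 0.68},
    "model_c": {"static": 0.68, "variant_1": 0.66, "variant_2": 0.75},
}


def rank_static(scores: Dict[str, Dict[str, float]], prompt: str = "static") -> List[str]:
    """Rank models (best first) using one shared prompt for every model."""
    return sorted(scores, key=lambda model: scores[model][prompt], reverse=True)


def rank_optimized(scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Rank models (best first) using each model's own best-scoring prompt."""
    return sorted(scores, key=lambda model: max(scores[model].values()), reverse=True)


if __name__ == "__main__":
    print("Shared static prompt:", rank_static(SCORES))
    print("Per-model best prompt:", rank_optimized(SCORES))
```

With these made-up numbers, the shared prompt places model_a first and model_b last, while per-model prompt selection reverses that order. In a real evaluation the prompt would presumably be selected on a held-out development split rather than on the data used for reporting, so that the optimized scores are not optimistically biased.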
