2604.14137v1 Apr 15, 2026 cs.CL

From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Eliya Habba

Citations: 61

h-index: 5

Yonatan Belinkov

Citations: 272

h-index: 9

Itay Itzhak

Technion, Hebrew University of Jerusalem

Citations: 191

h-index: 6

Gabriel Stanovsky

Citations: 48

h-index: 4

Original Abstract

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal, experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
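
The two-part formulation in the abstract (personalizing what is tested and how responses are judged) can be illustrated with a toy sketch. This is not the authors' pipeline: the `UserProfile` fields, the weighted-sum judging rule, the model names, and every score below are assumptions invented for illustration, and the abstract does not specify how prompts are generated or how criteria are combined.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical user description; field names are illustrative only."""
    context: str                # what the user works on
    weights: dict[str, float]   # how much each subjective criterion matters


def personalize_prompt(base_task: str, user: UserProfile) -> str:
    """Part 1 (assumed form): tailor WHAT is tested by attaching user context."""
    return f"{base_task}\n\nContext: {user.context}"


def preferred_model(scores: dict[str, dict[str, float]], user: UserProfile) -> str:
    """Part 2 (assumed form): tailor HOW responses are judged.

    Uses a simple weighted sum over per-criterion scores; criteria the
    user did not weight count as zero.
    """
    def weighted(model: str) -> float:
        return sum(user.weights.get(c, 0.0) * s for c, s in scores[model].items())

    return max(scores, key=weighted)


# Made-up per-criterion scores (0 to 1) for two placeholder models.
toy_scores = {
    "model_a": {"correctness": 0.9, "readability": 0.4},
    "model_b": {"correctness": 0.7, "readability": 0.9},
}

# Two hypothetical users who care about different things.
researcher = UserProfile(
    "prototyping numerical code", {"correctness": 1.0, "readability": 0.1}
)
beginner = UserProfile(
    "learning Python", {"correctness": 0.5, "readability": 1.0}
)

if __name__ == "__main__":
    print(personalize_prompt("Write a sort function.", beginner))
    for user in (researcher, beginner):
        print(user.context, "->", preferred_model(toy_scores, user))
```

With these invented numbers, the two users prefer different models even though the underlying per-criterion scores are identical, which is the kind of preference shift the abstract reports when personalization is applied.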
