2604.17768v1 Apr 20, 2026 cs.AI

When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias

Dan Roth (Citations: 29, h-index: 3)

Xiaohan Zou (Citations: 14, h-index: 2)

Mohammad Safarzadeh (Citations: 9, h-index: 2)

R. Sridhar (Citations: 9, h-index: 2)

Abstract

The reliability of VLM-as-a-Judge is critical for the automatic evaluation of vision-language models (VLMs). Despite recent progress, our analysis reveals that VLM judges often pay limited attention to the image when making decisions. Instead, they often blindly favor the more informative answer, even when they can recognize that it conflicts with the image content. We call this problem informativeness bias; it significantly undermines judge reliability. To address it, we propose BIRCH (Balanced Informativeness and CoRrectness with a Truthful AnCHor), a judging paradigm that first corrects inconsistencies with the image content in the candidate answers and then compares the answers against this corrected version. This shifts the judge's focus from informativeness to image-grounded correctness. Experiments on multiple models and benchmarks show that BIRCH reduces informativeness bias by up to 17%, resulting in performance gains of up to 9.8%. Our work reveals an overlooked but fundamental flaw in current VLM-as-a-Judge systems and highlights the need for more principled designs.
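
The two-stage idea in the abstract (correct first, then compare against the corrected version) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `VLM` callable interface, the function names `build_anchor` and `judge_with_anchor`, the prompt wording, and the choice to build a single anchor from both candidates are all hypothetical, since the abstract does not specify them.

```python
from typing import Callable

# Hypothetical interface (an assumption): a VLM is any callable that takes
# raw image bytes and a text prompt and returns the model's text reply.
VLM = Callable[[bytes, str], str]


def build_anchor(vlm: VLM, image: bytes, question: str,
                 answer_a: str, answer_b: str) -> str:
    """Stage 1: ask the model for a version of the candidates corrected against the image."""
    # The prompt wording is illustrative only; it is not taken from the paper.
    prompt = (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Rewrite these into one answer, fixing every statement "
        "that conflicts with the image."
    )
    return vlm(image, prompt).strip()


def judge_with_anchor(vlm: VLM, image: bytes, question: str,
                      answer_a: str, answer_b: str) -> str:
    """Stage 2: compare both candidates against the corrected anchor; returns 'A' or 'B'."""
    anchor = build_anchor(vlm, image, question, answer_a, answer_b)
    prompt = (
        f"Question: {question}\n"
        f"Reference (corrected against the image): {anchor}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer agrees better with the reference? Reply with A or B."
    )
    # Normalize the reply: trim whitespace and a trailing period, then uppercase.
    verdict = vlm(image, prompt).strip().strip(".").upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"Unparseable verdict: {verdict!r}")
    return verdict
```

Any real model client can be wrapped to fit the assumed `VLM` signature. The point of the sketch is only that the final comparison is made against a reference corrected against the image, so a longer answer gains nothing from extra detail that the reference contradicts.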
