2601.04946v2 Jan 08, 2026 cs.CV

Prototypicality Bias Exposes the Limits of Multimodal Evaluation Metrics

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Subhadeep Roy

Gagan Bhatia

Steffen Eger

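Automatic metrics now play a central role in evaluating text-to-image models and often stand in for human judgment in benchmarking and large-scale filtering. It remains unclear, however, whether these metrics actually prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. This work identifies and analyzes prototypicality bias, a systematic failure mode in multimodal evaluation. We introduce ProtoBias (Prototypical Bias), a controlled contrastive benchmark covering animal, object, and demographic images. The benchmark pairs semantically correct but non-prototypical images with subtly incorrect but prototypical adversarial images, which allows a directional evaluation of whether metrics follow the text semantics or default to prototypes. In our experiments, widely used metrics such as CLIPScore, PickScore, and VQA-based scores frequently misrank these pairs, and even LLM-as-Judge systems are not consistently robust in socially grounded cases. Human evaluators, by contrast, consistently prefer the semantically correct image, and by larger decision margins. Based on these results, we propose ProtoScore, a robust 7-billion-parameter metric. ProtoScore substantially reduces failure rates and suppresses misranking, runs orders of magnitude faster than GPT-5 at inference time, and approaches the robustness of much larger closed-source judges.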

Original Abstract

Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running at orders of magnitude faster than the inference time of GPT-5, approaching the robustness of much larger closed-source judges.
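
The directional evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `ContrastivePair` type, the `misranking_rate` and `mean_margin` helpers, and the convention of counting ties as failures are all assumptions made for illustration. The scoring function is passed in as a plain callable, so any metric (CLIPScore, PickScore, a judge model) could be wrapped behind it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical signature: (prompt, image identifier) -> alignment score.
ScoreFn = Callable[[str, str], float]


@dataclass(frozen=True)
class ContrastivePair:
    """One benchmark item (illustrative structure, not the paper's schema)."""
    prompt: str
    correct_image: str      # semantically correct but non-prototypical
    adversarial_image: str  # subtly incorrect but prototypical


def misranking_rate(pairs: Iterable[ContrastivePair], score: ScoreFn) -> float:
    """Fraction of pairs where the adversarial image scores at least as high
    as the correct one. Counting ties as failures is an assumption."""
    items: List[ContrastivePair] = list(pairs)
    if not items:
        raise ValueError("need at least one pair")
    failures = sum(
        1
        for p in items
        if score(p.prompt, p.adversarial_image) >= score(p.prompt, p.correct_image)
    )
    return failures / len(items)


def mean_margin(pairs: Iterable[ContrastivePair], score: ScoreFn) -> float:
    """Average of score(correct) - score(adversarial); positive values mean
    the metric tends to follow the text rather than the prototype."""
    items: List[ContrastivePair] = list(pairs)
    if not items:
        raise ValueError("need at least one pair")
    total = sum(
        score(p.prompt, p.correct_image) - score(p.prompt, p.adversarial_image)
        for p in items
    )
    return total / len(items)
```

With a real metric, `score` would load the image and return the metric's value for the prompt. A misranking rate near zero together with a clearly positive mean margin would indicate that the metric follows the text semantics on these pairs.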
