2601.06931v1 Jan 11, 2026 cs.CV

Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos

Hao Chen

Jiaqi Zhao

Qiuping Jiang

Xiaojun Chang

Jun Yu

Qiang Huang

Harbin Institute of Technology (Shenzhen)

Original Abstract

Vision-Language Models (VLMs) are increasingly deployed in socially consequential settings, raising concerns about social bias driven by demographic cues. A central challenge in measuring such social bias is attribution under visual confounding: real-world images entangle race and gender with correlated factors such as background and clothing, obscuring attribution. We propose a face-only counterfactual evaluation paradigm that isolates demographic effects while preserving real-image realism. Starting from real photographs, we generate counterfactual variants by editing only facial attributes related to race and gender, keeping all other visual factors fixed. Based on this paradigm, we construct FOCUS, a dataset of 480 scene-matched counterfactual images across six occupations and ten demographic groups, and propose REFLECT, a benchmark comprising three decision-oriented tasks: two-alternative forced choice, multiple-choice socioeconomic inference, and numeric salary recommendation. Experiments on five state-of-the-art VLMs reveal that demographic disparities persist under strict visual control and vary substantially across task formulations. These findings underscore the necessity of controlled, counterfactual audits and highlight task design as a critical factor in evaluating social bias in multimodal models.
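
The abstract does not say how REFLECT scores disparities, so the following is only an illustrative sketch of one way scene-matched counterfactuals could be compared on a numeric task such as salary recommendation. It assumes each model output is stored as a (scene, demographic group, salary) record. The function name `group_gaps`, the record layout, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def group_gaps(records):
    """Average deviation of each group's salary from its scene's mean.

    records: iterable of (scene_id, group, salary) tuples, where all
    variants of one scene differ only in the edited face (assumption).
    """
    # Collect the counterfactual variants that share a scene.
    by_scene = defaultdict(list)
    for scene_id, group, salary in records:
        by_scene[scene_id].append((group, salary))

    # Compare each variant against the mean of its own scene, so that
    # scene-level factors (background, clothing, occupation) cancel out.
    deviations = defaultdict(list)
    for variants in by_scene.values():
        scene_mean = mean(salary for _, salary in variants)
        for group, salary in variants:
            deviations[group].append(salary - scene_mean)

    return {group: mean(values) for group, values in deviations.items()}


# Toy data, invented purely for illustration.
records = [
    ("scene_1", "group_a", 52000), ("scene_1", "group_b", 48000),
    ("scene_2", "group_a", 61000), ("scene_2", "group_b", 59000),
]
gaps = group_gaps(records)
print(gaps)
```

Because every comparison is made within a scene, a nonzero gap in this sketch can only come from the edited facial attributes or from model noise, which is the attribution property the face-only paradigm is designed to provide.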
