2602.18729v1 Feb 21, 2026 cs.CV

MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

Shivank Garg

Citations: 37

h-index: 4

Tangatar Madi

Citations: 0

h-index: 0

A.C. Swaminathan

Citations: 1

h-index: 1

N. Anh

Citations: 0

h-index: 0

Vasu Sharma

Citations: 821

h-index: 7

Kevin Zhu

Citations: 252

h-index: 4

Sagarika Banerjee

Citations: 0

h-index: 0

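Fine-grained image-caption alignment is critical for vision-language models (VLMs), particularly in socially consequential contexts such as identifying real-world risk scenarios or distinguishing cultural proxies. In these contexts, correct interpretation depends on subtle visual or linguistic cues, and minor misinterpretations can have serious real-world consequences. In this paper, we propose MiSCHiEF, a set of two benchmarking datasets built on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks that require fine-grained discrimination between paired images and captions. In both datasets, each sample contains two minimally differing captions and the corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario; in MiC, they depict cultural proxies from two distinct cultural contexts. The results show that models are generally better at confirming a correct image-caption pair than at rejecting an incorrect one. Models also achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image than when performing the converse task. Overall, these results highlight the persistent modality misalignment challenges in current VLMs and underscore the difficulty of the precise cross-modal grounding required by applications that depend on subtle semantic and visual distinctions.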

Original Abstract

Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at confirming the correct image-caption pair than rejecting incorrect ones. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image, compared to the converse task. The results, overall, highlight persistent modality misalignment challenges in current VLMs, underscoring the difficulty of precise cross-modal grounding required for applications with subtle semantic and visual distinctions.
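The four comparisons described in the abstract (confirming correct pairs, rejecting incorrect pairs, choosing a caption for an image, and choosing an image for a caption) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The `MinimalPair` container, the `Scorer` signature, the `evaluate` function, and the `threshold` parameter are all hypothetical names introduced here, and the sketch assumes a model exposes a numeric match score for an (image, caption) pair, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class MinimalPair:
    """One benchmark sample (hypothetical layout): two minimally differing
    images, each with its own minimally differing caption."""
    image_a: str
    caption_a: str
    image_b: str
    caption_b: str


# Assumed interface: returns a match score, higher means a better match.
Scorer = Callable[[str, str], float]


def evaluate(
    pairs: Sequence[MinimalPair],
    score: Scorer,
    threshold: float = 0.5,  # assumed cut-off for a yes/no match decision
) -> Dict[str, float]:
    """Compute four accuracies over a set of minimal pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")

    caption_hits = image_hits = confirm_hits = reject_hits = 0
    for p in pairs:
        s_aa = score(p.image_a, p.caption_a)  # correct pairing
        s_bb = score(p.image_b, p.caption_b)  # correct pairing
        s_ab = score(p.image_a, p.caption_b)  # incorrect pairing
        s_ba = score(p.image_b, p.caption_a)  # incorrect pairing

        # Caption selection: fix an image, compare its two candidate captions.
        caption_hits += int(s_aa > s_ab) + int(s_bb > s_ba)
        # Image selection: fix a caption, compare its two candidate images.
        image_hits += int(s_aa > s_ba) + int(s_bb > s_ab)
        # Confirmation: correct pairings should clear the threshold.
        confirm_hits += int(s_aa >= threshold) + int(s_bb >= threshold)
        # Rejection: incorrect pairings should fall below the threshold.
        reject_hits += int(s_ab < threshold) + int(s_ba < threshold)

    total = 2 * len(pairs)  # each sample contributes two decisions per task
    return {
        "caption_selection": caption_hits / total,
        "image_selection": image_hits / total,
        "confirm_correct": confirm_hits / total,
        "reject_incorrect": reject_hits / total,
    }
```

Under these assumptions, a model that scores every pairing highly would reach a high `confirm_correct` value and a low `reject_incorrect` value, which is the kind of asymmetry the abstract reports.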

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
