2601.17037v1 Jan 20, 2026 cs.CV

AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs

Aahana Basappa

Citations: 0

h-index: 0

P. Goel

Citations: 142

h-index: 4

Anusri Karra

Citations: 0

h-index: 0

Anish Karra

Citations: 0

h-index: 0

A. Gilmore

Citations: 7

h-index: 1

Kevin Zhu

Citations: 252

h-index: 4

Original Abstract

We investigated the visual reasoning limitations of both multimodal large language models (MLLMs) and image generation models (IGMs) by creating a novel benchmark that systematically compares failure modes across image-to-text and text-to-image tasks, enabling cross-modal evaluation of visual understanding. Despite rapid progress in machine learning, vision language models (VLMs) still fail to understand or generate basic visual concepts such as object orientation, quantity, or spatial relationships, which highlights gaps in elementary visual reasoning. By adapting MMVP benchmark questions into explicit and implicit prompts, we created AMVICC, a novel benchmark for profiling failure modes across modalities. After testing 11 MLLMs and 3 IGMs in nine categories of visual reasoning, our results show that failure modes are often shared between models and modalities, but certain failures are model-specific and modality-specific, and this can potentially be attributed to various factors. IGMs consistently struggled to manipulate specific visual components in response to prompts, especially explicit prompts, suggesting poor control over fine-grained visual attributes. Our findings apply most directly to the evaluation of existing state-of-the-art models on structured visual reasoning tasks. This work lays the foundation for future cross-modal alignment studies, offering a framework to probe whether generation and interpretation failures stem from shared limitations, with the aim of guiding future improvements in unified vision-language modeling.
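The kind of cross-modal comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. The record layout, model names, category labels, and the idea that one item ID pairs an image-to-text question with its text-to-image prompt are all assumptions made for illustration, and the data is made up.

```python
from collections import defaultdict

# Hypothetical records: (model, modality, category, item_id, passed).
# All names and outcomes are illustrative, not results from the paper.
records = [
    ("mllm_a", "image-to-text", "orientation", 1, False),
    ("mllm_a", "image-to-text", "quantity", 2, True),
    ("mllm_b", "image-to-text", "orientation", 1, False),
    ("igm_x", "text-to-image", "orientation", 1, False),
    ("igm_x", "text-to-image", "quantity", 2, False),
]


def failure_rates(records):
    """Return the failure rate for each (modality, category) pair."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for _model, modality, category, _item, passed in records:
        key = (modality, category)
        totals[key] += 1
        if not passed:
            fails[key] += 1
    return {key: fails[key] / totals[key] for key in totals}


def shared_failures(records):
    """Return (category, item_id) pairs failed in every modality present.

    Assumes the same item_id refers to paired prompts across modalities.
    """
    modalities = set()
    failed_in = defaultdict(set)
    for _model, modality, category, item, passed in records:
        modalities.add(modality)
        if not passed:
            failed_in[(category, item)].add(modality)
    return sorted(key for key, mods in failed_in.items() if mods == modalities)


rates = failure_rates(records)
shared = shared_failures(records)
```

In this toy data, the orientation item fails in both modalities and so counts as a shared failure, while the quantity item fails only in text-to-image generation and would be treated as modality-specific.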
