2602.23898v1 Feb 27, 2026 cs.CV

Ref-Adv: Exploring the Visual Reasoning of Multimodal Large Language Models in Referring Expression Tasks

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Kuo Yang (Citations: 114, h-index: 3)

Lin Ju (Citations: 39, h-index: 3)

Handong Zhao (Citations: 58, h-index: 3)

Yitian Zhang (Citations: 43, h-index: 3)

Yizhou Wang (Citations: 480, h-index: 6)

Huimin Zeng (Citations: 26, h-index: 3)

Jianglin Lu (Citations: 41, h-index: 3)

Yun Fu (Citations: 81, h-index: 3)

Qihua Dong, Northeastern University (Citations: 169, h-index: 4)

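Referring Expression Comprehension (REC) links language to region-level visual perception. Results on the standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have improved rapidly with advances in multimodal LLMs, yet these benchmarks remain weak tests of visual reasoning and grounding: (i) many expressions are very short and demand little reasoning; (ii) images often contain few distractors, so the target is easy to find; and (iii) redundant descriptors allow shortcut solutions that bypass genuine text understanding and visual reasoning. To address these limitations, this work introduces Ref-Adv, a new REC benchmark. Ref-Adv suppresses shortcuts by pairing linguistically complex expressions with only the information needed to uniquely identify the target. The dataset contains referring expressions on real images, is curated with hard distractors, and is annotated with reasoning facets, including negation. Ablation experiments (word-order perturbation and descriptor deletion) show that solving Ref-Adv requires reasoning beyond simple cues. A broad set of recent multimodal LLMs is also evaluated on Ref-Adv: models that perform well on RefCOCO, RefCOCO+, and RefCOCOg drop markedly on Ref-Adv, which indicates reliance on shortcuts and gaps in visual reasoning and grounding. The paper provides a detailed failure analysis and aims for Ref-Adv to guide future research on visual reasoning and grounding in multimodal LLMs.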
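The two ablations named in the summary, word-order perturbation and descriptor deletion, can be illustrated with a minimal sketch. This is not the authors' procedure: the helper names `shuffle_words` and `delete_descriptor` and the example expression are hypothetical, and the paper's actual perturbation rules are not given on this page.

```python
import random


def shuffle_words(expression: str, seed: int = 0) -> str:
    """Return the expression with its words in random order (same bag of words).

    Hypothetical helper: a model that scores equally well on the shuffled
    text is likely relying on keywords rather than sentence structure.
    """
    words = expression.split()
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    rng.shuffle(words)
    return " ".join(words)


def delete_descriptor(expression: str, descriptor: str) -> str:
    """Remove one descriptor phrase and normalize the remaining whitespace.

    Hypothetical helper: if the target is still found after the deletion,
    the removed descriptor was redundant for that image.
    """
    return " ".join(expression.replace(descriptor, " ").split())


# Made-up expression for illustration only; it is not taken from Ref-Adv.
expr = "the cup left of the red kettle that is not on the tray"
shuffled = shuffle_words(expr)
shortened = delete_descriptor(expr, "left of the red kettle")
print(shuffled)
print(shortened)
```

Comparing a model's accuracy on the original, shuffled, and shortened expressions gives a rough signal of how much it depends on word order and on each descriptor.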

Original Abstract

Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
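The abstract does not state which metric Ref-Adv uses. As an assumption, the sketch below applies the convention common on RefCOCO-style benchmarks: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The function names and the `(x1, y1, x2, y2)` box layout are illustrative choices, not taken from the paper.

```python
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes have no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: List[Box], targets: List[Box],
                    threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if not targets:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)


# Toy boxes for illustration only: one exact match, one poor overlap.
preds = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
targets = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
print(accuracy_at_iou(preds, targets))
```

Running the same scoring function over an established benchmark and over a harder one makes a performance drop of the kind the abstract reports directly comparable.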

3 Citations

0 Influential

3 Altmetric

18.0 Score
