2604.22884v1 Apr 24, 2026 cs.CV

Can Multimodal Large Language Models Truly Understand Small Objects?

Jingqi Ye

Fujun Han

Junan Chen

Xintong Zhu

Xuanjie Mao

Tao Chen

Peng Ye


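Multimodal large language models (MLLMs) have shown promise across diverse understanding tasks, such as image and video analysis and math and physics olympiad problems. However, their ability on Small Object Understanding (SOU) tasks remains largely unexplored. To close this gap, this work introduces SOUBench, the first comprehensive benchmark for evaluating the small object understanding capability of existing MLLMs. Specifically, the authors first design an effective, automated visual question-answer generation strategy and use it to build a new evaluation dataset, SOU-VQA, which comprises 18,204 question-answer pairs, six related sub-tasks, and three main scenarios (Driving, Aerial, and Underwater). They then conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal the models' weaknesses in small object understanding. In addition, to improve the SOU capabilities of MLLMs, they develop SOU-Train, a multimodal training dataset of 11,226 question-answer pairs. By fine-tuning the latest MLLM in a supervised manner, they show that SOU-Train effectively improves its ability to understand small objects. The experimental results indicate that SOUBench, together with the SOU-VQA and SOU-Train datasets, provides an important empirical foundation for developing models with stronger small object understanding. Datasets and code: https://github.com/Hanfj-X/SOU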

Original Abstract

Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, they remain blank and unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first and comprehensive benchmark for exploring the small objects understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset, with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation on 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervising fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance the latest MLLM's ability to understand small objects. Comprehensive experimental results demonstrate that, the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation to the community for further developing models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.
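As a rough illustration of how a benchmark organized by scenario and sub-task might be scored, the sketch below aggregates exact-match accuracy per (scenario, sub-task) pair. It is not the authors' evaluation code: the record fields (`id`, `scenario`, `subtask`, `answer`), the placeholder sub-task names, the function `accuracy_by_group`, and the case-insensitive exact-match rule are all assumptions made for this example. Only the three scenario names come from the abstract.

```python
from collections import defaultdict


def accuracy_by_group(records, predictions):
    """Return {(scenario, subtask): accuracy} using exact-match scoring.

    records: list of dicts with hypothetical keys id, scenario, subtask, answer.
    predictions: dict mapping record id to the model's answer string.
    A missing prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        key = (rec["scenario"], rec["subtask"])
        total[key] += 1
        pred = predictions.get(rec["id"], "")
        # Assumed scoring rule: case-insensitive exact match after trimming.
        if pred.strip().upper() == rec["answer"].strip().upper():
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy data; sub-task names are placeholders, not the paper's six sub-tasks.
records = [
    {"id": "q1", "scenario": "Driving", "subtask": "task_1", "answer": "A"},
    {"id": "q2", "scenario": "Driving", "subtask": "task_1", "answer": "B"},
    {"id": "q3", "scenario": "Aerial", "subtask": "task_2", "answer": "C"},
    {"id": "q4", "scenario": "Underwater", "subtask": "task_1", "answer": "D"},
]
predictions = {"q1": "a", "q2": "C", "q3": " C "}  # q4 left unanswered

scores = accuracy_by_group(records, predictions)
for (scenario, subtask), acc in sorted(scores.items()):
    print(f"{scenario:<11} {subtask:<7} {acc:.2f}")
```

Reporting accuracy per group rather than as one pooled number keeps a weak scenario or sub-task from being hidden by a strong one, which matters when the goal is to expose specific weaknesses.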

