2602.01541v1 Feb 02, 2026 cs.CV

Toward Cognitive Supersensing in Multimodal Large Language Model

Heng Ji (Citations: 352, h-index: 5)
Boyi Li, UIUC (Citations: 63, h-index: 2)
Yifan Shen (Citations: 30, h-index: 3)
Yuanzhe Liu (Citations: 7, h-index: 1)
Yifan Xu (Citations: 5, h-index: 1)
Jiateng Liu (Citations: 43, h-index: 3)
Xinzhuo Li (Citations: 10, h-index: 1)
Zhengyuan Li (Citations: 21, h-index: 1)
Jingyuan Zhu (Citations: 6, h-index: 1)
Yu Zhong (Citations: 15, h-index: 2)
Fangzhou Lan (Citations: 11, h-index: 1)
Jianguo Cao (Citations: 505, h-index: 3)
J. Rehg (Citations: 1,253, h-index: 19)
Ismini Lourentzou (Citations: 1,280, h-index: 18)
Xu Cao (Citations: 0, h-index: 0)

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on this grounded visual latent. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source the CogSense-Bench and our model weights.
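
The abstract says the LVIP head "jointly learns" visual latent embeddings and aligns them with the answer, but it does not state the training objective. As a rough illustration only, the sketch below combines an answer cross-entropy term with a latent alignment term. The cosine-distance alignment, the `latent_weight` coefficient, and every function name here are assumptions made for this example; none of them is taken from the paper.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def joint_loss(answer_logits, answer_index,
               predicted_latents, target_latents,
               latent_weight=0.5):
    """Hypothetical joint objective: answer loss plus weighted latent alignment.

    predicted_latents / target_latents are equal-length sequences of vectors,
    standing in for the "sequences of visual cognitive latent embeddings"
    mentioned in the abstract. The weighting scheme is an assumption.
    """
    text_term = cross_entropy(answer_logits, answer_index)
    latent_term = sum(
        cosine_distance(p, t)
        for p, t in zip(predicted_latents, target_latents)
    ) / len(target_latents)
    return text_term + latent_weight * latent_term


if __name__ == "__main__":
    loss = joint_loss(
        answer_logits=[2.0, 0.5, -1.0],
        answer_index=0,
        predicted_latents=[[1.0, 0.0], [0.6, 0.8]],
        target_latents=[[1.0, 0.0], [0.0, 1.0]],
    )
    print(f"joint loss: {loss:.4f}")
```

In this toy form, the latent term vanishes when the predicted latents point in the same direction as the targets, so the loss reduces to the answer cross-entropy. The reinforcement learning stage mentioned in the abstract is not modeled here.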
