2605.03782v1 May 05, 2026 cs.AI

What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

Sikai Bai (Citations: 115, h-index: 5)

Haoxi Li (Citations: 75, h-index: 5)

Qi Hou (Citations: 69, h-index: 3)

Jinxiang Lai (Citations: 185, h-index: 7)

Tao Han (Citations: 21, h-index: 2)

Song Guo (Citations: 2, h-index: 1)

Jianfei Ma, Hong Kong Polytechnic University (Citations: 12, h-index: 2)

Jingcai Guo (Citations: 777, h-index: 18)

Jiewei Zhang (Citations: 286, h-index: 10)

Original Abstract

To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the "known unknown" required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness of GLANCE, and demonstrate that aligning "what the agent thinks" with "what the agent sees" is key to solving complex or sparse agentic tasks.
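
The abstract does not give GLANCE's actual formulas, so the following is only a minimal sketch of the general idea it describes: a curiosity bonus that grows with the mismatch between an embedding of the agent's linguistic prediction and a target network's embedding of the frame actually observed. The cosine-distance measure, the `beta` weighting, the exponential-moving-average (EMA) reading of "evolving target network", and all function names are assumptions made for illustration, not the authors' method.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def curiosity_bonus(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Hypothetical intrinsic signal: cosine distance between the embedding of
    the agent's linguistic prediction and the target network's embedding of the
    observed frame. Larger mismatch gives a larger bonus."""
    return 1.0 - cosine_similarity(predicted, observed)


def shaped_reward(extrinsic: float,
                  predicted: Sequence[float],
                  observed: Sequence[float],
                  beta: float = 0.1) -> float:
    """Task reward plus a weighted curiosity bonus (beta is an assumed knob)."""
    return extrinsic + beta * curiosity_bonus(predicted, observed)


def ema_update(target: Sequence[float],
               online: Sequence[float],
               tau: float = 0.99) -> List[float]:
    """Assumed EMA rule for a slowly evolving target network:
    target <- tau * target + (1 - tau) * online."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]
```

Under these assumptions, a prediction that matches the observed embedding earns no bonus, while an orthogonal one earns the full `beta`; the slow EMA update keeps the visual target stable enough that the bonus does not chase a moving reference.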

0 Citations · 0 Influential · 9 Altmetric · 45.0 Score
