2603.02748v1 Mar 03, 2026 cs.CV

iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

Yaqian Li (Citations: 86, h-index: 5)
Zidan Wang (Citations: 0, h-index: 0)
Shuoxi Zhang (Citations: 34, h-index: 3)
Zihao Bo (Citations: 6, h-index: 1)
Rinyoichi Takezoe (Citations: 36, h-index: 2)
Kaiwen Long (Citations: 61, h-index: 1)
Kun He (Citations: 26, h-index: 2)
HanZpeng Liu (Citations: 0, h-index: 0)

Abstract

Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
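
The affine modulation named in the abstract can be illustrated with a generic AdaLN step. The sketch below is not the authors' implementation: it assumes the common formulation in which an instruction embedding is linearly projected to a per-channel scale and shift that are applied to layer-normalized visual tokens. The class name `AdaLNModulator`, the zero initialization, and all dimensions are hypothetical choices made for illustration only.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaLNModulator:
    """Hypothetical conditioning step: instruction embedding -> (scale, shift)."""

    def __init__(self, cond_dim, feat_dim):
        # Assumption: zero init, so the modulation starts as plain LayerNorm
        # and does not disturb the visual features before training.
        self.W = np.zeros((cond_dim, 2 * feat_dim))
        self.b = np.zeros(2 * feat_dim)

    def __call__(self, tokens, instruction_emb):
        # tokens: (num_tokens, feat_dim); instruction_emb: (cond_dim,)
        scale, shift = np.split(instruction_emb @ self.W + self.b, 2)
        # Affine modulation, broadcast over all tokens.
        return layer_norm(tokens) * (1.0 + scale) + shift


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.normal(size=(4, 8))   # stand-in for frozen encoder output
    instruction = rng.normal(size=3)          # stand-in for a text embedding
    modulator = AdaLNModulator(cond_dim=3, feat_dim=8)
    modulated = modulator(visual_tokens, instruction)
    print(modulated.shape)
```

Under these assumptions the frozen features stay untouched and only the projection weights would be trained, which is one plausible way to read the abstract's claim that pre-trained visual priors are preserved. How the two branches are actually combined is not stated in the abstract and is left out here.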
