2605.02378v1 May 04, 2026 cs.CV

Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

Jiahong Yan (Citations: 17, h-index: 2)
Gang Liu (Citations: 37, h-index: 3)
Jun Chen (Citations: 761, h-index: 7)
Yanghua Xiao (Citations: 138, h-index: 7)
Yuyan Chen (Citations: 386, h-index: 10)
Qian Wang (Citations: 7, h-index: 1)
Haoyu Wang (Citations: 25, h-index: 2)
Haonan Wang (Citations: 1,464, h-index: 13)

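In-context learning (ICL) lets large models adapt to a given task from a few examples, but extending it to vision-language models (VLMs) remains difficult. This study identifies the root cause of that difficulty as an inductive gap: models often arrive at the correct answer through flawed reasoning, yet struggle to extract consistent rules from the demonstrations provided. The gap is aggravated by two visual-level problems. First, redundant visual tokens that obscure textual cues make up an excessively large share of the input; second, the model's attention distribution is skewed toward the first image, so later context is neglected. To address these problems, the study proposes a framework that restructures multimodal ICL as a systematic inductive-deductive process. The framework comprises a similarity-based visual token compression module that removes redundant patches, a dynamic attention rebalancing mechanism that spreads attention evenly across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. In addition, an auxiliary learning pipeline that combines supervised fine-tuning with reinforcement learning using verifiable rewards trains the model to cite information faithfully and filter out noise. In experiments on eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection, the proposed framework showed consistent and significant improvements over standard ICL baselines on several open-source VLMs, suggesting that models can be given genuine inductive capabilities in multimodal settings.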

Original Abstract

In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning, while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.
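The two visual-level components named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how either module is implemented, so everything below is an assumption made for illustration: the greedy cosine-similarity filter, the equal-mass rescaling rule, the 0.9 threshold, and the function names `compress_tokens` and `rebalance_attention` are hypothetical and should not be read as the authors' method.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compress_tokens(tokens: List[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Assumed greedy filter: keep a patch only if its similarity to every
    patch already kept is below `threshold`.

    Returns the indices of the retained patches in their original order.
    """
    kept: List[int] = []
    for i, tok in enumerate(tokens):
        if all(cosine(tok, tokens[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def rebalance_attention(weights: List[float], image_ids: List[int]) -> List[float]:
    """Assumed rebalancing rule: rescale attention so each image gets the same
    total mass, keeping the relative weights inside each image.

    Assumes every image has nonzero attention mass; an image whose mass is
    zero is left unchanged.
    """
    totals: Dict[int, float] = {}
    for w, img in zip(weights, image_ids):
        totals[img] = totals.get(img, 0.0) + w
    target = sum(weights) / len(totals)  # equal share per image
    return [
        w * target / totals[img] if totals[img] else w
        for w, img in zip(weights, image_ids)
    ]


# Toy input: patches 1 and 3 are near-duplicates of patches 0 and 2.
patches = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.0, 0.98]]
kept = compress_tokens(patches, threshold=0.9)  # kept == [0, 2]

# Attention skewed toward the first image (id 0) before rebalancing.
weights = [0.5, 0.3, 0.1, 0.1]
image_ids = [0, 0, 1, 1]
balanced = rebalance_attention(weights, image_ids)
```

In this toy case the first image initially holds 80% of the attention mass; after rescaling, each image carries about half of it, while the ordering of weights within each image is unchanged. A real VLM would apply such steps to patch embeddings and attention maps inside the model rather than to plain Python lists.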

0 Citations

0 Influential

6.5 Altmetric

32.5 Score
