2602.04413v1 Feb 04, 2026 cs.CL

History-Guided Iterative Visual Reasoning with Self-Correction

Zhan Liu, Xinglong Yang, Zhili Peng, Haochen Shi, Shengjun Huang

Abstract

Self-consistency methods are a core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the best answer by voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed "repeated sampling and voting" paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and to adjust their reasoning dynamically during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, which enables dynamic error correction and improves answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, using Llama3.2-vision:11b on the ScienceQA dataset, the model requires an average of 2.57 responses per question to achieve an accuracy of 78.90%, representing a 107% improvement over the baseline.
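The contrast the abstract draws, between independent sampling with voting and iteration that feeds earlier answers back to the model, can be illustrated with a minimal sketch. This is not the authors' implementation: the `ask` callable, the function names, and in particular the stopping rule (stop once two consecutive answers agree, otherwise fall back to a majority vote) are assumptions made only for illustration, since the abstract does not specify them.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical model interface (an assumption, not a real library API):
# takes an image reference, a question, and the list of earlier answers,
# and returns the model's answer as a string.
AskFn = Callable[[str, str, List[str]], str]


def self_consistency(ask: AskFn, image: str, question: str,
                     n_samples: int = 5) -> str:
    """Baseline: sample independently (no history) and take a majority vote."""
    answers = [ask(image, question, []) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def history_guided(ask: AskFn, image: str, question: str,
                   max_rounds: int = 5) -> Tuple[str, int]:
    """Iterate, showing the model its earlier answers on every round.

    Assumed stopping rule: return as soon as the new answer matches the
    previous one. Returns the chosen answer and the number of responses used.
    """
    history: List[str] = []
    for _ in range(max_rounds):
        # The image is passed again on every round, so the model re-observes it.
        answer = ask(image, question, list(history))
        agreed = bool(history) and answer == history[-1]
        history.append(answer)
        if agreed:
            return answer, len(history)
    # No two consecutive answers agreed: fall back to a vote over the history.
    return Counter(history).most_common(1)[0][0], len(history)
```

Under this assumed rule, the number of responses varies per question, which is one way an average such as the reported 2.57 responses per question could arise; the paper's actual criterion may differ.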
