2603.21754v1 Mar 23, 2026 cs.CV

Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts

Qiguang Chen (SCIR) · Citations: 1,830 · h-index: 22

Libo Qin · Citations: 708 · h-index: 10

Xu Liu · Citations: 33 · h-index: 2

Yongheng Zhang · Citations: 258 · h-index: 6

Yao Li · Citations: 0 · h-index: 0

Sheng Wang · Citations: 0 · h-index: 0

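Interleaved-modal Chain-of-Thought (ICoT) reasoning has recently achieved considerable success by leveraging multimodal inputs and outputs, and it is drawing growing attention. However, current ICoT methods still have two main limitations: (1) static visual thought positioning, where visual information is inserted at fixed steps, which makes reasoning inefficient and inflexible; and (2) broken visual thought representation, where visual tokens are discontinuous and yield semantically incoherent representations. To address these limitations, we propose Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT). DaP-ICoT has two core components: (1) dynamic visual thought integration, which adaptively introduces visual inputs according to reasoning needs, reducing redundancy and improving efficiency; and (2) precise visual thought guidance, which ensures that visual representations are semantically coherent and fit the context. Experiments across multiple benchmarks and models show that DaP-ICoT achieves state-of-the-art performance. DaP-ICoT also substantially reduces the number of inserted images, cutting token usage by 72.6% and enabling more efficient ICoT reasoning.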

Original Abstract

Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration, which adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency; and (2) Precise Visual Thought Guidance, which ensures that visual representations are semantically coherent and contextually aligned. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption and enabling more efficient ICoT reasoning.
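
To make the contrast between static and dynamic visual thought positioning concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not say how DaP-ICoT decides when to insert an image, so the `visual_need` score, the `threshold` gate, the `Step` type, and every token count below are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step. `visual_need` is a hypothetical score in [0, 1]."""
    text: str
    visual_need: float


def static_positions(steps: List[Step], every: int = 1) -> List[int]:
    """Static positioning: insert a visual thought at fixed step intervals."""
    return [i for i in range(len(steps)) if i % every == 0]


def dynamic_positions(steps: List[Step], threshold: float = 0.5) -> List[int]:
    """Dynamic positioning (assumed gating rule): insert a visual thought
    only where the hypothetical need score reaches the threshold."""
    return [i for i, s in enumerate(steps) if s.visual_need >= threshold]


def token_cost(num_steps: int, num_images: int,
               text_tokens_per_step: int, tokens_per_image: int) -> int:
    """Total tokens under a simple additive cost model (an assumption)."""
    return num_steps * text_tokens_per_step + num_images * tokens_per_image


def relative_saving(baseline: int, new: int) -> float:
    """Fraction of tokens saved relative to the baseline."""
    return 1 - new / baseline


# Toy reasoning chain with made-up need scores.
steps = [
    Step("Locate the object mentioned in the question.", 0.9),
    Step("Recall what the question asks about it.", 0.1),
    Step("Restate the intermediate conclusion.", 0.2),
    Step("Check the object's color in the image.", 0.8),
]

static_idx = static_positions(steps)      # an image at every step
dynamic_idx = dynamic_positions(steps)    # images only where needed

static_cost = token_cost(len(steps), len(static_idx), 20, 100)
dynamic_cost = token_cost(len(steps), len(dynamic_idx), 20, 100)
saving = relative_saving(static_cost, dynamic_cost)
```

In this toy setup the dynamic gate inserts images at two of the four steps, so the saving comes entirely from skipped image tokens. The numbers are illustrative and unrelated to the 72.6% reduction reported in the abstract, which depends on the paper's actual benchmarks and models.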
