2601.22730v1 Jan 30, 2026 cs.CV

ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model

Xiaoshu Chen (Citations: 39; h-index: 3)
Sihang Zhou (Citations: 6,229; h-index: 42)
K. Liang (Citations: 2,216; h-index: 24)
Taichun Zhou (Citations: 2; h-index: 1)
Xinwang Liu (Citations: 9,658; h-index: 52)

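Converting long chains of thought (CoT) into compact latent tokens is important for efficient reasoning with large language models (LLMs). Recent work uses autoencoders to encode the meaning of a CoT by reconstructing the textual CoT from latent tokens. However, using textual CoT as the reconstruction target forces the latent tokens to preserve surface-level linguistic features such as word choice and syntax. This introduces a strong linguistic bias that favors linguistic form over reasoning structure and limits logical abstraction. We therefore propose ImgCoT, which uses as its reconstruction target a visual CoT, obtained by rendering the CoT as an image, instead of the textual CoT. This replaces the linguistic bias with a spatial bias, that is, a tendency to model the spatial layout of the reasoning steps in the visual CoT, so that the latent tokens better capture the overall reasoning structure. Visual latent tokens encode abstract reasoning structure, but they can blur reasoning details. We therefore also propose loose ImgCoT, a hybrid reasoning scheme that combines the visual latent tokens with a few key textual reasoning steps, selected for their low token log-likelihood. This design lets the LLM retain both the global reasoning structure and fine-grained reasoning details while using fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs show that both versions of ImgCoT are effective.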

Original Abstract

Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing textual CoT from latent tokens, thus encoding CoT semantics. However, treating textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. Thus, we propose ImgCoT that replaces the reconstruction target from textual CoT to the visual CoT obtained by rendering CoT into images. This substitutes linguistic bias with spatial inductive bias, i.e., a tendency to model spatial layouts of the reasoning steps in visual CoT, enabling latent tokens to better capture global reasoning structure. Moreover, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose a loose ImgCoT, a hybrid reasoning that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of the two versions of ImgCoT.
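The abstract says only that the key textual steps in loose ImgCoT are "selected based on low token log-likelihood"; it does not give the exact rule. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each reasoning step comes with per-token log-probabilities from some LLM, scores a step by its mean token log-probability, and keeps the `k` lowest-scoring steps in their original order. The names `Step`, `mean_logprob`, `select_key_steps`, the mean aggregation, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One textual reasoning step (hypothetical container, not from the paper)."""
    text: str
    token_logprobs: List[float]  # assumed: per-token log-probabilities from an LLM


def mean_logprob(step: Step) -> float:
    """Average token log-probability of a step (assumed scoring rule)."""
    if not step.token_logprobs:
        raise ValueError("step has no token log-probabilities")
    return sum(step.token_logprobs) / len(step.token_logprobs)


def select_key_steps(steps: List[Step], k: int) -> List[Step]:
    """Keep the k steps with the lowest mean log-probability, in original order."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Rank step indices from least likely to most likely.
    ranked = sorted(range(len(steps)), key=lambda i: mean_logprob(steps[i]))
    # Restore the original ordering so the kept steps still read as a chain.
    kept = sorted(ranked[:k])
    return [steps[i] for i in kept]


# Made-up example values, for illustration only.
example_steps = [
    Step("Let x be the number of apples.", [-0.1, -0.2, -0.1]),
    Step("Then 3x + 5 = 20, so x = 5.", [-2.0, -1.5, -2.5]),
    Step("So the answer is 5.", [-0.3, -0.1]),
    Step("Check: 3*5 + 5 = 20.", [-1.0, -1.2]),
]
selected = select_key_steps(example_steps, k=2)
for step in selected:
    print(step.text)
```

In this reading, the kept steps would be passed to the LLM as text alongside the visual latent tokens, while the remaining steps are represented only by the latent tokens. Other aggregations (minimum or sum of token log-probabilities) would fit the abstract's wording equally well.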

1 Citation

0 Influential

26 Altmetric

131.0 Score
