2602.08339v1 Feb 09, 2026 cs.AI

CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT

Yazhe Niu

Chengyi Du

Dazhong Shen

Luxin Xu

Abstract

Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations, which often leads to missed higher-level semantic structure and non-causal relational understanding, hindering compositional and verifiable reasoning. To address these limitations by introducing human models into the reasoning process, we propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT) to further strengthen VLMs' hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33 percent on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
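
As a rough illustration of what "stepwise feedback on reasoning coherence and factual correctness" could look like, the sketch below scores a reasoning chain one step at a time. The abstract does not specify how CCVR is computed, so everything here is an assumption made for illustration: the `Step` record, the `stepwise_reward` function, the integer `level` encoding of global-to-local depth, the boolean output of a hypothetical verifier, and the equal weights. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    claim: str       # statement made at this reasoning step
    level: int       # assumed encoding: 0 = global scene, larger = more local detail
    supported: bool  # assumed output of a hypothetical fact verifier


def stepwise_reward(steps: List[Step],
                    w_coherence: float = 0.5,
                    w_fact: float = 0.5) -> float:
    """Toy reward: mean over steps of (coherence term + factual term).

    Coherence is approximated by global-to-local ordering: a step counts as
    coherent if it is at least as local as the deepest level reached so far.
    """
    if not steps:
        return 0.0
    total = 0.0
    deepest = 0
    for step in steps:
        coherent = step.level >= deepest
        total += w_coherence * float(coherent) + w_fact * float(step.supported)
        deepest = max(deepest, step.level)
    return total / len(steps)


chain = [
    Step("A kitchen scene with a person at the counter", 0, True),
    Step("The person is holding a knife", 1, True),
    Step("The scene is outdoors", 0, False),  # jumps back to global and is unsupported
]
reward = stepwise_reward(chain)
```

In this toy chain the first two steps earn full credit and the third earns none, so the per-step average penalizes a chain that breaks the coarse-to-fine order or asserts something the verifier rejects. A real reward would need a learned or rule-based verifier in place of the boolean flag.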
