2601.02991v1 Jan 06, 2026 cs.CV

Faithful Comic-Based Reasoning for Small Multimodal Large Language Models (MLLMs)

Towards Faithful Reasoning in Comics for Small MLLMs

Chengcheng Feng

Hao Yin

Yucheng Jin

Kaizhu Huang

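Comic-based visual question answering (CVQA) relies on symbolic abstraction, narrative logic, and humor, so it poses challenges to multimodal large language models (MLLMs) that differ from those of conventional visual question answering tasks. Chain-of-Thought (CoT) prompting is widely used to improve MLLM reasoning, yet applying it directly to CVQA often degrades performance, and the effect is most pronounced in small models. Our theoretical and empirical analyses show that standard CoT in CVQA suffers from state entanglement, spurious transitions, and exploration inefficiency, and that these problems are more severe for small models in resource-constrained settings. To address them, we propose a new comic reasoning framework designed to produce more faithful and transferable reasoning chains in small MLLMs. Specifically, the framework combines modular CoT generation, GRPO-based reinforcement fine-tuning, and a new structured reward. Beyond comic-based visual question answering, we also evaluate the approach on a broader set of humor-centric and abstract visual reasoning tasks, including meme understanding and editorial cartoon interpretation. Across five challenging benchmarks, our 3-billion-parameter model outperforms state-of-the-art methods, and plug-in experiments on different MLLMs yield an additional average improvement of $\mathbf{12.1\%}$.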

Original Abstract

Comic-based visual question answering (CVQA) poses distinct challenges to multimodal large language models (MLLMs) due to its reliance on symbolic abstraction, narrative logic, and humor, which differ from conventional VQA tasks. Although Chain-of-Thought (CoT) prompting is widely used to enhance MLLM reasoning, surprisingly, its direct application to CVQA often degrades performance, especially in small-scale models. Our theoretical and empirical analyses reveal that standard CoT in CVQA suffers from state entanglement, spurious transitions, and exploration inefficiency, with small models particularly vulnerable in resource-constrained settings. To address these issues, we propose a novel comic reasoning framework, designed to produce more faithful and transferable reasoning chains in small MLLMs. Specifically, our framework combines modular CoT generation with GRPO-based reinforcement fine-tuning and a novel structured reward. Beyond comic VQA, we further evaluate our approach on a broader class of humor-centric and abstract visual reasoning tasks, including meme understanding and editorial cartoon interpretation. Across five challenging benchmarks, our 3B model outperforms state-of-the-art methods, and plug-in experiments yield an additional average improvement of $\mathbf{12.1\%}$ across different MLLMs.
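The abstract names GRPO-based reinforcement fine-tuning but does not spell out the update. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, the use of the population standard deviation, and the `eps` term are all assumptions made for this example; the paper's structured reward is not described here and is not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumption) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four reasoning chains sampled for one comic question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and those below it a negative one, so no separate value model is needed to provide a baseline.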
