2602.10625v1 Feb 11, 2026 cs.AI

To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks

Yanjie Fu (Citations: 237, h-index: 8)

Nanxu Gong (Citations: 178, h-index: 8)

Jianxun Lian (Citations: 6,197, h-index: 30)

Sixun Dong (Citations: 101, h-index: 6)

Xing Xie (Citations: 240, h-index: 9)

Haotian Li (Citations: 953, h-index: 19)

Abstract

Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, which is essential for natural social interaction. Although recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding, whether this benefit transfers to socio-cognitive skills remains underexplored. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models on three representative ToM benchmarks. The results show that reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. A fine-grained analysis reveals three insights. First, slow thinking collapses: accuracy drops significantly as responses grow longer, and larger reasoning budgets hurt performance. Second, moderate and adaptive reasoning benefits performance: constraining reasoning length mitigates failure, while distinct success patterns demonstrate the necessity of dynamic adaptation. Third, reasoning models take an option-matching shortcut: when multiple-choice options are removed, they improve markedly, indicating reliance on option matching rather than genuine deduction. We also design two interventions, Slow-to-Fast (S2F) adaptive reasoning and Think-to-Match (T2M) shortcut prevention, to further verify and mitigate these problems. Taken together, our results show that the advances of LRMs in formal reasoning (e.g., math, code) do not fully transfer to ToM, a typical social reasoning task. We conclude that achieving robust ToM requires developing unique capabilities beyond existing reasoning methods.
