2602.17831v1 Feb 19, 2026 cs.AI

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

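As large language models advance, evaluating their reasoning ability is becoming increasingly difficult. Curating hard questions by hand is very costly, particularly for recent benchmarks that draw on PhD-level domain knowledge to test the most capable models. Even so, there is always a concern about whether such questions assess genuine reasoning or whether the model has seen similar problems during training. In this work, taking inspiration from 16th-century mathematical duels, we design The Token Games (TTG), an evaluation framework in which models create their own puzzles and duel one another. We use the Programming Puzzles format, in which the task is to find an input that makes a given boolean-returning Python function return 'True', so that problems can be expressed flexibly and solutions can be verified. We then compute Elo ratings from the results of the pairwise duels, which lets us compare models relative to each other. We evaluate 10 frontier models on TTG and obtain a ranking that closely matches existing benchmarks such as Humanity's Last Exam, without any human effort in creating puzzles. We also find that creating good puzzles, a skill not measured by previous benchmarks, remains highly challenging for current models. Overall, our work presents a new paradigm for reasoning evaluation that cannot be saturated by design, and that allows testing other model skills such as creativity and task creation alongside problem solving.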

Original Abstract

Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a Python function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity's Last Exam, without involving any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models, not measured by previous benchmarks. Overall, our work suggests new paradigms for evaluating reasoning that cannot be saturated by design, and that allow testing models for other skills like creativity and task creation alongside problem solving.
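To make the two mechanisms named in the abstract concrete, here is a minimal Python sketch of a programming puzzle with its verifier, followed by a standard Elo update. Everything in it is an illustrative assumption rather than the paper's implementation: the example puzzle, the helper names `verify`, `expected_score`, and `elo_update`, the K-factor of 32, and the starting rating of 1500 do not come from the abstract, which does not state how duels are scored or which Elo parameters are used.

```python
from typing import Any, Callable, Tuple


def puzzle(x: int) -> bool:
    """Hypothetical puzzle: find a positive integer whose square ends in 444."""
    return x > 0 and (x * x) % 1000 == 444


def verify(puzzle_fn: Callable[[Any], bool], candidate: Any) -> bool:
    """Check a proposed solution by running the puzzle function on it.

    A candidate that raises an exception is treated as a failed attempt
    (an assumption; the abstract does not say how errors are handled).
    """
    try:
        return puzzle_fn(candidate) is True
    except Exception:
        return False


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one duel.

    score_a is 1.0 if A wins, 0.5 for a draw, and 0.0 if A loses.
    The K-factor default is an assumed value, not taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # 38 * 38 = 1444, so 38 solves the puzzle; 12 * 12 = 144 does not.
    print(verify(puzzle, 38), verify(puzzle, 12))
    # Two equally rated models; the first wins the duel.
    print(elo_update(1500.0, 1500.0, 1.0))
```

The point of the format is that checking a solution only requires executing the puzzle function, so no human grader is needed, while the Elo update turns a sequence of such pairwise outcomes into a relative ranking.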
