2602.16902v3 Feb 18, 2026 cs.AI

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

Juliusz Ziomek (University of Oxford; Citations: 103; h-index: 5)

William Bankes (Citations: 48; h-index: 4)

Lorenz Wolf (Citations: 2; h-index: 1)

S. Ramesh (Citations: 200; h-index: 6)

Xiaohang Tang (Citations: 54; h-index: 4)

Ilija Bogunovic (Citations: 1,941; h-index: 24)

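This paper introduces LLM-Wikirace, a benchmark for evaluating the planning ability, reasoning ability, and world knowledge of large language models (LLMs). In LLM-Wikirace, a model must navigate Wikipedia hyperlinks step by step, and efficiently, to reach a target page from a given start page, which requires look-ahead planning and reasoning about how real-world concepts are connected. An evaluation of a range of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, shows that they perform strongly at the easy difficulty, exceeding human performance. At the hard difficulty, however, performance drops sharply: even the best-performing model, Gemini-3, succeeds on only 23% of hard problems. The analysis indicates that world knowledge is necessary for success, but beyond a certain level, planning and long-horizon reasoning play the larger role. Trajectory-level analysis further shows that even the strongest models struggle to replan after a failure and often repeat the same errors instead of recovering. LLM-Wikirace is a simple benchmark that exposes clear limits of current reasoning systems and suggests that planning-capable LLMs still have considerable room to improve. The code and leaderboard are available at https://llmwikirace.github.io.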

Original Abstract

We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.
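
The task described in the abstract can be illustrated with a small sketch. This is not the benchmark's code: the link graph, the page names, the step budget, and the two policies below are all hypothetical stand-ins (a real run would query an LLM to choose each link). It shows only the shape of the game, the shortest-path reference that "efficient" navigation is measured against, and the kind of loop that the trajectory analysis refers to.

```python
from collections import deque

# Toy link graph (hypothetical data, not taken from the benchmark).
LINKS = {
    "Oxford": ["England", "University"],
    "England": ["Oxford", "Europe"],
    "University": ["Education", "Oxford"],
    "Europe": ["France", "England"],
    "Education": ["University"],
    "France": ["Paris", "Europe"],
    "Paris": ["France"],
}


def shortest_path_length(links, source, target):
    """Breadth-first search: minimum number of clicks, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def play_game(links, source, target, choose_link, max_steps=10):
    """Run one episode with a step budget; return (success, trajectory)."""
    page = source
    trajectory = [page]
    for _ in range(max_steps):
        if page == target:
            return True, trajectory
        # The policy sees the current page, its outgoing links, the target,
        # and the pages visited so far. An LLM call would go here.
        page = choose_link(page, links[page], target, trajectory)
        trajectory.append(page)
    return page == target, trajectory


def has_loop(trajectory):
    """True if any page was visited more than once."""
    return len(set(trajectory)) < len(trajectory)


def first_link_policy(page, options, target, trajectory):
    """Stand-in policy that ignores history and always takes the first link."""
    return options[0]


def avoid_revisits_policy(page, options, target, trajectory):
    """Stand-in policy that prefers a link it has not visited yet."""
    for option in options:
        if option not in trajectory:
            return option
    return options[0]


if __name__ == "__main__":
    print("optimal clicks:", shortest_path_length(LINKS, "Oxford", "Paris"))
    for policy in (first_link_policy, avoid_revisits_policy):
        ok, path = play_game(LINKS, "Oxford", "Paris", policy)
        print(policy.__name__, ok, has_loop(path), " -> ".join(path))
```

On this toy graph the history-blind policy bounces between two pages until the step budget runs out, while the policy that tracks visited pages reaches the target. The actual benchmark's difficulty levels, step limits, and scoring are defined by the authors and are not reproduced here.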

0 Citations

0 Influential

12 Altmetric

60.0 Score

