2604.03374v1 Apr 03, 2026 cs.CL

CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

Antoine Bosselut

EPFL

Citations: 14,518

h-index: 36

Mete Ismayilzada

EPFL

Citations: 440

h-index: 8

Lonneke van der Plas

Citations: 129

h-index: 6

Renqing Cuomao

Citations: 0

h-index: 0

Daniil Yurshevich

Citations: 0

h-index: 0

Anna Sotnikova

University of Maryland

Citations: 131

h-index: 4

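Creative problem-solving combines several cognitive abilities, such as logical reasoning, lateral thinking, analogy, and common sense, to connect seemingly unrelated pieces of information and arrive at an insight. Most evaluation benchmarks for large language models (LLMs), however, assess only individual elements of this process. Many creativity-related benchmarks also depend on artificially constructed riddles or fictional scenarios that do not reflect how problems are solved in the real world. To close this gap, we introduce CresOWLve, a benchmark that evaluates creative problem-solving grounded in real-world knowledge. Solving a CresOWLve problem requires applying a range of creative thinking strategies, retrieving facts from diverse domains, and combining them creatively. Our evaluation of several state-of-the-art non-thinking and thinking LLMs shows that CresOWLve remains very difficult. The analysis finds that models perform much better on fact-based questions and suffer a substantial performance drop on creative questions (up to 17%). Models are good at retrieving the relevant knowledge, but they struggle to integrate it and to form the non-obvious creative connections needed to reach the correct answer.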

Original Abstract

Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier non-thinking and thinking LLMs, we show that CresOWLve remains highly challenging. Our analysis reveals a consistent performance gap: models perform substantially better on factual questions than on creative ones (up to a -17% drop). While models can often retrieve the relevant knowledge, they struggle to form the non-obvious creative connections required to integrate this information and arrive at the correct answer.
