2602.01075v1 Feb 01, 2026 cs.AI

ConvexBench: Can LLMs Recognize Convex Functions?

Yuheng Bu (Citations: 149, h-index: 6)

Ye Liu (Citations: 282, h-index: 10)

Yu Huang (Citations: 235, h-index: 5)

Yu-Xiang Wang (Citations: 0, h-index: 0)

Yingbin Liang (Citations: 169, h-index: 5)

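Convex analysis is a branch of modern mathematics with many applications. As large language models (LLMs) begin to automate research-level mathematics and science, it is important that they demonstrate the ability to understand and reason about convexity. We introduce ConvexBench, a scalable and mechanically verifiable benchmark that tests whether LLMs can determine the convexity of a symbolic objective function under deep functional composition. Experiments on frontier LLMs reveal a sharp compositional reasoning gap: performance degrades rapidly as depth increases, with the F1 score falling from 1.0 at depth 2 to about 0.2 at depth 100. Analysis of the models' reasoning traces identifies two failure modes: parsing failure and lazy reasoning. To address these limitations, we propose an agentic divide-and-conquer framework that (i) offloads parsing to an external tool to build an abstract syntax tree (AST) and (ii) enforces recursive reasoning over each intermediate sub-expression within a focused context. The framework reliably mitigates failures under deep composition and delivers substantial gains at large depths (e.g., an F1 score of 1.0 at depth 100).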

Original Abstract

Convex analysis is a modern branch of mathematics with many applications. As Large Language Models (LLMs) start to automate research-level math and sciences, it is important for LLMs to demonstrate the ability to understand and reason with convexity. We introduce ConvexBench, a scalable and mechanically verifiable benchmark for testing whether LLMs can identify the convexity of a symbolic objective under deep functional composition. Experiments on frontier LLMs reveal a sharp compositional reasoning gap: performance degrades rapidly with increasing depth, dropping from an F1-score of 1.0 at depth 2 to approximately 0.2 at depth 100. Inspection of models' reasoning traces indicates two failure modes: parsing failure and lazy reasoning. To address these limitations, we propose an agentic divide-and-conquer framework that (i) offloads parsing to an external tool to construct an abstract syntax tree (AST) and (ii) enforces recursive reasoning over each intermediate sub-expression with focused context. This framework reliably mitigates deep-composition failures, achieving substantial performance improvement at large depths (e.g., F1-score = 1.0 at depth 100).
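
To make the divide-and-conquer idea concrete, here is a minimal sketch of the two steps named in the abstract: an external parser builds the AST, and a verdict is then derived recursively for each sub-expression. It is not the authors' implementation. It assumes Python's standard `ast` module as the parsing tool, and it substitutes the standard disciplined-convex-programming composition rules for the per-node LLM reasoning. The `ATOMS` table and the function names are hypothetical, chosen only for illustration. The rules are sufficient, not necessary, so "unknown" means "not certified" rather than "not convex".

```python
import ast

CONVEX, CONCAVE, AFFINE, UNKNOWN = "convex", "concave", "affine", "unknown"

# Hypothetical atom table: name -> (curvature, monotonicity on its domain).
ATOMS = {
    "exp": (CONVEX, "inc"),
    "log": (CONCAVE, "inc"),
    "sqrt": (CONCAVE, "inc"),
    "abs": (CONVEX, "none"),
    "square": (CONVEX, "none"),
}

def negate(c):
    """Negation swaps convex and concave; affine and unknown are unchanged."""
    return {CONVEX: CONCAVE, CONCAVE: CONVEX}.get(c, c)

def add(a, b):
    """A sum keeps a curvature only if both terms agree (affine fits either)."""
    if UNKNOWN in (a, b):
        return UNKNOWN
    if a == AFFINE:
        return b
    if b == AFFINE:
        return a
    return a if a == b else UNKNOWN

def compose(outer, mono, inner):
    """Composition rule for outer(inner(x)) with a scalar outer atom."""
    if inner == UNKNOWN:
        return UNKNOWN
    if inner == AFFINE:
        return outer
    if mono == "inc" and inner == outer:
        return outer
    if mono == "dec" and inner == negate(outer):
        return outer
    return UNKNOWN

def curvature(node):
    """Recursively label one AST node from the labels of its children."""
    if isinstance(node, ast.Expression):
        return curvature(node.body)
    if isinstance(node, (ast.Name, ast.Constant)):
        return AFFINE
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return negate(curvature(node.operand))
    if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Sub)):
        left = curvature(node.left)
        right = curvature(node.right)
        if isinstance(node.op, ast.Sub):
            right = negate(right)
        return add(left, right)
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in ATOMS and len(node.args) == 1):
        outer, mono = ATOMS[node.func.id]
        return compose(outer, mono, curvature(node.args[0]))
    return UNKNOWN  # products, unknown atoms, etc. are not certified here

def classify(expr):
    """Parse with the external tool (ast), then reason node by node."""
    return curvature(ast.parse(expr, mode="eval"))

print(classify("exp(abs(x) + square(y))"))  # convex
print(classify("log(sqrt(x)) - exp(x)"))    # concave
print(classify("exp(log(x))"))              # unknown
```

Because each node is decided only from the labels of its children, the work per step stays small regardless of nesting depth, which is the property the framework relies on. The last example shows the limit of this rule-based stand-in: `exp(log(x))` equals `x` on its domain, yet the composition rules alone cannot certify it.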
