2606.10479v1 Jun 09, 2026 cs.AI

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Xiaoye Qu
Qianjia Cheng
Yuchen Su
Yafu Li
Yu Cheng
Haoran Zhang
Zhilin Wang
Shunkai Zhang
Hao Lei
Xinmiao Han
Zhouchen Lin
Yunmeng Luo
Yizhuo Li
Runzhe Zhan
University of Macau
Bangjie Xu
Dongrui Liu
Yueru Qiao
Ning Ding

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.
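
The abstract names two mechanical pieces of the evaluation: deterministic verification of a submitted construction, and aggregation of per-attempt scores into Avg. and Best@4. The sketch below illustrates both under stated assumptions and is not the paper's implementation. The toy task (a subset of 1..n with no three-term arithmetic progression), the function names `is_ap_free` and `avg_and_best_at_k`, and the score values are all hypothetical. It also assumes that Avg. is the mean score over all attempts and that Best@4 is the mean over problems of the best of four attempts; the abstract does not define either metric precisely.

```python
from itertools import combinations
from statistics import mean


def is_ap_free(candidate, n):
    """Toy deterministic verifier (hypothetical task, not from the paper).

    Accepts a list of integers and returns True only if the entries are
    distinct, lie in 1..n, and contain no three-term arithmetic progression.
    """
    items = sorted(candidate)
    if len(set(items)) != len(items):
        return False
    if any(x < 1 or x > n for x in items):
        return False
    members = set(items)
    # a < c are distinct, so an integer midpoint lies strictly between them.
    for a, c in combinations(items, 2):
        if (a + c) % 2 == 0 and (a + c) // 2 in members:
            return False
    return True


def avg_and_best_at_k(scores_per_problem):
    """Aggregate per-attempt scores under the assumed metric definitions.

    scores_per_problem: one inner list per problem, holding the k
    per-attempt scores in [0, 1].
    Returns (avg, best_at_k).
    """
    avg = mean(mean(attempts) for attempts in scores_per_problem)
    best_at_k = mean(max(attempts) for attempts in scores_per_problem)
    return avg, best_at_k


# Made-up scores for three problems with four attempts each.
example_scores = [
    [1.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
avg, best_at_4 = avg_and_best_at_k(example_scores)
```

Because the verifier depends only on the submitted object, it can reject a construction even when the accompanying proof text would score well under a rubric, which is the kind of divergence the abstract describes.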
