2605.07147v1 May 08, 2026 cs.LO

MathlibPR: 형식 수학 라이브러리의 풀 리퀘스트 병합 준비도 벤치마크

MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

Xinyu Liu

Citations: 50

h-index: 5

Zixuan Xie

Citations: 44

h-index: 4

Shangtong Zhang

Citations: 30

h-index: 3

Lean 및 Mathlib 생태계는 최근 몇 년 동안 대규모 언어 모델(LLM) 기반 형식 추론 분야에서 놀라운 성공을 거두며 사실상 표준으로 자리 잡았습니다. 그러나 이러한 성공은 Mathlib을 필수적인 의존성으로만 활용하며, Mathlib 자체에 직접적인 기여를 하지 않습니다. 한편, Mathlib의 성장은 최근 제안된 풀 리퀘스트(PR)가 Mathlib의 규칙을 준수하고 공유된 수학 인프라의 일부로 통합될 가치가 있는지 여부를 판단하는 인간 검토 과정으로 인해 병목 현상을 겪고 있습니다. 이러한 상황에서 우리는 다음과 같은 핵심 질문을 던집니다. LLM이 Mathlib PR 검토에 도움이 될 수 있을까요? 이를 위해 실제 Mathlib4 PR 이력을 기반으로 구축된 벤치마크인 MathlibPR을 소개합니다. 또한, 단계별 평가 프로토콜을 제안하고 이를 사용하여 다양한 LLM 모델(예: DeepSeek, Qwen, Goedel, Kimina) 및 LLM 에이전트(예: Codex, Claude Code)를 평가했습니다. 놀랍게도, LLM 모델과 LLM 에이전트 모두 병합 가능한 PR과 빌드 통과 후 수정되거나 병합되지 않은 PR을 구별하는 데 어려움을 겪었습니다. Mathlib PR 이력을 지도 학습 신호로 활용함으로써, MathlibPR은 PR 검토를 지원하고 LLM이 병합 가능한 Mathlib 기여를 생성하도록 유도할 수 있는 검토 보조 도구 및 보상 모델 개발을 위한 중요한 단계입니다.

Original Abstract

The ecosystem of Lean and Mathlib has become the de facto standard for large language model (LLM) assisted formal reasoning with remarkable successes in recent years. Those successes, however, only consume Mathlib as an essential dependency but do not directly contribute to it. In the meantime, the growth of Mathlib has recently been bottlenecked by the review process, which requires human reviewers to judge whether proposed pull requests (PRs) follow the Mathlib's conventions and are worth integrating as part of a shared mathematical infrastructure. This leads to our central question: can LLMs help review Mathlib PRs? To this end, we introduce MathlibPR, a benchmark built from real Mathlib4 PR histories. We further propose a staged evaluation protocol and use it to evaluate both LLM models (e.g., DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (e.g., Codex and Claude Code). Surprisingly, both LLM models and LLM agents struggle to distinguish merge-ready PRs from build-passing PRs that were revised or never merged. By turning Mathlib PR histories into a supervised signal, MathlibPR provides a step toward reviewer assistants and reward models that could help evaluate PRs and steer LLMs toward producing merge-ready Mathlib contributions.

1 Citations

0 Influential

2.5 Altmetric

13.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!