2603.21065v1 Mar 22, 2026 cs.AI

LongCat-Flash-Prover: 에이전트 기반 도구 통합 강화 학습을 통한 형식적 추론 능력 향상

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Peng Pei

Citations: 258

h-index: 10

Xunliang Cai

Citations: 74

h-index: 5

Dengchang Zhao

Citations: 111

h-index: 4

Linsen Guo

Citations: 175

h-index: 6

Meng Shen

Citations: 149

h-index: 4

Xiangyu Xi

Citations: 105

h-index: 5

Yunke Zhao

Citations: 108

h-index: 4

Zhimin Lin

Citations: 25

h-index: 2

Chong Peng

Citations: 42

h-index: 4

Wei Wang

Citations: 3,601

h-index: 8

Jianfei Zhang

Citations: 50

h-index: 3

Cunguang Wang

Citations: 36

h-index: 3

Qianyun Guo

Citations: 12

h-index: 2

Shengnan An

Xi'an Jiaotong University

Citations: 702

h-index: 10

Qi Li

Citations: 95

h-index: 5

Yi Lu

Citations: 20

h-index: 3

Wei Shi

Citations: 36

h-index: 3

Jianing Wang

Citations: 31

h-index: 3

Rumei Li

Citations: 13

h-index: 3

Chao Zhang

Citations: 1,190

h-index: 4

Jia-Liang Shi

Citations: 24

h-index: 3

Jingang Wang

Citations: 61

h-index: 3

Liulin Feng

Citations: 4

h-index: 1

Shuna Wang

Citations: 29

h-index: 3

Xiaoyu Li

Citations: 113

h-index: 6

Xuezhi Cao

Citations: 56

h-index: 4

Zhengyu Chen

Citations: 83

h-index: 5

본 논문에서는 LongCat-Flash-Prover를 소개합니다. LongCat-Flash-Prover는 5600억 개의 파라미터를 가진 오픈 소스 Mixture-of-Experts (MoE) 모델로서, 에이전트 기반 도구 통합 추론(TIR)을 통해 Lean4 환경에서 형식적 추론 능력을 향상시킵니다. 우리는 형식적 추론 작업을 자동 형식화, 스케치, 증명이라는 세 가지 독립적인 형식적 능력으로 분해했습니다. 이러한 능력을 지원하기 위해, 고품질의 작업 경로를 확장하는 하이브리드 전문가 반복 프레임워크를 제안합니다. 이는 주어진 비형식적 문제로부터 형식적인 명제를 생성하거나, 명제로부터 전체 증명을 직접 생성하거나, 보조정리 형태의 스케치를 생성하는 것을 포함합니다. 에이전트 기반 강화 학습 과정에서, 우리는 계층적 중요 샘플링 정책 최적화(HisPO) 알고리즘을 제시합니다. 이 알고리즘은 MoE 모델의 장기적인 학습을 안정화시키는 것을 목표로 하며, 정책의 노후화 및 시퀀스 및 토큰 수준에서의 훈련-추론 엔진 간의 불일치를 고려하는 그래디언트 마스킹 전략을 사용합니다. 또한, 보조정리의 일관성 및 유효성 검사 메커니즘을 통합하여 보상 해킹 문제를 해결합니다. 광범위한 실험 결과, LongCat-Flash-Prover는 자동 형식화 및 정리 증명 분야에서 오픈 소스 모델의 새로운 최고 성능을 달성했습니다. 뛰어난 샘플 효율성을 보여주며, MiniF2F-Test에서 문제당 72개의 추론 예산만으로 97.1%의 성공률을 달성했습니다. 더 어려운 벤치마크에서는 ProverBench의 70.8%와 PutnamBench의 41.5%를 문제당 220회 시도 이내에 해결하며, 기존의 오픈 소스 모델보다 훨씬 뛰어난 성능을 보였습니다.

Original Abstract

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.

4 Citations

2 Influential

5 Altmetric

33.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!