2601.14716v1 Jan 21, 2026 cs.LG

PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning

Fan Xu (Citations: 4, h-index: 1)
Yao Lu (Citations: 4, h-index: 1)
Bin Zhou (Citations: 14, h-index: 2)
Dengdong Fan (Citations: 6, h-index: 1)
Jianzheng Nie (Citations: 1, h-index: 1)
Jie Chen (Citations: 169, h-index: 5)
Yonghong Tian (Citations: 7, h-index: 1)

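This paper introduces PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built on Qwen2.5-32B and improved through supervised fine-tuning (SFT) and reinforcement learning (RL). The key innovation is the proposed offline reinforcement learning method, which offers better training stability and efficiency than existing online reinforcement learning methods such as GRPO. PCL-Reasoner-V1.5 achieves state-of-the-art performance among models further trained from Qwen2.5-32B, with average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. The study shows that offline reinforcement learning is a stable and efficient way to improve the reasoning ability of LLMs. All experiments were run on Huawei Ascend 910C NPUs.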

Original Abstract

We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.

2 Citations, 0 Influential, 2.5 Altmetric, 14.5 Score
