2601.03205v1 Jan 06, 2026 cs.CL

UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward

Yile Liu (Citations: 6, h-index: 1)
Yixian Liu (Citations: 2, h-index: 1)
Zong-Rui Li (Citations: 405, h-index: 2)
Yufei Huang (Citations: 33, h-index: 3)
Xinhua Feng (Citations: 29, h-index: 2)
Zhichao Hu (Citations: 41, h-index: 4)
Jinglu Hu (Citations: 6, h-index: 1)
Jia-Xin Yan (Citations: 9, h-index: 2)
Fengzong Lian (Citations: 206, h-index: 9)
Yuhong Liu (Citations: 284, h-index: 3)

Abstract

While Large Language Models (LLMs) have demonstrated significant potential in natural language processing, complex general-purpose reasoning requiring multi-step logic, planning, and verification remains a critical bottleneck. Although Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in specific domains, the field lacks large-scale, high-quality, and difficulty-calibrated data for general reasoning. To address this, we propose UltraLogic, a framework that decouples the logical core of a problem from its natural language expression through a Code-based Solving methodology to automate high-quality data production. The framework comprises hundreds of unique task types and an automated calibration pipeline across ten difficulty levels. Furthermore, to mitigate binary reward sparsity and the Non-negative Reward Trap, we introduce the Bipolar Float Reward (BFR) mechanism, utilizing graded penalties to effectively distinguish perfect responses from those with logical flaws. Our experiments demonstrate that task diversity is the primary driver for reasoning enhancement, and that BFR, combined with a difficulty matching strategy, significantly improves training efficiency, guiding models toward global logical optima.
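The abstract does not give the BFR formula, so the following is only a rough sketch of the general idea of a bipolar, graded reward, set against a binary one. It assumes a hypothetical partial-correctness score in [0, 1] produced by some verifier, and a hypothetical linear mapping in which only a perfect answer earns a positive reward. The function names and the mapping are illustrative assumptions, not the authors' definition.

```python
def binary_reward(score: float) -> float:
    """Sparse baseline: 1 for a fully correct answer, otherwise 0."""
    return 1.0 if score >= 1.0 else 0.0


def bipolar_float_reward(score: float) -> float:
    """Hypothetical bipolar mapping (an assumption, not the paper's formula).

    `score` is an assumed partial-correctness value in [0, 1].
    A perfect response gets +1; any flawed response gets a graded
    penalty in [-1, 0), so it is never rewarded with a non-negative value.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score == 1.0:
        return 1.0
    return score - 1.0  # closer to correct means a smaller penalty


# Three responses: perfect, mostly right with a flaw, entirely wrong.
scores = [1.0, 0.75, 0.0]
print([binary_reward(s) for s in scores])         # [1.0, 0.0, 0.0]
print([bipolar_float_reward(s) for s in scores])  # [1.0, -0.25, -1.0]
```

Under the binary scheme the flawed and the entirely wrong responses are indistinguishable, whereas the bipolar sketch separates all three while keeping every imperfect response below zero.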

Citations: 1, Influential: 0, Altmetric: 4.5, Score: 23.5
