2604.12627v1 Apr 14, 2026 cs.AI

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

Hua Wu (Citations: 284, h-index: 10)

Shuaiyi Nie (Citations: 55, h-index: 5)

Naibin Gu (Citations: 93, h-index: 5)

Linhao Yu (Citations: 489, h-index: 9)

Renren Jin (Citations: 985, h-index: 12)

D. Xiong (Citations: 59, h-index: 4)

Weichong Yin (Citations: 837, h-index: 9)

Yu Sun (Citations: 17, h-index: 2)

Tianmeng Yang (Citations: 61, h-index: 5)

Siyu Ding (Citations: 110, h-index: 4)

Xiangzhao Hao (Citations: 26, h-index: 4)

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduces redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox (removing one KP may help, while removing several such KPs together can hurt) and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.
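The pruning interaction paradox described in the abstract can be illustrated with a toy example. The sketch below is not the paper's CSS procedure, whose details are not given here; it is a minimal illustration under stated assumptions. The scoring function `toy_score`, the KP names "A", "B", "C", and both helper functions are hypothetical. The toy assumes that KPs "A" and "B" are redundant with each other, that "C" is useless, and that every included KP carries a small overhead cost. Under those assumptions, judging each KP's removal in isolation discards everything, while a joint search over small subsets keeps one useful KP.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Sequence, Tuple

Score = Callable[[FrozenSet[str]], int]


def toy_score(subset: FrozenSet[str]) -> int:
    """Hypothetical accuracy (in points) when training with this KP subset.

    Assumption: "A" and "B" are interchangeable (either one unlocks +30),
    and every included KP costs 2 points of overhead.
    """
    useful = 30 if ("A" in subset or "B" in subset) else 0
    return 50 + useful - 2 * len(subset)


def independent_pruning(kps: Sequence[str], score: Score) -> FrozenSet[str]:
    """Drop every KP whose removal, tested alone, does not lower the score.

    This ignores interactions between removals, which is what triggers
    the paradox on redundant KPs.
    """
    full = frozenset(kps)
    base = score(full)
    removable = {k for k in kps if score(full - {k}) >= base}
    return full - removable


def exhaustive_subset_search(
    kps: Sequence[str], score: Score, max_size: int
) -> Tuple[FrozenSet[str], int]:
    """Score every subset up to max_size and return the best one.

    Ties go to the smaller (and earlier-enumerated) subset because sizes
    are visited in increasing order and only strict improvements replace
    the incumbent. Exhaustive search is only feasible for a handful of KPs.
    """
    best: FrozenSet[str] = frozenset()
    best_score = score(best)
    for size in range(1, max_size + 1):
        for combo in combinations(kps, size):
            candidate = frozenset(combo)
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
    return best, best_score


if __name__ == "__main__":
    kps = ["A", "B", "C"]
    pruned = independent_pruning(kps, toy_score)
    print("independent pruning:", sorted(pruned), toy_score(pruned))
    print("joint search:", exhaustive_subset_search(kps, toy_score, max_size=2))
```

In this toy setting each single removal raises the score from 74 to 76, so independent pruning removes all three KPs and falls back to the no-hint score of 50. The joint search instead settles on a single useful KP with a score of 78, which is the kind of compact, interaction-aware subset the abstract refers to.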

5 Citations

0 Influential

45.944920232821 Altmetric

234.7 Score
