2602.00400v1 Jan 30, 2026 cs.AI

KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning

Fan Yang (Citations: 564, h-index: 8)

Yuxin Wen (Citations: 48, h-index: 4)

Rui Meng (Citations: 19, h-index: 3)

Trudi Di Qi (Citations: 30, h-index: 3)

A. Ezzati (Citations: 26, h-index: 3)

Original Abstract

Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging because rewards are sparse and assigned only at the trajectory level, which leads to ambiguous credit assignment and severe exploration failures that can trap the policy in a "learning cliff." Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified post-training framework that integrates: (i) a quality-gated on-policy distillation objective that applies dense teacher guidance only to high-quality trajectories, and (ii) a knowledge-enhanced exploration strategy that leverages hints learned from a teacher model to rejection-sample reward-positive on-policy trajectories for RL, thereby mitigating exploration collapse. Evaluated on a challenging medical visual question answering benchmark under single-source generalization, KEPO demonstrates improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance over reinforcement learning and on-policy distillation baselines.
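The abstract gives no equations or pseudocode, so the following is only a minimal sketch of how the two components it names might look in code. It is not the authors' implementation. Everything here is an assumption made for illustration: the `Trajectory` container, the function names `gated_distill_loss` and `explore_with_hints`, the reward threshold used as the quality gate, the per-token log-probability gap used as a sampled reverse-KL style distillation signal, and the retry budget `max_retries`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """Hypothetical container for one sampled response."""
    tokens: List[str]
    reward: float                # sparse trajectory-level reward (e.g. 1.0 if correct)
    student_logps: List[float]   # log-prob of each sampled token under the student
    teacher_logps: List[float]   # log-prob of the same tokens under the teacher


def gated_distill_loss(batch: List[Trajectory], min_reward: float = 1.0) -> float:
    """Quality-gated distillation signal (assumed form).

    Averages the per-token gap (student log-prob minus teacher log-prob)
    over trajectories whose reward passes the gate. Trajectories below the
    gate contribute nothing, so flawed contexts add no teacher gradient.
    """
    total, count = 0.0, 0
    for traj in batch:
        if traj.reward < min_reward:
            continue  # gate: skip low-quality trajectories entirely
        for s, t in zip(traj.student_logps, traj.teacher_logps):
            total += s - t
            count += 1
    return total / count if count else 0.0


def explore_with_hints(
    sample: Callable[[Optional[str]], Trajectory],
    hint: str,
    group_size: int = 4,
    max_retries: int = 8,
) -> List[Trajectory]:
    """Hint-guided rejection sampling (assumed form).

    Samples a group without any hint first. Only if no trajectory in the
    group earns a positive reward does it draw hinted samples, rejecting
    them until one is reward-positive or the retry budget runs out.
    """
    group = [sample(None) for _ in range(group_size)]
    if any(t.reward > 0 for t in group):
        return group  # exploration succeeded unaided; no hint needed
    for _ in range(max_retries):
        candidate = sample(hint)
        if candidate.reward > 0:
            group[0] = candidate  # swap in one reward-positive trajectory
            break
    return group
```

In this sketch the gate and the hint fallback are independent: the first decides which trajectories receive dense teacher supervision, the second decides when the sampler is allowed to lean on teacher knowledge so that the RL update sees at least one reward-positive trajectory. How KEPO actually combines them is not specified in the abstract.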
