2603.24093v1 Mar 25, 2026 cs.LG

Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization

Jian Yang

Citations: 30

h-index: 3

Chuan Hao

Citations: 49

h-index: 3

Ran Tao

Citations: 19

h-index: 3

Ming Yang

Citations: 34

h-index: 2

Fei Bai

Citations: 108

h-index: 4

Zhipeng Chen

Citations: 1,204

h-index: 15

Bryan Dai

Citations: 250

h-index: 4

Wayne Xin Zhao

Citations: 276

h-index: 4

Hongteng Xu

Citations: 64

h-index: 4

Original Abstract

Recently, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training remains only a rough approximation of human learning. Human learners leverage both external and internal experience to guide exploration and gradually internalize useful trajectories into stable knowledge. Motivated by this gap, we ask: how can LLMs better utilize and internalize experience during RLVR training? To answer this question, we propose Dual Guidance Optimization (DGO), a unified framework that leverages external and internal experience to improve training effectiveness. Specifically, DGO first constructs an experience bank from previously explored trajectories. The policy then performs exploration under the joint guidance of the experience bank and the model's internal knowledge. The resulting trajectories are further used to refine the experience bank and optimize model parameters, forming a closed loop of experience utilization and internalization. Experiments show that DGO consistently outperforms baseline methods, suggesting that better utilization and internalization of experience lead to more effective reasoning.
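
The closed loop described in the abstract (build an experience bank, explore under joint guidance, then refine the bank and update parameters) can be illustrated with a toy sketch. This is not the paper's implementation: the abstract gives no algorithmic details, so the names `ExperienceBank`, `ToyPolicy`, `verify`, and `train`, the tabular "policy", the `hint_prob` mixing rule, and the modular-arithmetic task are all hypothetical stand-ins chosen only to make the loop concrete.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ExperienceBank:
    """Hypothetical external experience: task -> last answer that earned reward."""
    entries: dict = field(default_factory=dict)

    def retrieve(self, task):
        return self.entries.get(task)

    def refine(self, task, answer, reward):
        # Keep only trajectories that the verifier accepted.
        if reward > 0:
            self.entries[task] = answer


@dataclass
class ToyPolicy:
    """Tabular stand-in for an LLM policy (assumption, not the paper's model)."""
    n_answers: int
    weights: dict = field(default_factory=dict)

    def _w(self, task):
        return self.weights.setdefault(task, [1.0] * self.n_answers)

    def sample(self, task, hint, rng, hint_prob=0.5):
        # External guidance: follow the bank's hint with probability hint_prob.
        if hint is not None and rng.random() < hint_prob:
            return hint
        # Internal guidance: sample from the policy's own preferences.
        return rng.choices(range(self.n_answers), weights=self._w(task))[0]

    def update(self, task, answer, reward, lr=1.0):
        # Internalization: rewarded answers gain weight in the parameters.
        self._w(task)[answer] += lr * reward


def verify(task, answer):
    """Hypothetical verifiable reward: 1.0 if the answer equals task mod 5."""
    return 1.0 if answer == task % 5 else 0.0


def train(tasks, steps, seed=0):
    rng = random.Random(seed)
    bank, policy = ExperienceBank(), ToyPolicy(n_answers=5)
    for _ in range(steps):
        for task in tasks:
            hint = bank.retrieve(task)               # utilize experience
            answer = policy.sample(task, hint, rng)  # guided exploration
            reward = verify(task, answer)
            bank.refine(task, answer, reward)        # refine the bank
            policy.update(task, answer, reward)      # internalize
    return bank, policy


if __name__ == "__main__":
    bank, policy = train(tasks=list(range(10)), steps=200)
    print(bank.entries)
```

In this toy setting the bank and the policy weights reinforce each other: once a correct answer is found, the bank replays it and each replay further raises the corresponding weight. How the actual method combines the two sources of guidance is not specified in the abstract.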

2 Citations

0 Influential

7.5 Altmetric

39.5 Score
