2602.10224v1 Feb 10, 2026 cs.LG

Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

Shiting Huang (Citations: 52; h-index: 5)

Zecheng Li (Citations: 121; h-index: 4)

Yu Zeng (Citations: 172; h-index: 8)

Qingnan Ren (Citations: 5; h-index: 1)

Zhen Fang (Citations: 61; h-index: 5)

Qisheng Su (Citations: 41; h-index: 3)

K. Shi (Citations: 2; h-index: 1)

Lin Chen (Citations: 2,336; h-index: 8)

Zehui Chen (Citations: 1,770; h-index: 9)

Feng Zhao (Citations: 565; h-index: 8)

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: it lacks mechanisms for error attribution and experience internalization intrinsic to the human learning cycle beyond practice and verification, thereby limiting fine-grained credit assignment and reusable knowledge formation. We term such reusable knowledge representations derived from past errors as meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM's self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is further internalized into the LLM's parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding Pass@1 gains of 3.92% to 4.73% across varying model sizes.
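
The abstract names two mechanical ingredients: pairing correct with incorrect trajectories for contrastive analysis, and internalizing the resulting meta-experience by minimizing a negative log-likelihood. The following is a minimal sketch of those two steps only, not the authors' implementation. The function names, the `(text, is_correct)` rollout format, and the choice of a mean per-token loss are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import List, Sequence, Tuple


def pair_trajectories(
    rollouts: Sequence[Tuple[str, bool]],
) -> List[Tuple[str, str]]:
    """Pair every correct rollout with every incorrect one.

    `rollouts` holds (trajectory_text, is_correct) tuples for a single
    prompt. This format is hypothetical, chosen only for the sketch.
    """
    correct = [text for text, ok in rollouts if ok]
    incorrect = [text for text, ok in rollouts if not ok]
    return [(c, w) for c in correct for w in incorrect]


def negative_log_likelihood(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of a token sequence.

    `token_probs[i]` is the probability the model assigns to the i-th
    target token of a meta-experience text. Averaging over tokens is an
    assumption; a summed loss would differ only by a length factor.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


if __name__ == "__main__":
    rollouts = [("path A", True), ("path B", False), ("path C", False)]
    print(pair_trajectories(rollouts))
    print(negative_log_likelihood([0.5, 0.5]))
```

In a real training loop the per-token probabilities would come from the language model itself, and the loss would be minimized by gradient descent alongside the RLVR objective; how the two are combined is not stated in the abstract.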

Citations: 1

Influential: 0

Altmetric: 4.5

Score: 23.5
