2605.02469v1 May 04, 2026 cs.LG

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

Hui Xiong (Citations: 2, h-index: 1)

Shuang Qiu (Citations: 1, h-index: 1)

Chenxing Wei (Citations: 59, h-index: 4)

Yao Shu (Citations: 29, h-index: 3)

Hongbin Lin (Citations: 51, h-index: 2)

Abstract

Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight $\exp(r(x,y)/β)/Z(x)$. BOLT, a Boltzmann-Targeted SFT procedure, is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price $β\log(1/π^*(S_N\mid x))$ from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. This decomposition explains why extra SFT epochs cannot repair missing reference-policy coverage and exposes the temperature-coverage-variance frontier. When coverage needs adaptive sampling, refreshed Boltzmann projections become KL policy mirror descent; finite inner solves enter as additive drift from the exact mirror step. Single-run Qwen experiments provide projection evidence for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings, within the stated single-run scope.
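The quantities named in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the paper's BOLT implementation: it assumes a single prompt with a small finite set of candidate outputs, so that the Boltzmann target $π^*(y\mid x) ∝ π_{\text{ref}}(y\mid x)\exp(r(x,y)/β)$ and the stored-support price $β\log(1/π^*(S_N\mid x))$ can be computed exactly. All function names are hypothetical, and estimating $Z(x)$ by self-normalizing over the stored samples is an assumption made here for illustration; it matches the prompt-normalized weight only up to the per-prompt scaling the abstract mentions.

```python
import math
import random


def boltzmann_target(pi_ref, rewards, beta):
    """Exact target pi*(y|x), proportional to pi_ref(y|x) * exp(r(x,y)/beta)."""
    m = max(rewards)  # subtract the max reward for numerical stability
    tilted = [p * math.exp((r - m) / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(tilted)
    return [t / z for t in tilted]


def self_normalized_weights(sample_rewards, beta):
    """Weights exp(r_i/beta) / sum_j exp(r_j/beta) for samples drawn from pi_ref.

    Assumption: Z(x) is estimated from the same stored samples.
    """
    m = max(sample_rewards)
    w = [math.exp((r - m) / beta) for r in sample_rewards]
    total = sum(w)
    return [wi / total for wi in w]


def effective_sample_size(weights):
    """Kish effective sample size for weights that sum to one."""
    return 1.0 / sum(w * w for w in weights)


def stored_support_price(pi_star, stored, beta):
    """beta * log(1 / pi*(S_N | x)), where S_N is the set of stored outputs."""
    mass = sum(pi_star[y] for y in set(stored))
    return beta * math.log(1.0 / mass)


# Toy demo: three candidate answers for one prompt, binary verifier reward.
pi_ref = [0.5, 0.3, 0.2]    # hypothetical reference probabilities
rewards = [0.0, 0.0, 1.0]   # only the third answer passes the verifier
beta = 0.5

random.seed(0)
stored = random.choices(range(3), weights=pi_ref, k=4)  # N = 4 stored rollouts
pi_star = boltzmann_target(pi_ref, rewards, beta)
weights = self_normalized_weights([rewards[y] for y in stored], beta)

print("stored rollouts:", stored)
print("weights:", weights)
print("ESS:", effective_sample_size(weights))
print("stored-support price:", stored_support_price(pi_star, stored, beta))
```

In this sketch the price depends only on which outputs were stored, not on how long the weighted SFT step is trained, which is consistent with the abstract's point that extra epochs cannot repair missing reference-policy coverage. Lowering `beta` concentrates the weights on high-reward samples, which tends to reduce the effective sample size.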

