2604.28056v1 Apr 30, 2026 cs.AI

RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses

Fei Wu

Hui Li

Zhuochen Wang

Yican Dai

Xuhui Zheng

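Large language models (LLMs) make reward design in reinforcement learning far more scalable, but a generated reward does not automatically become a reliable training objective. Prior work has focused mainly on generating, evolving, or selecting reward candidates, and has given little consideration to when those candidates can be verified and deployed during policy optimization. This work studies that deployment-time problem by treating generated rewards as reward hypotheses whose usefulness depends on the current policy's competence and the phase of training. It proposes RHyVE, a competence-aware verification and phase-aware deployment protocol. RHyVE uses short-horizon fork verification to compare a small set of reward hypotheses from a shared policy checkpoint. In the experiments, reward rankings are unreliable at low competence but become informative beyond a task-dependent threshold. On a sparse manipulation task, phase-aware deployment improves peak and retained performance under a locked protocol. Updated experiments with LLM-generated reward candidates show behavior that depends on the candidate family: in generated pools the winning candidate can change with the training phase, but no fixed warm-up schedule is universally optimal. Held-out schedule selection, conservative selector baselines, compute-matched controls, and scale controls indicate that RHyVE is best understood as a verification-informed deployment protocol rather than a universal scheduler. Dense and all-failure boundary experiments delimit the method's scope. Taken together, the results suggest that reward generation and reward deployment should be studied as coupled problems: generated rewards must be verified and deployed as policy competence changes.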

Original Abstract

Large language models (LLMs) make reward design in reinforcement learning substantially more scalable, but generated rewards are not automatically reliable training objectives. Existing work has focused primarily on generating, evolving, or selecting reward candidates, while paying less attention to when such candidates can be verified and deployed during policy optimization. We study this deployment-time problem by treating generated rewards as reward hypotheses whose utility depends on the competence of the current policy and the phase of training. We propose RHyVE, a competence-aware verification and phase-aware deployment protocol that compares small sets of reward hypotheses from shared policy checkpoints using short-horizon fork verification. Our experiments show that reward rankings are unreliable at low competence but become informative after task-dependent thresholds. On a sparse manipulation task, phase-aware deployment improves peak and retained performance under a locked protocol. Updated LLM-generated reward-candidate experiments show candidate-family-dependent behavior: generated pools can exhibit phase-dependent winner changes, but no fixed warm-up schedule is universally optimal. Held-out schedule selection, conservative selector baselines, compute-matched controls, and scale controls further show that RHyVE is best understood as a verification-informed deployment protocol rather than a universal scheduler. Dense and all-failure boundary experiments delimit the scope of the method. Together, these results suggest that reward generation and reward deployment should be studied as coupled problems: generated rewards must be verified and deployed under changing policy competence.
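
The abstract describes short-horizon fork verification only at a high level, so the following is a toy sketch of the general idea rather than the authors' implementation. It assumes a hypothetical `ToyPolicy` whose competence is a single scalar `skill`, models each reward hypothesis as a hypothetical per-step skill gain, and scores each fork by its final skill. The names `ToyPolicy`, `train_steps`, `fork_verify`, `HYPOTHESES`, and the threshold value are all illustrative assumptions, not taken from the paper.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for an RL policy: one scalar 'skill' in [0, 1]."""
    skill: float


def train_steps(policy: ToyPolicy, gain: Callable[[float], float], steps: int) -> ToyPolicy:
    """Toy update rule (assumption): each step adds gain(skill), clipped to [0, 1]."""
    for _ in range(steps):
        policy.skill = min(1.0, max(0.0, policy.skill + gain(policy.skill)))
    return policy


def fork_verify(
    checkpoint: ToyPolicy,
    hypotheses: Dict[str, Callable[[float], float]],
    horizon: int,
    competence_threshold: float,
) -> Optional[str]:
    """Compare reward hypotheses by short training forks from one shared checkpoint.

    Returns None when the checkpoint is below the competence threshold, which
    mirrors the abstract's claim that rankings are unreliable at low competence.
    """
    if checkpoint.skill < competence_threshold:
        return None
    scores = {}
    for name, gain in hypotheses.items():
        fork = copy.deepcopy(checkpoint)  # every fork starts from the same state
        scores[name] = train_steps(fork, gain, horizon).skill
    return max(scores, key=scores.get)


# Two illustrative hypotheses: a constant shaped gain, and a gain that grows
# with skill (a crude proxy for a sparse reward that only helps once competent).
HYPOTHESES = {
    "shaped": lambda s: 0.02,
    "sparse": lambda s: 0.05 * s,
}

if __name__ == "__main__":
    print(fork_verify(ToyPolicy(0.1), HYPOTHESES, horizon=5, competence_threshold=0.3))
    print(fork_verify(ToyPolicy(0.6), HYPOTHESES, horizon=5, competence_threshold=0.3))
```

In this toy setup the gate declines to rank at low skill, and the preferred hypothesis differs between an early and a later checkpoint, which is the kind of phase-dependent winner change the abstract mentions. How the paper actually measures competence, scores forks, or sets thresholds is not stated in the abstract.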
