2604.11037v1 Apr 13, 2026 cs.LG

RTMC: Step-Level Credit Assignment via Rollout Trees

Suhang Zheng

Xiaoxiao Xu

Tao Wang


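Multi-step agentic reinforcement learning can benefit from fine-grained credit assignment, but existing methods offer limited options. Critic-free methods such as GRPO assign the same advantage to every action in a trajectory, whereas learned value networks incur substantial overhead and can be unstable in sparse-reward settings. We observe that group rollouts targeting the same problem often pass through overlapping intermediate states, implicitly forming a tree that branches at successive decision points. Building on this insight, we propose Rollout-Tree Monte Carlo (RTMC) advantage estimation, which produces per-step Q-values and advantages without a learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, enabling state matching across rollouts. On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.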

Original Abstract

Multi-step agentic reinforcement learning benefits from fine-grained credit assignment, yet existing approaches offer limited options: critic-free methods like GRPO assign a uniform advantage to every action in a trajectory, while learned value networks introduce notable overhead and can be fragile under sparse rewards. We observe that group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points. Building on this insight, we introduce Rollout-Tree Monte Carlo (RTMC) advantage estimation, which aggregates return statistics across rollouts sharing a common state to produce per-step Q-values and advantages, without any learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, making cross-rollout state matching tractable. On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.
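
The abstract does not give the exact estimator, so the following is only a minimal sketch of the idea it describes, under stated assumptions. It contrasts a GRPO-style advantage (one scalar per rollout, shared by every step) with a per-step advantage obtained by pooling returns across rollouts that share a state. The `signature` function is a hypothetical stand-in for the paper's state-action signature system (here simply the tuple of actions taken so far), and the choices Q(s, a) = mean return of rollouts through (s, a), V(s) = mean return of rollouts through s, and A = Q - V are assumptions for illustration, not the authors' confirmed formulas.

```python
from collections import defaultdict
from statistics import mean, pstdev


def signature(history):
    """Hypothetical state signature: the tuple of actions taken so far.
    The paper's signature system is not specified in the abstract."""
    return tuple(history)


def grpo_advantages(returns, eps=1e-8):
    """GRPO-style group-normalized advantage: one value per rollout,
    applied uniformly to all of its steps. Population std and eps are
    illustrative choices."""
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def rollout_tree_advantages(rollouts):
    """rollouts: list of (actions, terminal_return) pairs for one problem.
    Returns one list of per-step advantages per rollout (assumed A = Q - V)."""
    state_returns = defaultdict(list)  # state -> returns of rollouts visiting it
    edge_returns = defaultdict(list)   # (state, action) -> returns through that edge
    for actions, ret in rollouts:
        for t, action in enumerate(actions):
            state = signature(actions[:t])
            state_returns[state].append(ret)
            edge_returns[(state, action)].append(ret)

    advantages = []
    for actions, _ in rollouts:
        per_step = []
        for t, action in enumerate(actions):
            state = signature(actions[:t])
            q = mean(edge_returns[(state, action)])  # Monte Carlo Q estimate
            v = mean(state_returns[state])           # Monte Carlo V estimate
            per_step.append(q - v)
        advantages.append(per_step)
    return advantages


# Toy group of four rollouts; only the first one succeeds.
rollouts = [
    (["a", "b"], 1.0),
    (["a", "c"], 0.0),
    (["d", "b"], 0.0),
    (["d", "c"], 0.0),
]
print(rollout_tree_advantages(rollouts))
# [[0.25, 0.5], [0.25, -0.5], [-0.25, 0.0], [-0.25, 0.0]]
```

In this toy group, the second rollout fails but still receives a positive advantage for its first action `a`, because that action is shared with the successful rollout. A uniform per-rollout advantage cannot express that distinction, since `grpo_advantages([1.0, 0.0, 0.0, 0.0])` gives the three failing rollouts the same negative value at every step.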
