2601.03703v1 Jan 07, 2026 cs.LG

TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL

Lang Cao (citations: 93, h-index: 5)

Haonan Song (citations: 8, h-index: 1)

Hui Ruan (citations: 72, h-index: 4)

Yongqiang Li (citations: 156, h-index: 6)

Peng Chao (citations: 0, h-index: 0)

Wu Ning (citations: 0, h-index: 0)

Renhong Chen (citations: 26, h-index: 2)

Yitong Li (citations: 14, h-index: 2)

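Reinforcement learning with group-based objectives, such as Group Relative Policy Optimization (GRPO), is a widely used framework for aligning large language models on complex reasoning tasks. However, standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to every token, which reduces sample efficiency and tends to favor verbose, redundant chains of thought without improving logical depth. This paper introduces TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which explicitly exploits the tree structure of group rollouts for both exploration and advantage assignment. Specifically, TreeAdv uses an entropy-based sampling method to build a group of trees (a forest), in which each tree branches at high-uncertainty decision points and shares low-uncertainty tokens across rollouts. TreeAdv then redistributes the advantages of the complete rollouts (all leaf nodes) and aggregates them into token-level advantages for the internal tree segments. TreeAdv can be readily applied to group-based objectives such as GRPO or GSPO. On 10 mathematical reasoning benchmarks, TreeAdv consistently outperformed GRPO and GSPO while generating substantially fewer tokens under identical supervision, data, and decoding budgets.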

Original Abstract

Reinforcement learning with group-based objectives, such as Group Relative Policy Optimization (GRPO), is a common framework for aligning large language models on complex reasoning tasks. However, standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens, which leads to sample inefficiency and a length bias toward verbose, redundant chains of thought without improving logical depth. We introduce TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which makes the tree structure of group rollouts explicit for both exploration and advantage assignment. Specifically, TreeAdv builds a group of trees (a forest) based on an entropy-driven sampling method where each tree branches at high-uncertainty decisions while sharing low-uncertainty tokens across rollouts. Then, TreeAdv aggregates token-level advantages for internal tree segments by redistributing the advantages of complete rollouts (all leaf nodes), and TreeAdv can easily apply to group-based objectives such as GRPO or GSPO. Across 10 math reasoning benchmarks, TreeAdv consistently outperforms GRPO and GSPO, while using substantially fewer generated tokens under identical supervision, data, and decoding budgets.
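The abstract describes three ingredients: branching at high-uncertainty tokens, group-relative advantages on complete rollouts, and redistribution of those leaf advantages onto shared internal segments. The following is a minimal Python sketch of those ideas, not the authors' implementation. The abstract does not state the branching criterion or the aggregation rule, so the fixed entropy threshold in `should_branch` and the plain mean over descendant leaves in `redistribute` are assumptions made for illustration, and all names (`Segment`, `redistribute`, and so on) are hypothetical.

```python
import math
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List, Optional


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(probs: List[float], threshold: float = 1.0) -> bool:
    """Assumed rule: branch when entropy exceeds a fixed threshold."""
    return token_entropy(probs) > threshold


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: each reward standardized within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


@dataclass
class Segment:
    """A run of tokens shared by every rollout that passes through this node."""
    tokens: List[str]
    children: List["Segment"] = field(default_factory=list)
    leaf_index: Optional[int] = None  # set only on leaves (complete rollouts)
    advantage: float = 0.0            # applied to every token in the segment


def redistribute(node: Segment, leaf_adv: List[float]) -> List[float]:
    """Give each segment the mean advantage of the leaves beneath it.

    Returns the leaf advantages under `node` so parents can aggregate them.
    The unweighted mean is an assumption, not the paper's stated rule.
    """
    if not node.children:
        below = [leaf_adv[node.leaf_index]]
    else:
        below = [a for child in node.children for a in redistribute(child, leaf_adv)]
    node.advantage = sum(below) / len(below)
    return below


# Toy tree: one shared prefix, then a branch into rollout 0 and a sub-branch
# that splits again into rollouts 1 and 2. Only rollout 0 is rewarded.
leaf_a = Segment(tokens=["so", "x", "=", "4"], leaf_index=0)
leaf_b1 = Segment(tokens=["=", "5"], leaf_index=1)
leaf_b2 = Segment(tokens=["=", "6"], leaf_index=2)
branch_b = Segment(tokens=["hence", "x"], children=[leaf_b1, leaf_b2])
root = Segment(tokens=["Let", "x", "be"], children=[leaf_a, branch_b])

leaf_advantages = group_relative_advantages([1.0, 0.0, 0.0])
redistribute(root, leaf_advantages)
```

In this toy tree the shared prefix receives an advantage close to zero, because standardized advantages within one group sum to zero, while the segment leading only to unrewarded rollouts receives a negative advantage. This contrasts with the flat assignment described in the abstract, where every token of a rollout, including the shared prefix, would carry that rollout's sequence-level advantage.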
