2604.07165v1 Apr 08, 2026 cs.AI

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Sizhe Tang

Citations: 11

h-index: 2

Tian Lan

Citations: 10

h-index: 2

Yu Li

Citations: 8

h-index: 2

Original Abstract

Reinforcement learning for Large Language Model agents is often hindered by sparse rewards in multi-step reasoning tasks. Existing approaches such as Group Relative Policy Optimization treat sampled trajectories as independent chains, assigning uniform credit to every step in a chain and ignoring critical steps that may disproportionately affect the reasoning outcome. In this paper, we propose T-STAR (Tree-structured Self-Taught Agent Rectification), a framework that recovers the latent correlated reward structure across seemingly independent trajectories. Specifically, we consolidate trajectories into a unified Cognitive Tree by identifying and merging functionally similar steps/nodes. This enables an Introspective Valuation mechanism that back-propagates trajectory-level rewards through the tree to obtain a variance-reduced relative advantage at the step level. Using the Cognitive Tree, we also develop In-Context Thought Grafting, which synthesizes corrective reasoning by contrasting successful and failed branches at critical divergence points/steps. Our proposed Surgical Policy Optimization then capitalizes on the rich policy gradient information concentrated at these critical points/steps through a Bradley-Terry-type surgical loss. Extensive experiments across embodied, interactive, reasoning, and planning benchmarks demonstrate that T-STAR achieves consistent improvements over strong baselines, with gains most pronounced on tasks requiring extended reasoning chains.
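
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that steps are merged when their text prefixes match exactly (a stand-in for the unspecified "functionally similar" test), that a node's value is the mean reward of the trajectories passing through it, and that a step's advantage is the value change it causes. The function names, the `beta` parameter, and the toy trajectories are all hypothetical.

```python
import math
from collections import defaultdict


def build_tree(trajectories):
    """Merge trajectories that share a step prefix into one tree.

    Each trajectory is (steps, reward). A node is keyed by its step prefix.
    ASSUMPTION: exact prefix equality replaces the paper's similarity test.
    """
    rewards = defaultdict(list)
    for steps, reward in trajectories:
        for depth in range(len(steps) + 1):
            rewards[tuple(steps[:depth])].append(reward)
    return rewards


def node_values(rewards):
    """Back up trajectory rewards: node value = mean reward through the node."""
    return {node: sum(rs) / len(rs) for node, rs in rewards.items()}


def step_advantages(steps, values):
    """ASSUMPTION: advantage of a step = value of its node minus its parent's."""
    return [
        values[tuple(steps[: i + 1])] - values[tuple(steps[:i])]
        for i in range(len(steps))
    ]


def bradley_terry_loss(logp_good, logp_bad, beta=1.0):
    """Generic Bradley-Terry pairwise loss: -log sigmoid(beta * margin).

    Not numerically hardened: a very negative margin overflows math.exp.
    """
    margin = beta * (logp_good - logp_bad)
    return math.log1p(math.exp(-margin))


# Toy group of four sampled trajectories with sparse terminal rewards.
trajectories = [
    (["look", "open drawer", "take key"], 1.0),
    (["look", "open drawer", "close drawer"], 0.0),
    (["look", "go north", "wait"], 0.0),
    (["look", "go north", "go south"], 0.0),
]
values = node_values(build_tree(trajectories))
print(step_advantages(trajectories[0][0], values))  # [0.0, 0.25, 0.5]
```

In this toy group, the shared first step receives zero advantage, while the two steps where the successful trajectory diverges from the failed ones receive positive credit. A chain-level scheme would instead assign the same credit to all three steps.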
