2605.06078v1 May 07, 2026 cs.CL

Milestone-Guided Policy Learning for Long-Horizon Language Agents

Yuchen Yan, Zhejiang University (Citations: 490, h-index: 13)
Yueting Zhuang (Citations: 610, h-index: 14)
Yongliang Shen (Citations: 397, h-index: 10)
Dingming Li (Citations: 111, h-index: 3)
Weiming Lu (Citations: 128, h-index: 5)
Zixuan Wang (Citations: 112, h-index: 3)
Hongxing Li (Citations: 145, h-index: 4)
Tengteng Pan (Citations: 3, h-index: 1)
Jun Xiao (Citations: 49, h-index: 4)
Ruiqing Zhang (Citations: 250, h-index: 7)

Original Abstract

While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at https://github.com/ZJU-REAL/BEACON.
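
The abstract names three ingredients (partitioning at milestone boundaries, temporal reward shaping within segments, and advantages at two scales) but gives no formulas. The toy sketch below only illustrates how such pieces could fit together; it is not BEACON's actual method. Every function name, the discount-style shaping `gamma ** (last - t)`, the additive combination, and the `local_weight` parameter are assumptions made for illustration. The trajectory-level term uses the group-normalized outcome familiar from GRPO-style training.

```python
from statistics import mean, pstdev

def split_at_milestones(num_steps, milestone_steps):
    """Cut steps 0..num_steps-1 into segments that each end at a milestone step.

    Returns (first_step, last_step, milestone_reached) tuples. Hypothetical helper.
    """
    segments, start = [], 0
    for m in sorted(milestone_steps):
        segments.append((start, m, True))
        start = m + 1
    if start < num_steps:
        # Trailing steps that never reached another milestone.
        segments.append((start, num_steps - 1, False))
    return segments

def shaped_rewards(num_steps, milestone_steps, gamma=0.9):
    """Assumed shaping: steps closer to the milestone that ends their segment get more credit."""
    rewards = [0.0] * num_steps
    for first, last, reached in split_at_milestones(num_steps, milestone_steps):
        if reached:
            for t in range(first, last + 1):
                rewards[t] = gamma ** (last - t)
    return rewards

def group_normalize(values, eps=1e-8):
    """Group-normalized scores: subtract the group mean, divide by the group std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def dual_scale_advantages(group, gamma=0.9, local_weight=1.0):
    """Combine a trajectory-level signal with a segment-level one (assumed additive form).

    `group` is a list of dicts with keys 'num_steps', 'milestones', 'success'.
    """
    global_adv = group_normalize([float(tr["success"]) for tr in group])
    per_step = []
    for tr, g in zip(group, global_adv):
        local = shaped_rewards(tr["num_steps"], tr["milestones"], gamma)
        per_step.append([g + local_weight * r for r in local])
    return per_step

if __name__ == "__main__":
    group = [
        {"num_steps": 4, "milestones": [1, 3], "success": True},
        {"num_steps": 4, "milestones": [1], "success": False},
    ]
    for tr, adv in zip(group, dual_scale_advantages(group, gamma=0.5)):
        print(tr["success"], [round(a, 3) for a in adv])
```

In this toy group, the failed trajectory's steps leading up to its one reached milestone receive a less negative value than its steps after that milestone, which is the qualitative behavior the abstract describes as keeping distant failures from corrupting the evaluation of local actions.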

1 Citation

0 Influential

43.290482690107 Altmetric

217.5 Score
