2603.28716v1 Mar 30, 2026 cs.AI

Dynamic Dual-Granularity Skill Bank for Agentic Reinforcement Learning

Dynamic Dual-Granularity Skill Bank for Agentic RL

Dongbin Zhao (Citations: 103, h-index: 5)
Dong Li (Citations: 175, h-index: 5)
Songjun Tu (Citations: 256, h-index: 9)
Chengdong Xu (Citations: 26, h-index: 3)
Qichao Zhang (Citations: 2,837, h-index: 27)
Yaocheng Zhang (Citations: 72, h-index: 4)
Xiangyuan Lan (Citations: 126, h-index: 5)
Linjing Li (Citations: 117, h-index: 4)

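Agentic reinforcement learning (RL) can gain a great deal from reusable experience, but existing skill-based methods mostly extract trajectory-level guidance and often lack a principled mechanism for maintaining a skill memory that evolves over time. This paper proposes D2Skill, a dynamic dual-granularity skill bank for agentic RL. D2Skill organizes reusable experience into task skills, which provide high-level guidance, and step skills, which provide fine-grained decision support and error correction. It trains the policy and the skill bank jointly by running paired baseline and skill-injected rollouts under the same policy, and it uses the performance gap between the two to derive hindsight utility signals for both skill updating and policy optimization. The skill bank is built solely from experience gathered during training, is expanded through continual reflection, and is maintained through utility-aware retrieval and pruning. In experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507, D2Skill consistently improves success rates over skill-free baselines by 10-20 percentage points. Further analysis shows that both dual-granularity skill modeling and dynamic skill maintenance are important to these gains, and that the learned skills have higher utility, transfer across evaluation settings, and add only modest training overhead.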

Original Abstract

Agentic reinforcement learning (RL) can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show that D2Skill consistently improves success rates over skill-free baselines by 10-20 points. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.
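
The abstract describes the skill-bank loop only at a high level, so the following Python block is a minimal sketch under stated assumptions, not the authors' implementation. It illustrates how a performance gap between a paired baseline rollout and a skill-injected rollout could be turned into a per-skill utility score, and how that score could drive retrieval and pruning. The names `Skill`, `SkillBank`, `token_overlap`, the exponential-moving-average update, the token-overlap relevance measure, and the parameters `capacity`, `lr`, and `utility_weight` are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    text: str
    granularity: str      # "task" (high-level) or "step" (fine-grained)
    utility: float = 0.0  # running estimate of how much the skill helps
    uses: int = 0


def token_overlap(query: str, text: str) -> float:
    """Jaccard overlap of lowercase tokens; an assumed stand-in for a real retriever."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


class SkillBank:
    def __init__(self, capacity: int = 3, lr: float = 0.5, utility_weight: float = 0.5):
        self.skills = []
        self.capacity = capacity              # assumed maximum bank size
        self.lr = lr                          # assumed step size of the utility update
        self.utility_weight = utility_weight  # assumed relevance/utility trade-off

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)

    def update_utility(self, skill: Skill, baseline_return: float,
                       injected_return: float) -> float:
        """Move the skill's utility toward the paired-rollout performance gap."""
        gap = injected_return - baseline_return
        skill.utility += self.lr * (gap - skill.utility)  # assumed moving average
        skill.uses += 1
        return gap

    def retrieve(self, query: str, granularity: str, k: int = 1) -> list:
        """Rank skills of one granularity by relevance plus weighted utility."""
        candidates = [s for s in self.skills if s.granularity == granularity]
        candidates.sort(
            key=lambda s: token_overlap(query, s.text) + self.utility_weight * s.utility,
            reverse=True,
        )
        return candidates[:k]

    def prune(self) -> list:
        """Keep the `capacity` highest-utility skills and return the removed ones."""
        self.skills.sort(key=lambda s: s.utility, reverse=True)
        removed = self.skills[self.capacity:]
        self.skills = self.skills[:self.capacity]
        return removed


bank = SkillBank(capacity=2)
look = Skill("check the countertop before opening drawers", "task")
retry = Skill("if a click fails, go back and search again", "step")
bank.add(look)
bank.add(retry)

# One paired comparison per skill: same policy, with and without the skill.
bank.update_utility(look, baseline_return=0.0, injected_return=1.0)
bank.update_utility(retry, baseline_return=1.0, injected_return=0.0)
```

In this sketch a skill whose injected rollout beats the baseline gains utility and one that hurts performance loses it, so `prune` tends to discard skills that did not help. How the same gap would feed into policy optimization is left out, since the abstract does not specify it.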

17 Citations

4 Influential

13.5 Altmetric

92.5 Score

