2602.17931v1 Feb 20, 2026 cs.LG

Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning

Abstract

In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM calls raises concerns about scalability and reliability. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts. From this graph, we derive a utility function that evaluates how closely the agent's trajectories align with prior successful strategies. This utility shapes the advantage function, providing the critic with additional guidance without altering the reward. Our method relies primarily on offline input and only occasional online queries, avoiding dependence on continuous LLM supervision. Preliminary experiments in benchmark environments show improved sample efficiency and faster early learning compared to baseline RL methods, with final returns comparable to methods that require frequent LLM interaction.
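The abstract does not give the form of the utility function or how it enters the advantage, so the following is only a minimal sketch under stated assumptions. It assumes that a stored strategy is an ordered list of subgoal labels, that utility is the best fraction of a stored strategy matched in order by the agent's visited subgoals, and that shaping is additive with a weight `beta`. The names `MemoryGraph`, `utility`, `shaped_advantages`, and `beta` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryGraph:
    """Hypothetical stand-in for the memory graph: a flat store of
    successful strategies, each an ordered tuple of subgoal labels."""
    strategies: list = field(default_factory=list)

    def add(self, subgoals):
        # Record one successful strategy (from LLM guidance or a rollout).
        self.strategies.append(tuple(subgoals))

    def utility(self, visited):
        """Assumed utility in [0, 1]: the best fraction of any stored
        strategy whose subgoals appear, in order, within `visited`."""
        best = 0.0
        for strategy in self.strategies:
            if not strategy:
                continue
            matched = 0
            for subgoal in visited:
                if matched < len(strategy) and subgoal == strategy[matched]:
                    matched += 1
            best = max(best, matched / len(strategy))
        return best


def shaped_advantages(rewards, values, utilities, gamma=0.99, beta=0.1):
    """One-step TD advantage plus an assumed additive utility bonus.

    The rewards are used unchanged; only the advantage estimate is shifted.
    `values[t]` is the critic's estimate at step t, and the value after the
    final step is taken to be 0 (episode end).
    """
    advantages = []
    for t, reward in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        td_error = reward + gamma * next_value - values[t]
        advantages.append(td_error + beta * utilities[t])
    return advantages


# Usage: one stored strategy, then shaping a short sparse-reward episode.
graph = MemoryGraph()
graph.add(["key", "door", "goal"])
partial = graph.utility(["key", "wall", "door"])
shaped = shaped_advantages(
    rewards=[0.0, 0.0, 1.0],
    values=[0.0, 0.0, 0.0],
    utilities=[0.0, 0.5, 1.0],
    gamma=1.0,
    beta=0.1,
)
```

Because the bonus is added to the advantage and not to the reward, the environment's reward signal stays as-is, which matches the abstract's description; whether the paper uses an additive form, a one-step estimator, or this particular matching rule is not stated there.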

