2601.00167v1 Jan 01, 2026 cs.LG

Online Finetuning Decision Transformers with Pure RL Gradients

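Decision Transformers (DTs) have emerged as a powerful framework for sequential decision making by formulating offline reinforcement learning (RL) as a sequence modeling problem. However, because existing approaches still rely heavily on supervised sequence-modeling objectives during online finetuning, applying DTs online with pure RL gradients has not yet been widely studied. In this work, we show that hindsight return relabeling, an important component of online DTs, is a major obstacle to RL-based finetuning: it is useful for supervised learning, but it is fundamentally incompatible with importance sampling-based RL algorithms such as GRPO and therefore causes unstable training. Building on this insight, we propose a new algorithm that enables online finetuning of Decision Transformers with pure RL gradients. We adapt GRPO to DTs and introduce several key modifications: sub-trajectory optimization for improved credit assignment, sequence-level likelihood objectives for better stability and efficiency, and active sampling to encourage exploration in uncertain regions. Through extensive experiments, we demonstrate that the proposed method outperforms existing online DT baselines and achieves new state-of-the-art performance on multiple benchmarks, showing that a pure-RL-based approach is effective for online finetuning of Decision Transformers.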

Original Abstract

Decision Transformers (DTs) have emerged as a powerful framework for sequential decision making by formulating offline reinforcement learning (RL) as a sequence modeling problem. However, extending DTs to online settings with pure RL gradients remains largely unexplored, as existing approaches continue to rely heavily on supervised sequence-modeling objectives during online finetuning. We identify hindsight return relabeling -- a standard component in online DTs -- as a critical obstacle to RL-based finetuning: while beneficial for supervised learning, it is fundamentally incompatible with importance sampling-based RL algorithms such as GRPO, leading to unstable training. Building on this insight, we propose new algorithms that enable online finetuning of Decision Transformers using pure reinforcement learning gradients. We adapt GRPO to DTs and introduce several key modifications, including sub-trajectory optimization for improved credit assignment, sequence-level likelihood objectives for enhanced stability and efficiency, and active sampling to encourage exploration in uncertain regions. Through extensive experiments, we demonstrate that our methods outperform existing online DT baselines and achieve new state-of-the-art performance across multiple benchmarks, highlighting the effectiveness of pure-RL-based online finetuning for Decision Transformers.
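To make two terms from the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names `hindsight_returns_to_go` and `group_relative_advantages` are hypothetical, and the use of the population standard deviation is an assumption. The first function shows what hindsight return relabeling computes (the returns-to-go actually achieved, which replace the target return the policy was conditioned on). The second shows the group-relative advantage commonly associated with GRPO, where each rollout's return is standardized within its group.

```python
from statistics import mean, pstdev


def hindsight_returns_to_go(rewards):
    """Return-to-go at each step, computed from the rewards actually collected.

    Hindsight relabeling swaps the target return used at rollout time
    for these achieved values before the trajectory is used for training.
    """
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style advantage: standardize each return within its group.

    Assumption: population standard deviation; eps guards against a
    zero-variance group.
    """
    mu, sigma = mean(returns), pstdev(returns)
    return [(g - mu) / (sigma + eps) for g in returns]


# Hypothetical data: one rollout's rewards, and returns from a group of rollouts.
relabeled = hindsight_returns_to_go([1.0, 0.0, 2.0])
advantages = group_relative_advantages([1.0, 3.0])
```

One plausible reading of the incompatibility the abstract describes: an importance ratio compares the current policy's action probability with the probability under which the action was sampled, and relabeling changes the return the policy is conditioned on after sampling, so the two probabilities no longer refer to the same conditioning. This is an interpretation, not a statement taken from the paper.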

1 Citations

0 Influential

2.5 Altmetric

13.5 Score
