2604.19730v1 Apr 21, 2026 cs.LG

FASTER: Value-Guided Sampling for Fast RL

Dorsa Sadigh

Perry Dong

Alex Swerdlow

Chelsea Finn

Abstract

Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .
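The filtering idea in the abstract can be illustrated with a toy sketch that compares plain best-of-N sampling against pruning candidates partway through denoising. Everything here is an assumption made for illustration: `denoise_step` and `denoising_value` are hypothetical stand-ins for a diffusion policy's denoiser and a learned value function over partially denoised actions, and the fixed `keep_frac` pruning rule replaces the learned filtering policy that the abstract describes. It is not the authors' implementation.

```python
import numpy as np


def denoise_step(actions, state, t, rng):
    """Hypothetical stand-in for one denoising step of a diffusion policy."""
    target = np.tanh(state)  # toy "clean" action for this state
    noise = 0.05 * rng.standard_normal(actions.shape)
    return actions + (target - actions) / (t + 1) + noise


def denoising_value(actions, state, t):
    """Hypothetical stand-in for a learned value of partially denoised actions."""
    return -np.sum((actions - np.tanh(state)) ** 2, axis=-1)


def best_of_n(state, n, steps, rng):
    """Baseline: fully denoise all n candidates, then pick the highest-valued one."""
    actions = rng.standard_normal((n, state.shape[0]))
    calls = 0  # number of per-candidate denoiser evaluations
    for t in reversed(range(steps)):
        actions = denoise_step(actions, state, t, rng)
        calls += len(actions)
    scores = denoising_value(actions, state, 0)
    return actions[np.argmax(scores)], calls


def filtered_best_of_n(state, n, steps, keep_frac, rng):
    """Prune low-valued candidates after every denoising step.

    Assumption: a fixed keep fraction is used here, whereas the abstract
    describes learning the filtering decisions as a policy in an MDP.
    """
    actions = rng.standard_normal((n, state.shape[0]))
    calls = 0
    for t in reversed(range(steps)):
        actions = denoise_step(actions, state, t, rng)
        calls += len(actions)
        k = max(1, int(np.ceil(len(actions) * keep_frac)))
        scores = denoising_value(actions, state, t)
        actions = actions[np.argsort(scores)[-k:]]  # keep the k best candidates
    scores = denoising_value(actions, state, 0)
    return actions[np.argmax(scores)], calls


rng = np.random.default_rng(0)
state = np.array([0.3, -0.2])
_, baseline_calls = best_of_n(state, n=8, steps=10, rng=rng)
_, filtered_calls = filtered_best_of_n(state, n=8, steps=10, keep_frac=0.5, rng=rng)
print(baseline_calls, filtered_calls)
```

With 8 candidates, 10 denoising steps, and half of the candidates dropped after each step, the baseline spends 80 denoiser evaluations while the filtered variant spends 21 (8 + 4 + 2, then 1 for each of the remaining 7 steps). The savings depend entirely on how early the value estimate can tell good candidates from bad ones, which is the part the paper proposes to learn.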
