2602.19313v1 Feb 22, 2026 cs.RO

TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

Shirui Chen

University of Washington, Seattle

Citations: 43

h-index: 4

Cole Harrison

Citations: 5

h-index: 1

Ying-Chun Lee

Citations: 6

h-index: 1

Lillian J. Ratliff

Citations: 12

h-index: 2

Dieter Fox

Citations: 822

h-index: 9

Ranjay Krishna

Citations: 2,879

h-index: 22

Jiafei Duan

University of Washington

Citations: 1,602

h-index: 14

Angela Yang

Citations: 18

h-index: 2

Zhongzheng Ren

Citations: 260

h-index: 4

Original Abstract

While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline, which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
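The abstract names two ingredients without spelling them out: a progress value read from token logits, and the VOC metric. The sketch below illustrates both under stated assumptions and is not the authors' implementation. It assumes the value for a video prefix is the softmax probability of a single token (the `affirmative_id` index and the toy logits are hypothetical), and that VOC is the Spearman rank correlation between predicted values and chronological frame order.

```python
import numpy as np


def token_probability(logits, token_id):
    """Softmax probability of one token from a 1-D vector of vocabulary logits."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return float(probs[token_id])


def average_ranks(x):
    """Ranks starting at 1; tied entries share their average rank."""
    x = np.asarray(x, dtype=np.float64)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=np.float64)
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def value_order_correlation(values):
    """Assumed VOC: Spearman correlation between values and frame order."""
    values = np.asarray(values, dtype=np.float64)
    time_ranks = np.arange(1, len(values) + 1, dtype=np.float64)
    value_ranks = average_ranks(values)
    if value_ranks.std() == 0:
        # Convention chosen for this sketch: constant predictions score 0.
        return 0.0
    return float(np.corrcoef(value_ranks, time_ranks)[0, 1])


# Toy example: one logit vector per video prefix over a 3-token vocabulary.
# Index 0 stands in for a hypothetical "task completed" token.
affirmative_id = 0
prefix_logits = [
    [-2.0, 1.0, 0.5],
    [-0.5, 1.0, 0.5],
    [1.0, 1.0, 0.5],
    [3.0, 1.0, 0.5],
]
values = [token_probability(l, affirmative_id) for l in prefix_logits]
voc = value_order_correlation(values)
```

In this toy trajectory the probability of the chosen token rises with every prefix, so the values are perfectly ordered in time. Reading a probability from the logits, rather than asking the model to print a number, avoids parsing generated text; that is the design point the abstract emphasizes.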

5 Citations

0 Influential

11 Altmetric

60.0 Score
