2603.01694v1 Mar 02, 2026 cs.CV

MVR: Multi-view Video Reward Shaping for Reinforcement Learning

Yaodong Yang

Citations: 10

h-index: 2

Lirui Luo

Citations: 23

h-index: 2

Guoxi Zhang

Citations: 28

h-index: 3

Hongming Xu

Citations: 26

h-index: 2

Cong Fang

Citations: 82

h-index: 3

Qing Li

Citations: 24

h-index: 2

Original Abstract

Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
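The abstract does not spell out the shaping formula, so the following is only a minimal sketch of the general idea, not MVR's actual formulation. It assumes two things that are not in the source: that per-view video-text similarities are aggregated by a plain average, and that the learned state relevance is used as a potential in classic potential-based reward shaping, a standard technique chosen because it leaves the optimal policy unchanged. The names `multi_view_score`, `shaped_reward`, and `phi` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def multi_view_score(view_embeddings, text_embedding):
    """Average video-text similarity over viewpoints.

    Assumption: a simple mean over views; the paper may aggregate differently.
    """
    scores = [cosine_similarity(v, text_embedding) for v in view_embeddings]
    return float(np.mean(scores))


def shaped_reward(task_reward, phi_s, phi_next, gamma=0.99):
    """Potential-based shaping: r + gamma * phi(s') - phi(s).

    Assumption: phi stands in for a learned state relevance value.
    This is the classic potential-based form, not MVR's own equation.
    """
    return task_reward + gamma * phi_next - phi_s


if __name__ == "__main__":
    text = [1.0, 0.0]
    views = [[1.0, 0.0], [0.0, 1.0]]  # one aligned view, one orthogonal view
    print(multi_view_score(views, text))
    print(shaped_reward(1.0, phi_s=0.2, phi_next=0.6, gamma=1.0))
```

In this sketch, with `gamma = 1.0` the shaping term is zero whenever relevance is the same in consecutive states, so only the task reward remains. That is loosely analogous to the abstract's description of VLM guidance fading once the desired motion is achieved, but it is not the paper's state-dependent mechanism.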
