2601.06122v1 Jan 04, 2026 cs.CV

COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control

Canming Xia (citations: 2, h-index: 1)
Peixi Peng (citations: 15, h-index: 3)
Guang Tan (citations: 20, h-index: 2)
Zhan Su (citations: 42, h-index: 4)
Haoran Xu (citations: 54, h-index: 5)
Zhenxian Liu (citations: 29, h-index: 2)
Luntong Li (citations: 14, h-index: 2)

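Visual reinforcement learning (RL) has low sample efficiency on complex tasks because its observations are high-dimensional. Prior work has shown that vision-language models (VLMs) can support RL, but it mostly transfers knowledge from the VLM to RL and overlooks how RL-generated interaction data could improve the VLM. To address this, the paper proposes COVR, a collaborative optimization framework in which the VLM and the RL policy improve each other. Specifically, COVR fine-tunes the VLM on RL-generated data to strengthen semantic reasoning that is consistent with the target task, then uses the improved VLM to guide policy learning more effectively through action priors. To make fine-tuning more efficient, it introduces (1) an Exploration-Driven Dynamic Filter module, which preserves valuable exploration samples using a threshold that adapts to the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module, which improves training stability by using RL return signals to quantify the inconsistency of sampled actions. The authors also design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across a range of challenging visual control tasks.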

Original Abstract

Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they often focus on knowledge distillation from the VLM to RL, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables the mutual enhancement of the VLM and RL policies. Specifically, COVR fine-tunes the VLM with RL-generated data to enhance the semantic reasoning ability consistent with the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module that improves the stability of training by quantifying the inconsistency of sampling actions via return signals of RL. We further design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across various challenging visual control tasks.
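The abstract names the two fine-tuning modules but does not give their formulas, so the following is only a minimal Python sketch of the general idea, not the authors' method. Everything in it is an assumption made for illustration: the `Transition` record, the `keep_fraction` schedule (more exploration keeps more samples), ranking samples by episode return, and a softmax over returns standing in for the return-aware loss weight.

```python
import math
from dataclasses import dataclass


@dataclass
class Transition:
    """Hypothetical record of one RL-generated sample for VLM fine-tuning."""
    observation_id: int
    action: int
    episode_return: float  # return of the episode this sample came from


def keep_fraction(exploration_degree, min_keep=0.2, max_keep=0.8):
    """Assumed schedule: the more the agent explores, the more samples are kept."""
    e = min(max(exploration_degree, 0.0), 1.0)  # clamp to [0, 1]
    return min_keep + e * (max_keep - min_keep)


def dynamic_filter(transitions, exploration_degree):
    """Keep the highest-return samples; the cutoff adapts to exploration."""
    count = max(1, round(keep_fraction(exploration_degree) * len(transitions)))
    ranked = sorted(transitions, key=lambda t: t.episode_return, reverse=True)
    return ranked[:count]


def return_aware_weights(transitions, temperature=1.0):
    """Softmax over returns, rescaled so the weights average to 1.

    This is a simple stand-in, not the paper's weighting rule.
    """
    returns = [t.episode_return for t in transitions]
    peak = max(returns)  # subtract the max for numerical stability
    scores = [math.exp((r - peak) / temperature) for r in returns]
    total = sum(scores)
    return [len(scores) * s / total for s in scores]


if __name__ == "__main__":
    samples = [Transition(i, 0, float(i)) for i in range(10)]
    kept = dynamic_filter(samples, exploration_degree=0.0)
    for sample, weight in zip(kept, return_aware_weights(kept)):
        print(sample.observation_id, round(weight, 3))
```

In a fine-tuning loop, each kept sample's per-example loss would be multiplied by its weight before averaging, so that samples from higher-return episodes contribute more to the VLM update. Whether COVR filters and weights in this direction is not stated in the abstract.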

Citations: 1, Influential: 0, Altmetric: 2.5, Score: 13.5
