2603.10652v1 Mar 11, 2026 cs.CV

Are Video Reasoning Models Ready for Use in Real-World Environments?

Are Video Reasoning Models Ready to Go Outside?

Jaehong Yoon

Citations: 190

h-index: 5

Yangfan He

Citations: 62

h-index: 3

C. Boo

Citations: 58

h-index: 2

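When vision-language models are deployed in real-world settings, factors such as weather changes, occlusion, and camera motion often degrade their performance. Under these conditions, the models' understanding and reasoning decline markedly, exposing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To overcome this limitation, we propose ROVA, a new training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA also introduces a difficulty-aware online training strategy that prioritizes informative samples according to the model's evolving capability. Specifically, it continuously re-estimates sample difficulty through self-reflective evaluation, enabling adaptive training with the robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess accuracy and reasoning quality under realistic disturbances. Evaluating ROVA and baselines on PVRBench, UrbanVideo, and VisBench, we find that open-source and proprietary models suffer drops of up to 35% and 28% in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates this degradation, improving relative accuracy by at least 24% and reasoning by more than 9% over baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains also carry over to clean standard benchmarks, yielding consistent improvements.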

Original Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
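The abstract names two training ingredients, a consistency reward computed under perturbations and a difficulty-aware sampling strategy, without giving their formulas. The sketch below is an illustration only, not ROVA's actual formulation: the function names, the 0.5/0.5 reward split, the moving-average difficulty estimate (a stand-in for the paper's self-reflective evaluation), and the `p * (1 - p)` weighting are all assumptions chosen to make the idea concrete.

```python
from typing import List


def consistency_reward(clean_answer: str, perturbed_answer: str, gold: str) -> float:
    """Hypothetical reward: half credit for answering the perturbed video
    correctly, half credit for agreeing with the clean-video answer."""
    correct = 1.0 if perturbed_answer == gold else 0.0
    agrees = 1.0 if perturbed_answer == clean_answer else 0.0
    return 0.5 * correct + 0.5 * agrees


def update_success_rate(prev: float, reward: float, momentum: float = 0.9) -> float:
    """Assumed difficulty proxy: an exponential moving average of the
    rewards a sample has earned so far (high value = easy sample)."""
    return momentum * prev + (1.0 - momentum) * reward


def sampling_weights(success_rates: List[float], eps: float = 0.05) -> List[float]:
    """Assumed prioritization: samples the model solves about half the time
    get the largest weight; always-solved and never-solved samples get
    only the small floor eps."""
    return [p * (1.0 - p) + eps for p in success_rates]


if __name__ == "__main__":
    # Three hypothetical samples: never solved, sometimes solved, always solved.
    rates = [0.0, 0.5, 1.0]
    print(sampling_weights(rates))
    print(consistency_reward("turn left", "turn left", "turn left"))
```

Under these assumptions, the weights would be passed to a weighted sampler (for example `random.choices(samples, weights=...)`) at each training step, and each sample's success rate would be refreshed after its reward is observed, so the priority ordering shifts as the model improves.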

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
