2603.22918v1 Mar 24, 2026 cs.CV

EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Yao Zhang (Citations: 135, h-index: 7)
Haonan Lu (Citations: 90, h-index: 4)
Jiahao Wang (Citations: 19, h-index: 2)
Haonan Duan (Citations: 163, h-index: 5)
Lewei Lu (Citations: 24, h-index: 3)
Ruohui Wang (Citations: 277, h-index: 2)
Yepeng Tang (Citations: 138, h-index: 7)
Xuanyu Zheng (Citations: 7, h-index: 1)
Hanming Deng (Citations: 1,213, h-index: 11)

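Video understanding with multimodal large language models (MLLMs) remains a difficult problem because videos produce long token sequences with wide-ranging temporal dependencies and redundant frames. Existing approaches generally treat the MLLM as a passive recognizer and perform no adaptive reasoning when processing entire videos or uniformly sampled frames. Recent agent-based methods introduce external tools but still rely on manually designed workflows and perception-first strategies, which makes them inefficient on long videos. In this paper we propose EVA, an efficient reinforcement learning framework for video agents that plans before it perceives (planning-before-perception) through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven, efficient video understanding. To train such an agent, we design a simple yet effective three-stage training pipeline consisting of supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO), bridging supervised imitation learning and reinforcement learning. We also construct high-quality datasets suited to each stage to support stable, reproducible training. EVA is evaluated on six video understanding benchmarks, where it demonstrates comprehensive capability. It improves on general MLLM baselines by 6-12% and outperforms prior adaptive agent methods by a further 1-3%. The code and model for EVA are available at https://github.com/wangruohui/EfficientVideoAgent.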

Original Abstract

Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.
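
The planning-before-perception loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Action`, `AgentState`, `run_agent`, `toy_policy`, and `toy_perceive`, as well as the `max_steps` and `frame_budget` parameters, are all hypothetical and chosen only for illustration. The sketch assumes the agent first plans which clip to inspect and how many frames to sample, then perceives only that clip, folds the observation into a running summary, and repeats until it decides to answer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical action: inspect a clip ("watch") or finish ("answer")."""
    kind: str                      # "watch" or "answer"
    start: float = 0.0             # clip start in seconds (when to watch)
    end: float = 0.0               # clip end in seconds
    num_frames: int = 0            # frames to sample from the clip (how to watch)
    answer: Optional[str] = None   # set only when kind == "answer"


@dataclass
class AgentState:
    query: str
    summary: str = ""              # running summary of the evidence seen so far
    frames_used: int = 0
    history: List[Action] = field(default_factory=list)


def run_agent(
    query: str,
    duration: float,
    policy: Callable[[AgentState, float], Action],
    perceive: Callable[[float, float, int], str],
    max_steps: int = 8,
    frame_budget: int = 64,
) -> Tuple[Optional[str], AgentState]:
    """Plan first, then perceive only what the plan asks for."""
    state = AgentState(query=query)
    for _ in range(max_steps):
        # Plan: the policy sees only the query and summary, not the raw video.
        action = policy(state, duration)
        state.history.append(action)
        if action.kind == "answer":
            return action.answer, state
        # Action: sample frames from the requested clip, within the budget.
        n = min(action.num_frames, frame_budget - state.frames_used)
        if n <= 0:
            break
        observation = perceive(action.start, action.end, n)
        state.frames_used += n
        # Summary/reflection: fold the new observation into the summary.
        state.summary = (state.summary + " " + observation).strip()
    return None, state


# Toy stand-ins (assumptions) so the loop can be run without a real model.
def toy_policy(state: AgentState, duration: float) -> Action:
    if state.frames_used == 0:
        return Action("watch", start=0.0, end=min(10.0, duration), num_frames=4)
    return Action("answer", answer=f"seen: {state.summary}")


def toy_perceive(start: float, end: float, n: int) -> str:
    return f"{n} frames from {start:.0f}-{end:.0f}s"


answer, final_state = run_agent("What happens first?", 120.0, toy_policy, toy_perceive)
```

In a real system the policy would be the trained MLLM and `perceive` would decode and caption or embed video frames; here both are stubs so that the control flow is visible on its own.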
