2605.01659v1 May 03, 2026 cs.CV

TRIMMER: A New Paradigm for Video Summarization through Self-Supervised Reinforcement Learning

Dimosthenis Karatzas

Citations: 213

h-index: 7

Pritam Mishra

Citations: 4

h-index: 1

C. Ballester

Citations: 7,654

h-index: 21

Original Abstract

The rapid growth of video content across domains such as surveillance, education, and social media has made efficient content understanding increasingly critical. Video summarization addresses this challenge by generating concise yet semantically meaningful representations, but existing approaches often rely on expensive manual annotations, struggle to generalize across domains, and incur significant computational costs due to complex architectures. Moreover, unsupervised and weakly supervised methods typically underperform compared to supervised counterparts in capturing long-range temporal dependencies and semantic structure. In this work, we propose TRIMMER (Temporal Relative Information Maximization for Multi-objective Efficient Reinforcement), a novel self-supervised reinforcement learning framework for video summarization. TRIMMER operates in two stages: it first learns robust representations via self-supervised learning and then performs spatio-temporal decision making through reinforcement learning guided by information-theoretic reward functions. Unlike prior approaches that rely on similarity-based objectives, our method introduces entropy-based metrics to capture higher-order temporal dynamics and semantic diversity, while computing rewards directly over selected frame indices to improve computational efficiency. Extensive experiments on standard benchmarks demonstrate that TRIMMER achieves state-of-the-art performance among unsupervised and self-supervised methods, while remaining competitive with leading supervised approaches, highlighting its effectiveness for scalable and generalizable video summarization.
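The abstract does not give the form of the entropy-based rewards, so the following is only an illustrative sketch of how a reward could be computed directly over selected frame indices, not the authors' method. It assumes each frame has already been assigned a discrete cluster id (for example by clustering self-supervised features), and it combines two normalized Shannon entropies: one over the cluster ids of the selected frames (semantic diversity) and one over coarse temporal bins of the selected indices (temporal coverage). The names `shannon_entropy`, `entropy_reward`, `cluster_ids`, `num_bins`, and `alpha` are hypothetical.

```python
import numpy as np


def shannon_entropy(counts):
    """Shannon entropy (in nats) of the distribution implied by raw counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total  # drop empty bins so log(0) never occurs
    return float(-(p * np.log(p)).sum())


def entropy_reward(cluster_ids, selected, num_clusters, num_frames,
                   num_bins=4, alpha=0.5):
    """Hypothetical reward in [0, 1] computed only from the selected indices.

    Assumes num_clusters > 1 and num_bins > 1 so the normalizers are nonzero.
    """
    cluster_ids = np.asarray(cluster_ids, dtype=int)
    selected = np.asarray(selected, dtype=int)
    if selected.size == 0:
        return 0.0

    # Semantic diversity: how evenly the selection covers the clusters.
    sem_counts = np.bincount(cluster_ids[selected], minlength=num_clusters)
    semantic = shannon_entropy(sem_counts) / np.log(num_clusters)

    # Temporal coverage: how evenly the selection spreads over the video.
    bins = (selected * num_bins) // num_frames
    tmp_counts = np.bincount(bins, minlength=num_bins)
    temporal = shannon_entropy(tmp_counts) / np.log(num_bins)

    return alpha * semantic + (1.0 - alpha) * temporal


# Toy video: 8 frames, 4 clusters, two consecutive frames per cluster.
cluster_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
spread = entropy_reward(cluster_ids, [0, 2, 4, 6], num_clusters=4, num_frames=8)
clumped = entropy_reward(cluster_ids, [0, 1], num_clusters=4, num_frames=8)
```

In this toy setup the evenly spread selection scores higher than the clumped one. Because both terms touch only the selected indices, the cost scales with the summary length rather than with all pairs of frames, which is one plausible reading of the efficiency claim in the abstract.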

1 Citation

0 Influential

10.5 Altmetric

53.5 Score
