2604.27975v1 Apr 30, 2026 cs.CV

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Zujin Guo

Citations: 249

h-index: 3

V. Goriachko

Citations: 2

h-index: 1

Zhenhui Ye

Citations: 1,831

h-index: 17

Zhibin Hong

Citations: 2,351

h-index: 22

Mingming Gong

Citations: 204

h-index: 8

Ce Chen

The University of Melbourne

Citations: 435

h-index: 4

Yi Ren

Citations: 298

h-index: 8

Yuanming Li

Citations: 135

h-index: 6

Abstract

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model). Project page: https://chence17.github.io/TransVLM/
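
Two structural ideas in the abstract can be illustrated with a toy sketch: labeling a transition as a frame interval rather than a single cut point, and fusing motion with color without adding visual tokens. Everything below is an assumption made for illustration and is not taken from the TransVLM code. The names `Segment`, `synth_dissolve`, `temporal_iou`, and `visual_tokens` are hypothetical. A linear cross-dissolve stands in for whatever transitions the data engine actually synthesizes. Temporal IoU is a common interval-matching measure, not necessarily the benchmark's metric. The token-count helper assumes motion is concatenated along the channel axis before patching, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Half-open frame interval [start, end) covered by one transition."""
    start: int
    end: int


def synth_dissolve(clip_a, clip_b, length):
    """Cross-dissolve the tail of clip_a into the head of clip_b.

    A clip is a list of frames; a frame is a flat list of pixel intensities.
    Returns the composed video and the ground-truth transition Segment.
    """
    if not 0 < length <= min(len(clip_a), len(clip_b)):
        raise ValueError("length must fit inside both clips")
    head, tail = clip_a[:-length], clip_b[length:]
    offset = len(clip_a) - length  # first frame of clip_a that gets blended
    blended = []
    for i in range(length):
        alpha = (i + 1) / (length + 1)  # weight of clip_b, strictly in (0, 1)
        frame_a, frame_b = clip_a[offset + i], clip_b[i]
        blended.append([(1 - alpha) * a + alpha * b
                        for a, b in zip(frame_a, frame_b)])
    return head + blended + tail, Segment(len(head), len(head) + length)


def temporal_iou(pred, gt):
    """Intersection-over-union of two half-open frame intervals."""
    inter = max(0, min(pred.end, gt.end) - max(pred.start, gt.start))
    union = (pred.end - pred.start) + (gt.end - gt.start) - inter
    return inter / union if union > 0 else 0.0


def visual_tokens(height, width, patch, channels):
    """Return (token_count, values_per_patch) for non-overlapping patches.

    Assumption: extra channels (e.g. 2 optical-flow channels next to 3 color
    channels) widen each patch vector but do not change the number of patches.
    """
    return (height // patch) * (width // patch), patch * patch * channels


if __name__ == "__main__":
    dark = [[0.0] * 4 for _ in range(10)]    # toy clip A: 10 black frames
    bright = [[1.0] * 4 for _ in range(10)]  # toy clip B: 10 white frames
    video, gt = synth_dissolve(dark, bright, length=4)
    print(len(video), gt)                    # composed length and interval label
    print(temporal_iou(Segment(7, 11), gt))  # overlap of a shifted prediction
    print(visual_tokens(224, 224, 14, 3), visual_tokens(224, 224, 14, 5))
```

In this toy setup a point-based label would have to pick one frame inside the four blended frames, whereas the interval label marks all of them, so the shots on either side can be cut cleanly. The last line shows that, under the channel-concatenation assumption, moving from 3 to 5 input channels leaves the token count unchanged and only widens each patch vector.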
