2603.24539v1 Mar 25, 2026 cs.CV

CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition

N. Navab

N. Padoy

Florian Stilz

V. Srivastav

Original Abstract

Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA-public/CliPPER.
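The abstract names its objectives but does not give their formulations. As a rough illustration of the general idea behind video-text contrastive learning and bidirectional consistency, the sketch below implements a standard symmetric InfoNCE loss and a simple round-trip retrieval check in plain Python. This is a generic textbook formulation, not CliPPER's VTC_CTX or Cycle-Consistency Alignment; the function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Generic symmetric contrastive loss (assumed formulation).

    Clip i and caption i are treated as the positive pair; every other
    caption (or clip) in the batch serves as a negative.
    """
    n = len(video_embs)
    sim = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]
    # Video-to-text direction: each row is a distribution over captions.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: transpose, each row is a distribution over clips.
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


def round_trip_consistent(video_embs, text_embs):
    """Hard round-trip check (illustrative only, not a training loss).

    For each clip: retrieve its nearest caption, then retrieve that
    caption's nearest clip, and report whether we land back on the start.
    """
    results = []
    for i, v in enumerate(video_embs):
        j = max(range(len(text_embs)), key=lambda k: cosine(v, text_embs[k]))
        back = max(range(len(video_embs)),
                   key=lambda k: cosine(video_embs[k], text_embs[j]))
        results.append(back == i)
    return results


# Toy two-clip batch with hypothetical 2-D embeddings.
clips = [[1.0, 0.0], [0.0, 1.0]]
captions_aligned = [[1.0, 0.0], [0.0, 1.0]]
captions_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_info_nce(clips, captions_aligned)
loss_swapped = symmetric_info_nce(clips, captions_swapped)
```

With these toy inputs the aligned pairing yields a much lower loss than the swapped pairing, which is the behavior a contrastive objective is meant to reward. A differentiable version would normally be written in a deep learning framework and applied to encoder outputs rather than hand-written vectors.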
