2601.09851v2 Jan 14, 2026 cs.CV

ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning

U. Topcu

Po-han Li

The University of Texas at Austin

Shenghui Chen

Sandeep P. Chinchali

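Multimodal video captioning summarizes large volumes of video into a structured format of keyframes and natural language. This approach produces coherent multimodal summaries that ground generative AI in rich semantic information and serve as a lightweight proxy for efficient retrieval. However, existing metrics such as BLEU and ROUGE fail to quantify information coverage across different modalities, for example between a text paragraph and a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score. ViSIL is an information-theoretic framework that uses vision-language model (VLM) inference to quantify the video information a summary fails to capture. Because it measures information loss, ViSIL is a unified metric that allows direct comparison of multimodal summary formats despite their structural differences. Our results show that ViSIL scores correlate statistically significantly with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection that optimizes the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that improves VQA accuracy by 7% over text summaries without increasing processing load.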

Original Abstract

Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by $7\%$ in VQA accuracy without increasing processing load.
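
The summary-selection step mentioned in the abstract can be illustrated with a minimal sketch of Pareto-frontier filtering over two objectives, information loss and processing cost. This is not the authors' implementation: the `Summary` type, the `pareto_frontier` helper, the candidate names, and all numeric values below are hypothetical and chosen only to show the mechanics.

```python
from typing import List, NamedTuple


class Summary(NamedTuple):
    name: str
    info_loss: float  # lower is better (stand-in for a ViSIL-style score)
    cost: float       # lower is better (e.g. relative processing time)


def pareto_frontier(cands: List[Summary]) -> List[Summary]:
    """Keep candidates that no other candidate dominates on both objectives."""
    frontier = []
    for c in cands:
        dominated = any(
            o.info_loss <= c.info_loss
            and o.cost <= c.cost
            and (o.info_loss < c.info_loss or o.cost < c.cost)
            for o in cands
        )
        if not dominated:
            frontier.append(c)
    # Sort by cost so the frontier reads from cheapest to most expensive.
    return sorted(frontier, key=lambda s: s.cost)


# Illustrative values only; these are NOT results from the paper.
candidates = [
    Summary("text_only", 0.60, 1.0),
    Summary("text_plus_2_keyframes", 0.45, 1.5),
    Summary("text_plus_8_keyframes", 0.40, 3.0),
    Summary("8_keyframes_only", 0.55, 2.8),
]

frontier = pareto_frontier(candidates)
for s in frontier:
    print(s.name, s.info_loss, s.cost)
```

In this toy example, `8_keyframes_only` is dropped because `text_plus_2_keyframes` has both lower loss and lower cost; the remaining candidates each represent a different trade-off between the two objectives.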
