2605.06094v1 May 07, 2026 cs.CV

VISD: Enhancing Video Reasoning via Structured Self-Distillation

Hongbo Jin (Citations: 36, h-index: 4)

Zhongjing Du (Citations: 21, h-index: 3)

Qiao Zhang (Citations: 12, h-index: 3)

Kunyang Lv (Citations: 3, h-index: 1)

Jingqi Tian (Citations: 35, h-index: 3)

Hao Lin (Citations: 4, h-index: 1)

Xu Jiang (Citations: 15, h-index: 2)

Jiayuan Ding (Citations: 286, h-index: 4)

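Training video LLMs to perform complex reasoning remains difficult because rewards are sparse and sequence-level, and because fine-grained credit assignment is lacking over long, temporally grounded reasoning trajectories. Reinforcement learning with verifiable rewards (RLVR) provides reliable supervision but cannot identify token-level contributions, which makes learning inefficient. Existing self-distillation methods, by contrast, provide dense supervision signals but lack structure and diagnostic specificity, and they often interact unstably with reinforcement learning. This work proposes VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD uses a video-aware judge model to decompose reasoning quality into several dimensions, such as answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy that supplies token-level supervision. To integrate dense supervision with reinforcement learning stably, it introduces a direction-magnitude decoupling mechanism: rollout-level advantages computed from rewards determine the update direction, while structured privileged signals modulate the magnitude of token-level updates. This design enables semantically aligned, fine-grained credit assignment and improves both reasoning faithfulness and training efficiency. VISD also incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. In experiments on diverse benchmarks, VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, it reaches these gains with nearly 2x faster convergence in optimization steps, which indicates that structured self-supervision is effective at improving both the performance and the sample efficiency of video LLMs.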

Original Abstract

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
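
The abstract does not give formulas for direction-magnitude decoupling or the EMA teacher, so the following is only a minimal NumPy sketch of one way the idea could look, not the authors' implementation. Everything here is an assumption made for illustration: the group-normalized advantage (a GRPO-style choice), the bounded weight function built from the teacher/student log-probability gap, the parameter names `w_min`, `w_max`, and `decay`, and all function names. The one property the sketch is meant to show is that the sign of each token's update comes only from the rollout-level advantage, while the teacher signal changes only its size.

```python
import numpy as np


def rollout_advantages(rewards):
    """Rollout-level advantages via group normalization (assumed, GRPO-style)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std < 1e-8:
        # All rollouts scored the same: no preference, so no update direction.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std


def token_weights(teacher_logp, student_logp, w_min=0.5, w_max=2.0):
    """Strictly positive per-token magnitudes (hypothetical functional form).

    Tokens where the feedback-conditioned teacher disagrees more with the
    student get a larger weight, bounded to the interval [w_min, w_max).
    """
    gap = np.abs(np.asarray(teacher_logp) - np.asarray(student_logp))
    return w_min + (w_max - w_min) * (1.0 - np.exp(-gap))


def decoupled_token_advantages(rewards, teacher_logp, student_logp):
    """Combine direction (rollout level) with magnitude (token level).

    rewards: shape (G,), one verifiable reward per rollout.
    teacher_logp, student_logp: shape (G, T), per-token log-probabilities.
    Returns an array of shape (G, T).
    """
    adv = rollout_advantages(rewards)                  # sign = update direction
    weights = token_weights(teacher_logp, student_logp)  # always > 0
    return adv[:, None] * weights


def ema_update(teacher_params, student_params, decay=0.99):
    """Exponential moving average of student parameters into the teacher."""
    return {
        name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
        for name in teacher_params
    }
```

Because the weights are strictly positive, multiplying by them can never flip the direction chosen by the reward; under this assumed form, that is what keeps the dense signal from fighting the RL objective.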

3 Citations

0 Influential

2 Altmetric

13.0 Score
