2604.05117v1 Apr 06, 2026 cs.CV

Watch Before You Answer: Learning from Visually Grounded Post-Training

Ping Nie (Citations: 394, h-index: 11)

Dongfu Jiang, University of Waterloo (Citations: 3,632, h-index: 13)

Huaisong Zhang (Citations: 21, h-index: 2)

Yuxuan Zhang (Citations: 45, h-index: 3)

Eunjeong Hwang (Citations: 6, h-index: 2)

Penghui Du (Citations: 18, h-index: 3)

Yiming Jia (Citations: 75, h-index: 4)

Xuan He (Citations: 1,910, h-index: 6)

Shen Zhang (Citations: 6, h-index: 2)

Peter West (Citations: 6, h-index: 2)

Kelsey Allen (Citations: 7, h-index: 2)

Original Abstract

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
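The abstract does not spell out how visually grounded questions are identified, so the following is only a minimal sketch of one plausible filter, under stated assumptions: a question counts as text-answerable if a "blind" answerer that never sees the video still picks the correct option. The `QAItem` schema, the function names, and the toy `longest_option_guesser` (a stand-in for a real text-only model) are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class QAItem:
    """One multiple-choice video question (hypothetical schema)."""
    question: str
    options: Sequence[str]
    answer_index: int


# A blind answerer sees only the question text and options, never the video.
BlindAnswerer = Callable[[str, Sequence[str]], int]


def is_text_answerable(item: QAItem, answerers: Sequence[BlindAnswerer]) -> bool:
    """Return True if any blind answerer picks the correct option."""
    return any(a(item.question, item.options) == item.answer_index for a in answerers)


def keep_visually_grounded(
    items: Sequence[QAItem], answerers: Sequence[BlindAnswerer]
) -> List[QAItem]:
    """Drop every item that at least one blind answerer gets right."""
    return [it for it in items if not is_text_answerable(it, answerers)]


def longest_option_guesser(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a text-only model: always picks the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))


if __name__ == "__main__":
    data = [
        QAItem("What color is the car in the opening shot?",
               ["red", "blue", "green"], 0),
        QAItem("Which tool is commonly used to cut paper?",
               ["a spoon", "a pair of scissors", "a cup"], 1),
    ]
    kept = keep_visually_grounded(data, [longest_option_guesser])
    print(f"kept {len(kept)} of {len(data)} questions")
```

In practice the blind answerer would be a language model queried without video frames, possibly sampled several times to reduce the chance that a lucky guess removes a genuinely visual question; that choice is a design assumption here, not a detail confirmed by the abstract.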
