2603.09715v1 Mar 10, 2026 cs.AI

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

Tianfan Fu (Citations: 7, h-index: 1)
Pengqi Sun (Citations: 0, h-index: 0)
Huawen Shen (Citations: 78, h-index: 6)
Yanbo Wang (Citations: 453, h-index: 7)
Yuqian Li (Citations: 4, h-index: 1)
Yireh Ban (Citations: 0, h-index: 0)

Abstract

Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, which limits the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS uses a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, which identifies samples that require vision-language joint reasoning while filtering out semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS, respectively.
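The abstract describes the selection idea only in words, so the following is a minimal sketch of it under stated assumptions, not the authors' implementation. The `Sample` type, the `ValidityFn` interface, and the function names are all hypothetical; the abstract gives no formula, so the signed difference used as the score, and the choice to rank samples whose validity drops when the question is added last, are assumptions made for illustration. In practice `validity` would wrap a frozen VLLM; here it is left as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Sample:
    """One instruction-tuning example (hypothetical container)."""
    image_id: str
    question: str
    answer: str


# Hypothetical evaluator interface: (image_id, question or None, answer) -> score.
# The score is assumed to be a validity estimate in [0, 1] from a frozen model.
ValidityFn = Callable[[str, Optional[str], str], float]


def validity_shift(sample: Sample, validity: ValidityFn) -> float:
    """Change in answer validity when the question is added to the image.

    Assumption: a plain signed difference; the paper's exact score may differ.
    """
    with_question = validity(sample.image_id, sample.question, sample.answer)
    without_question = validity(sample.image_id, None, sample.answer)
    return with_question - without_question


def select_top_fraction(
    samples: Sequence[Sample], validity: ValidityFn, fraction: float
) -> List[Sample]:
    """Keep the `fraction` of samples with the largest validity shift."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(
        samples, key=lambda s: validity_shift(s, validity), reverse=True
    )
    keep = max(1, int(len(ranked) * fraction))  # always keep at least one sample
    return ranked[:keep]
```

No model is trained at any point in this sketch: each sample needs only two forward evaluations of the frozen evaluator, which is the sense in which such a procedure is training-free.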

Citations: 0 | Influential: 0 | Altmetric: 3.5 | Score: 17.5
