2603.01195v1 Mar 01, 2026 cs.CV

VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning

Yuqian Fu

Citations: 245

h-index: 9

Mingkang Dong

Citations: 6

h-index: 2

Hongyi Cai

Citations: 36

h-index: 3

Jie Li

Citations: 5

h-index: 1

Sifan Zhou

Citations: 22

h-index: 2

Bin Ren

Citations: 3

h-index: 1

Kunyu Peng

Citations: 1,753

h-index: 21

Abstract

The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, our selection not only further reduces data size but also surpasses full-data training by 15.8%. These results demonstrate that measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning. Code and the selected subsets will be released upon acceptance.
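
The abstract describes the selection idea only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the score is the plain difference between the text-only loss and the loss with the image, that a small threshold `eps` separates the three categories, and that cluster assignments are already available. The function names, `eps`, and the toy numbers are all hypothetical.

```python
from collections import defaultdict


def necessity_scores(loss_text_only, loss_with_image):
    """Loss reduction attributable to the image (assumed score definition)."""
    return [lt - li for lt, li in zip(loss_text_only, loss_with_image)]


def categorize(scores, eps=0.05):
    """Label each sample; eps is an illustrative threshold, not from the paper."""
    labels = []
    for s in scores:
        if s > eps:
            labels.append("vision-critical")  # the image lowers the loss
        elif s < -eps:
            labels.append("misaligned")       # the image raises the loss
        else:
            labels.append("redundant")        # the image barely matters
    return labels


def select_per_cluster(scores, cluster_ids, keep_ratio=0.15):
    """Keep the highest-scoring fraction of samples inside every cluster."""
    clusters = defaultdict(list)
    for idx, c in enumerate(cluster_ids):
        clusters[c].append(idx)
    selected = []
    for members in clusters.values():
        k = max(1, round(keep_ratio * len(members)))  # at least one per cluster
        ranked = sorted(members, key=lambda i: scores[i], reverse=True)
        selected.extend(ranked[:k])
    return sorted(selected)


# Toy per-sample losses (made-up numbers, for illustration only).
text_only = [2.0, 1.0, 1.5, 0.8, 3.0, 1.2]
with_image = [0.5, 1.0, 2.5, 0.7, 1.0, 1.19]
clusters = [0, 0, 0, 1, 1, 1]

scores = necessity_scores(text_only, with_image)
labels = categorize(scores)
chosen = select_per_cluster(scores, clusters, keep_ratio=0.34)
```

Selecting within each cluster, rather than taking a global top-k, is what keeps low-scoring task types from disappearing entirely, which matches the abstract's stated goal of preserving task diversity.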

2 Citations

0 Influential

10.5 Altmetric

54.5 Score
