2604.09531v1 Apr 10, 2026 cs.CV

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Guanyu Zhou (Citations: 55, h-index: 3)

Zhuang Liu (Citations: 518, h-index: 4)

Yida Yin (Citations: 146, h-index: 6)

Wenhao Chai (Citations: 78, h-index: 3)

Shengbang Tong (Citations: 155, h-index: 3)

Xingyu Fu (Citations: 24, h-index: 3)

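Current vision-language models (VLMs) struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One likely cause is that natural image datasets provide limited supervision for low-level visual skills. This raises the question of whether targeted synthetic data, generated from only a task keyword such as "Depth Order", can address these weaknesses. To investigate, we introduce VisionFoundry, a task-aware synthetic data generation pipeline. Taking only a task name as input, VisionFoundry uses a large language model (LLM) to generate questions, answers, and text-to-image (T2I) prompts, synthesizes images with a T2I model, and verifies consistency with a proprietary VLM. The process requires no reference images or human annotation. Using VisionFoundry, we built VisionFoundry-10K, a synthetic visual question answering (VQA) dataset of 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K improve substantially on visual perception benchmarks, gaining +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and scaling favorably as data size increases. These results suggest that limited task-targeted supervision is an important contributor to this bottleneck, and that synthetic data is a promising path toward more systematic VLM training.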

Original Abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
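
The generate-then-verify flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name concrete models or APIs, so `generate_dataset`, `VQASample`, and the three callables (`propose`, `render`, `verify`) are hypothetical names, and the stub components only imitate an LLM, a T2I model, and a verifier VLM so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VQASample:
    """One image-question-answer triple (hypothetical record layout)."""
    task: str
    question: str
    answer: str
    t2i_prompt: str
    image: bytes


# Hypothetical component interfaces; real systems would wrap model calls here.
Spec = Dict[str, str]                         # keys: question, answer, t2i_prompt
ProposeFn = Callable[[str, int], List[Spec]]  # LLM: task name -> candidate specs
RenderFn = Callable[[str], bytes]             # T2I model: prompt -> image bytes
VerifyFn = Callable[[bytes, str, str], bool]  # VLM: is the image consistent with the QA pair?


def generate_dataset(task: str, n: int, propose: ProposeFn,
                     render: RenderFn, verify: VerifyFn) -> List[VQASample]:
    """Propose n specs for a task, render each image, keep only verified triples."""
    kept: List[VQASample] = []
    for spec in propose(task, n):
        image = render(spec["t2i_prompt"])
        if verify(image, spec["question"], spec["answer"]):
            kept.append(VQASample(task, spec["question"], spec["answer"],
                                  spec["t2i_prompt"], image))
    return kept


# --- Stub components (assumptions for illustration only) ---

def stub_propose(task: str, n: int) -> List[Spec]:
    """Stand-in for the LLM: alternate which object is closer to the camera."""
    specs = []
    for i in range(n):
        closer = "A" if i % 2 == 0 else "B"
        specs.append({
            "question": f"[{task}] Which object is closer to the camera, A or B?",
            "answer": closer,
            "t2i_prompt": f"object {closer} in front of the other object, sample {i}",
        })
    return specs


def stub_render(prompt: str) -> bytes:
    """Stand-in for the T2I model: echo the prompt as 'image' bytes.
    Every third sample simulates a failed generation by returning empty bytes."""
    index = int(prompt.rsplit(" ", 1)[1])
    return b"" if index % 3 == 0 else prompt.encode("utf-8")


def stub_verify(image: bytes, question: str, answer: str) -> bool:
    """Stand-in for the verifier VLM: accept only if the 'image' shows the answer."""
    return f"object {answer} in front".encode("utf-8") in image


if __name__ == "__main__":
    samples = generate_dataset("Depth Order", 6, stub_propose, stub_render, stub_verify)
    for s in samples:
        print(s.task, "|", s.question, "->", s.answer)
```

Passing the three stages in as callables keeps the filtering loop independent of any particular model, so the verification step can be swapped or tightened without touching the generation code.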

0 Citations, 0 Influential, 3 Altmetric, 15.0 Score
