2603.03975v1 Mar 04, 2026 cs.AI

Phi-4-reasoning-vision-15B Technical Report

Jyoti Aneja

Citations: 70

h-index: 4

Michael Harrison

Citations: 679

h-index: 3

Neel Joshi

Citations: 571

h-index: 10

Tyler LaBonte

Citations: 72

h-index: 2

John Langford

Citations: 125

h-index: 7

Eduardo Salinas

Citations: 307

h-index: 3

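This report introduces Phi-4-reasoning-vision-15B, a small, efficient multimodal reasoning model, and shares the motivations, design choices, experimental results, and lessons learned during its development. Our goal is to give the research community practical insight into building smaller, more efficient multimodal reasoning models and, building on that experience, to provide an open-weight model that performs well on general vision and language tasks and is strong at scientific and mathematical reasoning and user interface understanding. The main contributions are as follows. Through careful architecture design and rigorous data curation, we show that a small open-weight multimodal model can achieve competitive performance with far less compute and far fewer tokens. The largest gains came from systematic filtering, error correction, and synthetic data augmentation, which again underscores that data quality is a key factor in model performance. Systematic analysis confirmed that high-resolution, dynamic-resolution encoders yield consistent improvements, showing that accurate perception is a prerequisite for high-quality reasoning. Finally, by mixing reasoning and non-reasoning data and using explicit mode tokens, we show that a single model can provide both fast answers to simple tasks and chain-of-thought reasoning for complex problems.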

Original Abstract

We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal is to contribute practical insight to the research community on building smaller, efficient multimodal reasoning models and to share the result of these learnings as an open-weight model that is good at common vision and language tasks and excels at scientific and mathematical reasoning and understanding user interfaces. Our contributions include demonstrating that careful architecture choices and rigorous data curation enable smaller, open-weight multimodal models to achieve competitive performance with significantly less training and inference-time compute and tokens. The most substantial improvements come from systematic filtering, error correction, and synthetic augmentation -- reinforcing that data quality remains the primary lever for model performance. Systematic ablations show that high-resolution, dynamic-resolution encoders yield consistent improvements, as accurate perception is a prerequisite for high-quality reasoning. Finally, a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to deliver fast direct answers for simpler tasks and chain-of-thought reasoning for complex problems.

2 Citations

0 Influential

5 Altmetric

27.0 Score
