2603.18656v1 Mar 19, 2026 cs.AI

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Nimrod Shabtay

Citations: 60

h-index: 4

Eli Schwartz

Citations: 4

h-index: 1

Shaked Perek

Citations: 61

h-index: 3

Ben Wiesel

Citations: 80

h-index: 3

Avihu Dekel

Citations: 298

h-index: 5

Original Abstract

Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
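To make the weighting idea concrete, here is a minimal sketch of a length-independent, cosine-scheduled loss over the two segments. It is not the authors' implementation: the abstract does not give the exact formula, so the function names, the per-segment averaging, and the start and end weights (0.2 and 0.8) are all assumptions chosen for illustration.

```python
import math


def cosine_answer_weight(step, total_steps, w_start=0.2, w_end=0.8):
    """Weight placed on the <answer> segment at a given training step.

    Moves smoothly from w_start to w_end along a half cosine.
    w_start and w_end are hypothetical values, not taken from the paper.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def vanilla_loss(token_losses):
    """Standard SFT reduction: every token contributes equally."""
    return sum(token_losses) / len(token_losses)


def segment_weighted_loss(token_losses, is_answer, answer_weight):
    """Average each segment separately, then mix the two means.

    Because each segment is reduced to its own mean first, the result
    does not depend on how many tokens each segment contains.
    """
    think = [loss for loss, ans in zip(token_losses, is_answer) if not ans]
    answer = [loss for loss, ans in zip(token_losses, is_answer) if ans]
    think_mean = sum(think) / len(think) if think else 0.0
    answer_mean = sum(answer) / len(answer) if answer else 0.0
    return (1.0 - answer_weight) * think_mean + answer_weight * answer_mean


# Toy sequence: 98 <think> tokens with low loss, 2 <answer> tokens with high loss.
token_losses = [1.0] * 98 + [3.0] * 2
is_answer = [False] * 98 + [True] * 2

flat = vanilla_loss(token_losses)  # dominated by the 98 <think> tokens
early = segment_weighted_loss(token_losses, is_answer,
                              cosine_answer_weight(0, 1000))
late = segment_weighted_loss(token_losses, is_answer,
                             cosine_answer_weight(1000, 1000))
```

In the toy sequence the two answer tokens barely move the flat token average, whereas the segment-weighted loss gives them a fixed share that grows as training progresses. In a real training loop the per-token losses would come from an unreduced cross-entropy, and the mask from the positions of the `<answer>` tags.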
