2604.28123v1 Apr 30, 2026 cs.CV

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Xiaomin Yu (Citations: 5, h-index: 1)
Yunjian Zhang (Citations: 64, h-index: 5)
Weiquan Huang (Citations: 7, h-index: 2)
Hehai Lin (Citations: 64, h-index: 3)
Sudong Wang (Citations: 41, h-index: 3)
Beier Zhu (Citations: 232, h-index: 7)
Chaojun Xiao (Citations: 3,292, h-index: 23)
Keming Wu (Citations: 145, h-index: 8)
Chengwei Qin (Citations: 33, h-index: 3)
Zuhao Yang (Citations: 47, h-index: 4)
Chen Chen (Citations: 38, h-index: 3)
Wenxuan Wang (Citations: 0, h-index: 0)

Original Abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
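The response-level adversarial game described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the sigmoid gate, and the binary cross-entropy objective are assumptions chosen for illustration, and the expert logits, which a real system would compute from learned networks reading the image and the full response, are plain floats here. The sketch shows only the general shape: a discriminator mixes a perception score and a reasoning score, is trained to separate supervision responses from policy responses, and returns a scalar per response that the policy can use as a reward without access to teacher logits.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def moe_score(perception_logit: float, reasoning_logit: float,
              gate_logit: float) -> float:
    """Hypothetical two-expert discriminator score for one response.

    The gate weights the perception expert against the reasoning expert;
    the result is read as P(response comes from the supervision data).
    """
    gate = sigmoid(gate_logit)  # weight on the perception expert
    mixed = gate * perception_logit + (1.0 - gate) * reasoning_logit
    return sigmoid(mixed)


def discriminator_loss(teacher_scores: list[float],
                       policy_scores: list[float]) -> float:
    """Binary cross-entropy (assumed objective): teacher -> 1, policy -> 0."""
    eps = 1e-12  # guards against log(0)
    real = -sum(math.log(s + eps) for s in teacher_scores) / len(teacher_scores)
    fake = -sum(math.log(1.0 - s + eps) for s in policy_scores) / len(policy_scores)
    return real + fake


def policy_rewards(policy_scores: list[float]) -> list[float]:
    """Response-level rewards: the policy sees only one scalar per sampled
    response (black-box), never token-level teacher logits."""
    return list(policy_scores)


# Toy usage: logits are made-up numbers standing in for expert outputs.
teacher = [moe_score(2.0, 1.5, 0.0), moe_score(1.0, 2.5, 0.3)]
policy = [moe_score(-1.0, 0.5, 0.0), moe_score(0.2, -1.5, -0.4)]
loss = discriminator_loss(teacher, policy)
rewards = policy_rewards(policy)
```

In this toy setup the policy would be updated to raise its rewards while the discriminator is updated to lower `loss`, which is the alternating structure of an adversarial game; how PRISM schedules and parameterizes these updates is not specified in the abstract.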
