2603.21289v1 Mar 22, 2026 cs.CV

When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

Authors:

Zhengxian Wu (citations: 13, h-index: 2)
Zirui Liao (citations: 7, h-index: 2)
Hangrui Xu (citations: 10, h-index: 2)
Haoqian Wang (citations: 20, h-index: 2)
Haonan Lu (citations: 160, h-index: 6)
Kai Shi (citations: 3, h-index: 1)
Chuanrui Zhang (citations: 105, h-index: 4)
Jun Yang (citations: 2, h-index: 1)
Ni Yang (citations: 6, h-index: 1)
Qiuying Peng (citations: 105, h-index: 5)
Luyuan Zhang (citations: 50, h-index: 3)
Tianhuang Su (citations: 4, h-index: 1)
Zhenyu Yang (citations: 4, h-index: 1)

Abstract

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://dingwu1021.github.io/SelfJudge/.
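
The abstract names three steps (a self-consistency prior, a bounded Judge-based modulation, and a conversion to within-group relative advantages) but gives no formulas. The following is a minimal sketch of how such a pipeline could look, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the majority-agreement definition of self-consistency, the tanh-based bounding with strength `alpha`, the expectation that Judge scores lie in [0, 1], and the use of mean/standard-deviation normalization for the relative advantage.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def self_consistency_prior(answers):
    """Assumed prior: fraction of sampled trajectories sharing each final answer."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]


def bounded_modulation(prior, judge_scores, alpha=0.5):
    """Assumed modulation: scale each prior by a factor kept inside
    (1 - alpha, 1 + alpha). tanh bounds the factor even if a judge
    score falls outside the expected [0, 1] range."""
    return [
        p * (1.0 + alpha * math.tanh(2.0 * j - 1.0))
        for p, j in zip(prior, judge_scores)
    ]


def group_relative_advantages(scores, eps=1e-6):
    """Assumed conversion: center and scale scores within one group,
    so each advantage is relative to the other samples for the same input."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical group of four sampled trajectories for a single input.
answers = ["12", "12", "15", "12"]      # final answers extracted from each trajectory
judge_scores = [0.9, 0.6, 0.2, 0.8]     # made-up self-judged quality scores

prior = self_consistency_prior(answers)
modulated = bounded_modulation(prior, judge_scores)
advantages = group_relative_advantages(modulated)
```

In this sketch the minority answer receives the lowest advantage, and among trajectories that agree with the majority, the one with the higher Judge score is ranked higher. Because the modulation factor is bounded, the Judge can only reweight the self-consistency signal and cannot override it entirely.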
