2604.03647v1 Apr 04, 2026 cs.CV

Stable Self-Evolution of Multimodal Large Language Models via Continuous Softened Retracing Resampling

Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

Zhengxian Wu

Citations: 8

h-index: 2

Zirui Liao

Citations: 6

h-index: 2

Hangrui Xu

Citations: 6

h-index: 2

Haoqian Wang

Citations: 14

h-index: 2

Yu Yu

Citations: 22

h-index: 1

Zhuo Chen

Citations: 8

h-index: 1

Xiang Deng

Citations: 2,870

h-index: 13

Zhifang Liu

Citations: 13

h-index: 2

Senyuan Shi

Citations: 15

h-index: 2

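In the unsupervised self-evolution of Multimodal Large Language Models (MLLMs), the quality of the feedback signal during post-training is critical for stable and effective learning. Existing self-evolution methods, however, mostly rely on majority voting to select the most frequent output as the pseudo-golden answer. That answer may reflect the model's intrinsic biases and does not guarantee that the underlying reasoning path is objectively correct. To address this, we propose **C**ontinuous **S**oftened **R**etracing re**S**ampling (**CSRS**) for MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (**RRM**), in which the model re-infers from anchor points to broaden its exploration of long-tail reasoning paths. We also propose a Softened Frequency Reward (**SFR**), which replaces binary rewards with continuous signals and calibrates the reward according to each answer's frequency across the sampled reasoning sets. In addition, by incorporating Visual Semantic Perturbation (**VSP**), CSRS encourages the model to prioritize mathematical logic over superficial visual cues. Experiments show that CSRS significantly improves the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision, and achieves state-of-the-art (SOTA) results among unsupervised self-evolution methods on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.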

Original Abstract

In the unsupervised self-evolution of Multimodal Large Language Models, the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, which may stem from the model's intrinsic biases rather than guaranteeing the objective correctness of the reasoning paths. To counteract the degradation, we propose **C**ontinuous **S**oftened **R**etracing re**S**ampling (**CSRS**) in MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (**RRM**) in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose Softened Frequency Reward (**SFR**), which replaces binary rewards with continuous signals, calibrating reward based on the answers' frequency across sampled reasoning sets. Furthermore, incorporated with Visual Semantic Perturbation (**VSP**), CSRS ensures the model prioritizes mathematical logic over visual superficiality. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision. We achieve state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.
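
The contrast the abstract draws between binary majority-vote rewards and a continuous frequency-based signal can be illustrated with a small sketch. This is not the authors' SFR formula, which the abstract does not give. It assumes, purely for illustration, that the softened reward is each answer's relative frequency in the sampled set; the function names `majority_vote_rewards` and `frequency_rewards` and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Binary baseline: 1.0 for samples matching the most frequent answer, else 0.0."""
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]


def frequency_rewards(answers):
    """Assumed softened variant: reward is the answer's relative frequency in the set."""
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]


# Hypothetical final answers extracted from five sampled reasoning paths.
answers = ["12", "12", "15", "12", "9"]
binary = majority_vote_rewards(answers)
soft = frequency_rewards(answers)
print(binary)
print(soft)
```

Under this assumption, minority answers keep a small non-zero reward instead of being zeroed out, so a biased majority does not fully dominate the training signal.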
