2604.13356v1 Apr 14, 2026 cs.CL

Peer-Predictive Self-Training for Language Model Reasoning

Fan Nie

Hanlin Zhang

Shi Feng

Sham M. Kakade

Yiling Chen

Original Abstract

Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.
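The PMI-based scaling described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: PMI is taken as log p(a | q, r) - log p(a | q) for question q, intermediate response r, and aggregate a, and the hypothetical `update_weight` function uses a logistic curve only because it decreases as PMI grows. The paper's actual weighting may differ.

```python
import math


def pmi(logp_agg_given_response: float, logp_agg: float) -> float:
    """Pointwise mutual information between a response r and the aggregate a.

    Assumed form: log p(a | q, r) - log p(a | q), with both log-probabilities
    scored by a language model (the scoring step is not shown here).
    """
    return logp_agg_given_response - logp_agg


def update_weight(pmi_value: float, temperature: float = 1.0) -> float:
    """Hypothetical weight in (0, 1) that shrinks as PMI grows.

    High PMI (response already informative about the aggregate) -> small weight.
    Low or negative PMI (uninformative or misaligned) -> large weight.
    """
    return 1.0 / (1.0 + math.exp(pmi_value / temperature))


def weighted_loss(nll_of_aggregate: float, pmi_value: float) -> float:
    """Scale a self-training loss (negative log-likelihood of the aggregate,
    used as the internal target) by the PMI-derived weight."""
    return update_weight(pmi_value) * nll_of_aggregate


# Toy log-probabilities, invented for illustration only.
aligned = pmi(logp_agg_given_response=-1.0, logp_agg=-4.0)      # informative response
misaligned = pmi(logp_agg_given_response=-5.0, logp_agg=-4.0)   # misleading response

print(weighted_loss(2.0, aligned) < weighted_loss(2.0, misaligned))
```

In this toy setup the aligned response receives a smaller update than the misaligned one, matching the qualitative behavior the abstract describes; the logistic form and the `temperature` parameter are design choices of the sketch, not details from the paper.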
