2605.03871v1 May 05, 2026 cs.AI

EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

S. Li (Citations: 809; h-index: 13)
Pang Wei Koh (Citations: 244; h-index: 6)
Y. Tsvetkov (Citations: 498; h-index: 7)
R. Xin (Citations: 33; h-index: 2)
Teng Xiao (Citations: 122; h-index: 5)
Rulin Shao (Citations: 97; h-index: 3)
Zoey Hao (Citations: 1; h-index: 1)
Melanie Sclar (Citations: 14; h-index: 2)
Faeze Brahman (Citations: 1,839; h-index: 16)
Yike Wang, University of California, Berkeley (Citations: 514; h-index: 8)
Sewoong Oh (Citations: 15; h-index: 1)

Original Abstract

Language models encode substantial evaluative knowledge from pretraining, yet current post-training methods rely on external supervision (human annotations, proprietary models, or scalar reward models) to produce reward signals. Each imposes a ceiling. Human judgment cannot supervise capabilities beyond its own, proprietary APIs create dependencies, and verifiable rewards cover only domains with ground-truth answers. Self-improvement from a model's own evaluative capacity is a reward source that scales with the model itself, yet remains largely untapped by current methods. We introduce EVOLM, a post-training method that structures this capacity into explicit discriminative rubrics and uses them as training signal. EVOLM trains two capabilities within a single language model in alternation: (1) a rubric generator producing instance-specific evaluation criteria optimized for discriminative utility, which maximizes a small frozen judge's ability to distinguish preferred from dispreferred responses; and (2) a policy trained using those rubric-conditioned scores as reward. All preference signals are constructed from the policy's own outputs via temporal contrast with earlier checkpoints, requiring no human annotation or external supervision. EVOLM trains a Qwen3-8B model to generate rubrics that outperform GPT-4.1 on RewardBench-2 by 25.7%. The co-trained policy achieves 69.3% average on the OLMo3-Adapt suite, outperforming policies trained with GPT-4.1 prompted rubrics by 3.9% and with the state-of-the-art 8B reward model SkyWork-RM by 16%. Overall, EVOLM demonstrates that structuring a model's evaluative capacity into co-evolving discriminative rubrics enables self-improvement without external supervision.
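
The two reward signals described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact reward formulas, so the binary "does the judge rank the preferred response higher" reward, the choice to treat the current checkpoint's output as preferred, and every name here (`temporal_contrast_pairs`, `rubric_reward`, `policy_reward`, `keyword_judge`) are assumptions made for illustration. The toy keyword-matching judge merely stands in for the small frozen judge model.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a frozen judge maps (prompt, response, rubric) to a score.
Judge = Callable[[str, str, str], float]


def temporal_contrast_pairs(
    prompts: List[str],
    current_outputs: List[str],
    earlier_outputs: List[str],
) -> List[Tuple[str, str, str]]:
    """Build (prompt, preferred, dispreferred) triples from two checkpoints.

    Assumption: the current checkpoint's output is treated as preferred and
    the earlier checkpoint's output as dispreferred.
    """
    if not (len(prompts) == len(current_outputs) == len(earlier_outputs)):
        raise ValueError("all input lists must have the same length")
    return list(zip(prompts, current_outputs, earlier_outputs))


def rubric_reward(
    prompt: str, rubric: str, preferred: str, dispreferred: str, judge: Judge
) -> float:
    """Assumed discriminative-utility reward for the rubric generator.

    Returns 1.0 if the judge, conditioned on the rubric, scores the preferred
    response strictly higher than the dispreferred one, else 0.0.
    """
    better = judge(prompt, preferred, rubric) > judge(prompt, dispreferred, rubric)
    return 1.0 if better else 0.0


def policy_reward(prompt: str, response: str, rubric: str, judge: Judge) -> float:
    """Assumed policy reward: the judge's rubric-conditioned score."""
    return judge(prompt, response, rubric)


def keyword_judge(prompt: str, response: str, rubric: str) -> float:
    """Toy stand-in for a frozen judge: counts rubric lines found in the response."""
    criteria = [c.strip().lower() for c in rubric.splitlines() if c.strip()]
    text = response.lower()
    return float(sum(1 for c in criteria if c in text))


if __name__ == "__main__":
    good = "Speed is 3 m/s, units shown; for example 6 m in 2 s."
    bad = "Speed is 3."
    for prompt, pref, dispref in temporal_contrast_pairs(["q"], [good], [bad]):
        # A rubric that separates the pair earns reward; one that does not earns none.
        print(rubric_reward(prompt, "units\nexample", pref, dispref, keyword_judge))
        print(rubric_reward(prompt, "speed", pref, dispref, keyword_judge))
```

Under these assumptions, a rubric is rewarded only when it helps the judge separate the two responses, so a criterion that both responses satisfy equally (such as "speed" in the toy example) earns nothing. In the alternating scheme the abstract describes, the rubric generator would be optimized against a signal like `rubric_reward`, and the policy against a signal like `policy_reward`.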
