2603.12246v1 Mar 12, 2026 cs.AI

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Authors:
Xuewei Wang (Citations: 14,786; h-index: 8)
Zhengxing Chen (Citations: 14,752; h-index: 8)
Arman Cohan (Citations: 19; h-index: 2)
DiJia Su (Citations: 641; h-index: 7)
Sida Wang (Citations: 29; h-index: 2)
Song Jiang (Citations: 270; h-index: 8)
Bo Liu (Citations: 508; h-index: 7)
Yuandong Tian (Citations: 184; h-index: 3)
Yixin Liu (Citations: 93; h-index: 2)
Yuehua Yu (Citations: 2; h-index: 1)

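Using reasoning LLMs as judges is a promising way to extend the success of reasoning models to non-verifiable domains, where the correctness or quality of an output cannot be checked directly. Such judges can benefit from inference-time scaling. However, although reasoning judges perform better on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically studied. This work therefore conducts a rigorous study of the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. In a controlled synthetic setting, where a "gold-standard" judge (gpt-oss-120b) provides the preference annotations used to train smaller judges, the study finds key differences between the two: non-reasoning judges easily lead to reward hacking, whereas reasoning judges can train policies that perform strongly when evaluated by the gold-standard judge. Interestingly, the reasoning-judge-trained policies achieve this strong performance by learning to generate highly effective adversarial outputs that deceive other LLM judges, which also lets them score well on popular benchmarks such as Arena-Hard. Together with further analysis, these results highlight both important findings and room for improvement in applying (reasoning) LLM judges to non-verifiable LLM post-training.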

Original Abstract

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, where a "gold-standard" judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges lead to reward hacking easily, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve such strong performance by learning to generate highly effective adversarial outputs that can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM-judges. Combined with our further analysis, our study highlights both important findings and room for improvements for applying (reasoning) LLM-judges in non-verifiable LLM post-training.

Citations: 2 | Influential: 0 | Altmetric: 4 | Score: 22.0
