2604.21223v1 Apr 23, 2026 cs.CL

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

Runheng Liu (Citations: 28, h-index: 2)

Xingchen Xiao (Citations: 2, h-index: 1)

Heyan Huang (Citations: 12, h-index: 2)

Zhijing Wu (Citations: 11, h-index: 2)

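Large language models (LLMs) have shown remarkable capabilities across a wide range of tasks, but their ability to generate human-like text raises concerns about potential misuse. This underscores the need for reliable and effective methods for detecting LLM-generated text. This paper proposes IRM, a new zero-shot approach that uses implicit reward models to detect LLM-generated text. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Whereas earlier reward-based methods rely on preference construction and task-specific fine-tuning, IRM requires neither preference collection nor additional training. The authors evaluate IRM on the DetectRL benchmark and show that it achieves better detection performance than existing zero-shot and supervised methods.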

Original Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their ability to generate human-like text has raised concerns about potential misuse. This underscores the need for reliable and effective methods to detect LLM-generated text. In this paper, we propose IRM, a novel zero-shot approach that leverages Implicit Reward Models for LLM-generated text detection. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Previous reward-based methods rely on preference construction and task-specific fine-tuning. In comparison, IRM requires neither preference collection nor additional training. We evaluate IRM on the DetectRL benchmark and demonstrate that IRM achieves superior detection performance, outperforming existing zero-shot and supervised methods in LLM-generated text detection.
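The abstract does not spell out how the implicit reward is computed, so the following is only a rough sketch under a stated assumption: that the reward takes the familiar DPO-style form, a scaled log-likelihood ratio between an instruction-tuned model and its base model. The function names, the `beta` scale, the length normalization, the threshold, and the direction of the decision rule are all hypothetical and are not taken from the paper. The per-token log-probabilities here are toy numbers rather than real model outputs.

```python
from typing import Sequence


def implicit_reward(
    tuned_logprobs: Sequence[float],
    base_logprobs: Sequence[float],
    beta: float = 1.0,
    normalize: bool = True,
) -> float:
    """Assumed DPO-style implicit reward: beta * (log p_tuned - log p_base).

    Both inputs are per-token log-probabilities of the SAME text, scored by
    an instruction-tuned model and by its base model. This formula is an
    assumption for illustration, not the paper's confirmed scoring rule.
    """
    if len(tuned_logprobs) != len(base_logprobs):
        raise ValueError("both models must score the same token sequence")
    if not tuned_logprobs:
        raise ValueError("cannot score an empty sequence")

    # Sum of per-token log-ratios equals the sequence-level log-ratio.
    total = sum(t - b for t, b in zip(tuned_logprobs, base_logprobs))
    if normalize:
        # Hypothetical choice: average per token so long texts are comparable.
        total /= len(tuned_logprobs)
    return beta * total


def looks_llm_generated(score: float, threshold: float = 0.0) -> bool:
    """Hypothetical decision rule: flag text whose reward exceeds a threshold."""
    return score > threshold


# Toy log-probabilities (made up for the example, not model outputs).
tuned = [-1.0, -0.5, -2.0]
base = [-1.5, -1.0, -2.5]
score = implicit_reward(tuned, base)
flagged = looks_llm_generated(score)
```

In practice the two log-probability lists would come from scoring the same tokenized text with a publicly available instruction-tuned model and its base counterpart, and any threshold would need to be chosen on held-out data.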
