2604.11543v1 Apr 13, 2026 cs.CL

NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

Yi Zhao (Citations: 154, h-index: 9)
Yuzhuo Wang (Citations: 444, h-index: 10)
Wenqing Wu (Citations: 45, h-index: 3)
Siyou Li (Citations: 121, h-index: 2)
Juexi Shao (Citations: 5, h-index: 2)
Yunfei Long (Citations: 5, h-index: 1)
Chengzhi Zhang (Citations: 5, h-index: 1)

Abstract

Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and the corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.
