2601.21654v3 Jan 29, 2026 cs.AI

ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research

Hao Shen

Hang Yang

Zhouhong Gu

Weili Han

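Large language models have advanced from single-turn question answering to deep research systems that iteratively decompose research questions, use retrieval tools, and synthesize information over multiple rounds. Such systems are usually evaluated by scoring the final research report as a whole, but this end-to-end approach tightly couples the language model's decision-making, the workflow design, and environmental feedback, which makes it difficult to analyze individual components separately. This paper introduces ScholarGym, an evaluation environment that isolates the information-gathering stage of deep research on academic literature. Under a unified workflow, ScholarGym decomposes the research process into three explicit stages (query planning, tool invocation, and relevance assessment) and evaluates each stage on 2,536 expert-annotated queries over a static corpus of 570K papers with deterministic retrieval. Systematic experiments show that iterative query decomposition improves F1 by a factor of 2.9 to 3.3 over single-query retrieval, that models with extended thinking tend to trade recall for precision, and that query planning quality and relevance assessment together form dual bottlenecks that separate proprietary from open-source model performance.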

Original Abstract

Large language models have advanced from single-turn question answering to deep research systems that iteratively decompose research questions, invoke retrieval tools, and synthesize information across multiple rounds. Evaluating such systems typically involves scoring their final research reports holistically, but this end-to-end paradigm tightly couples the language model's decision-making, workflow design, and environmental feedback, precluding decomposable analysis of individual components. We introduce ScholarGym, an evaluation environment that isolates the information-gathering stage of deep research on academic literature. Under a unified workflow, ScholarGym decomposes the research process into three explicit stages (Query Planning, Tool Invocation, and Relevance Assessment) and evaluates each against 2,536 expert-annotated queries over a static corpus of 570K papers with deterministic retrieval. Systematic experiments reveal that iterative query decomposition yields 2.9 to 3.3× F1 gains over single-query retrieval, models with extended thinking trade recall for precision, and Query Planning quality together with Relevance Assessment constitute dual bottlenecks that separate proprietary from open-source model performance.
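
The abstract reports results in terms of precision, recall, and F1 over retrieved papers. As a rough illustration only, the sketch below computes these set-based metrics for one query. The abstract does not give the paper's exact metric definitions, so the standard set-based formulas are an assumption here, and the function name `precision_recall_f1` and all paper IDs are hypothetical. The numbers are not results from the paper.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for a single query.

    Assumption: standard definitions over sets of paper IDs; the
    benchmark's own scoring may differ in detail.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # papers that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean; define it as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical ground truth for one query (made-up paper IDs).
relevant = {"p1", "p2", "p3", "p4"}

# Hypothetical result of a single query vs. the union of several sub-queries.
single_query = ["p1", "p9"]
decomposed = ["p1", "p2", "p3", "p8"]

for name, result in [("single", single_query), ("decomposed", decomposed)]:
    p, r, f = precision_recall_f1(result, relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the decomposed search covers more of the relevant set, so recall and F1 rise together. A stricter relevance filter would instead shrink the retrieved set, which can raise precision while lowering recall, the trade-off the abstract attributes to models with extended thinking.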
