2601.18207v1 Jan 26, 2026 cs.LG

PaperSearchQA: Searching and Reasoning over Scientific Papers with Reinforcement Learning and Verifiable Rewards (RLVR)

PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

Yuhui Zhang

Stanford University

Citations: 13,149

h-index: 21

S. Yeung-Levy

Citations: 1,713

h-index: 22

James Burgess

Citations: 313

h-index: 11

Jan N. Hansen

Citations: 39

h-index: 3

Duo Peng

Citations: 8

h-index: 2

Alejandro Lozano

Citations: 217

h-index: 8

M. Sun

Citations: 41

h-index: 4

Emma Lundberg

Citations: 2,117

h-index: 7

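Search agents are language models (LMs) that search a knowledge base (or the web) and reason over the results to answer questions. Recent methods use reinforcement learning with verifiable rewards (RLVR) to supervise only the accuracy of the final answer. Most RLVR search agents address general-domain question answering, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work, we propose training agents to search and reason over scientific papers. This tests technical question-answering ability, is directly relevant to working scientists, and provides a capability essential to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct PaperSearchQA, a challenging factoid question-answering dataset and benchmark of 60k samples answerable from the corpus. Search agents trained in this environment outperform retrieval baselines that do not use reinforcement learning. We also carry out further quantitative analysis and observe interesting agent behaviors such as planning, reasoning, and self-verification. Our corpus, dataset, and benchmark can be used for RLVR training with the widely used Search-R1 codebase and are released at https://huggingface.co/collections/jmhb/papersearchqa. Finally, our data creation method is scalable and can easily be applied to other scientific domains.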

Original Abstract

Search agents are language models (LMs) that reason and search knowledge bases (or the web) to answer questions; recent methods supervise only the final answer accuracy using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents tackle general-domain QA, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work we propose training agents to search and reason over scientific papers -- this tests technical question-answering, it is directly relevant to real scientists, and the capabilities will be crucial to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA with 60k samples answerable from the corpus, along with benchmarks. We train search agents in this environment to outperform non-RL retrieval baselines; we also perform further quantitative analysis and observe interesting agent behaviors like planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are usable with the popular Search-R1 codebase for RLVR training and released on https://huggingface.co/collections/jmhb/papersearchqa. Finally, our data creation methods are scalable and easily extendable to other scientific domains.
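To make "supervising only the final answer accuracy" concrete, here is a minimal sketch of a verifiable reward for factoid QA. It is not taken from the paper or from the Search-R1 codebase. The `<answer>...</answer>` tag format, the normalization rules, and the function names `normalize`, `extract_answer`, and `exact_match_reward` are all assumptions made for illustration.

```python
import re
import string
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(rollout: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> span, or None.

    The tag format is an assumption for this sketch, not a documented
    requirement of any particular training framework.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def exact_match_reward(rollout: str, gold_answers: List[str]) -> float:
    """Score a full agent rollout using only its final answer.

    Intermediate reasoning and search calls are ignored: the reward is 1.0
    when the normalized final answer equals any normalized gold answer,
    and 0.0 otherwise (including when no answer span is present).
    """
    predicted = extract_answer(rollout)
    if predicted is None:
        return 0.0
    normalized_prediction = normalize(predicted)
    return 1.0 if any(normalized_prediction == normalize(g) for g in gold_answers) else 0.0


if __name__ == "__main__":
    # Purely illustrative strings, not samples from PaperSearchQA.
    rollout = "I should search first. <search>query text</search> <answer>The Example Protein.</answer>"
    print(exact_match_reward(rollout, ["example protein"]))
```

Because the reward depends only on the extracted answer string, it can be checked automatically against gold answers, which is the property that makes it "verifiable". A real setup might accept several gold synonyms per question or use a softer match than strict equality.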

4 Citations

0 Influential

31 Altmetric

159.0 Score
