2601.21937v1 Jan 29, 2026 cs.AI

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Yixin Cao, Zhongyuan Peng, Shuangshuang Ying, Jin Chen, Siyi Liu, Yinzhu Piao, Yuchen Wu, Xin Gui, Xeron Du, Ge Zhang, Stephen Huang, Z. Wang, Yun Peng, Yuhao Wu, Hongbin Lin, Di He, Xin Li, Libo Qin, Gengchen Yu

Abstract

Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving the core difficulties of deep search: multi-step synthesis, denoising, and evidence-based conclusion drawing. DeR2 decouples evidence access from reasoning via four regimes: Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus topically related distractors). The gaps between regimes are interpretable: they operationalize retrieval loss vs. reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires each instance to be unsolvable from parametric knowledge alone (without evidence) yet solvable given the oracle concepts. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.
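As a rough illustration of how the four regimes could be turned into regime gaps, the following Python sketch computes differences between per-regime accuracies. The abstract does not give the exact formulas, so the gap names (`concept_gain`, `extraction_loss`, `distractor_loss`), the `RegimeScores` container, and the helper functions are hypothetical choices made for this example, not the paper's definitions. Only the fragility check (Full-set below Instruction-only) and the two-phase acceptance rule follow directly from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegimeScores:
    """Accuracy in [0, 1] for one model under each evidence regime."""
    instruction_only: float  # question only, no evidence
    concepts: float          # gold concepts, no documents
    related_only: float      # only the relevant documents
    full_set: float          # relevant documents plus distractors


def regime_gaps(s: RegimeScores) -> dict[str, float]:
    """Illustrative gap definitions (assumptions, not the paper's formulas)."""
    return {
        # Gain from being handed the gold concepts instead of no evidence.
        "concept_gain": s.concepts - s.instruction_only,
        # Cost of having to extract the concepts from relevant documents.
        "extraction_loss": s.concepts - s.related_only,
        # Cost of topically related distractors mixed into the library.
        "distractor_loss": s.related_only - s.full_set,
    }


def is_mode_switch_fragile(s: RegimeScores) -> bool:
    """True when adding the full document set hurts relative to no evidence."""
    return s.full_set < s.instruction_only


def passes_two_phase_validation(solved_without_evidence: bool,
                                solved_with_gold_concepts: bool) -> bool:
    """Keep an instance only if it fails parametrically but is oracle-solvable."""
    return (not solved_without_evidence) and solved_with_gold_concepts


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not results from the paper.
    example = RegimeScores(instruction_only=0.25, concepts=0.75,
                           related_only=0.5, full_set=0.125)
    print(regime_gaps(example))
    print(is_mode_switch_fragile(example))
```

Reporting differences rather than raw accuracies keeps each number tied to one change in the evidence condition, which is the kind of error attribution the regimes are designed to support.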
