2605.05242v1 May 03, 2026 cs.IR

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Ming Zhong (Citations: 175; h-index: 6)
Haoxiang Zhang (Citations: 81; h-index: 4)
Wenhu Chen (Citations: 318; h-index: 9)
Jiawei Han (Citations: 226; h-index: 6)
Yuyu Zhang (Citations: 106; h-index: 4)
Dongfu Jiang, University of Waterloo (Citations: 4,100; h-index: 14)
Yu Zhang, University of Illinois at Urbana-Champaign (Citations: 2,681; h-index: 27)
Cong Wei (Citations: 3,493; h-index: 13)
Pan Lu (Citations: 235; h-index: 3)
James Zou (Citations: 58; h-index: 3)
Zhuofeng Li (Citations: 57; h-index: 2)
Ping Nie (Citations: 20; h-index: 2)
Yi Lu (Citations: 25; h-index: 3)
Yuyang Bai (Citations: 357; h-index: 7)
Shangbin Feng (Citations: 23; h-index: 3)
Hangxiao Zhu (Citations: 11; h-index: 2)
Jianwen Xie (Citations: 16; h-index: 2)
Yejin Choi (Citations: 9; h-index: 1)
Jimmy Lin (Citations: 72; h-index: 4)

Abstract

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus. DCI thus opens a broader interface-design space for agentic search.
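To make the idea of direct corpus interaction more concrete, here is a minimal Python sketch of three operations the abstract names: an exact lexical match over raw files, a conjunction of sparse clues, and a local context check. It is an illustration under stated assumptions, not the authors' implementation. The function names (`grep_files`, `conjunctive_search`, `read_context`, `build_toy_corpus`), the `.txt` layout, and the toy documents are all hypothetical; a real agent would more likely issue `grep` and shell commands directly.

```python
import re
import tempfile
from pathlib import Path


def grep_files(root: Path, pattern: str) -> set[Path]:
    """Return .txt files under root with at least one line matching the regex.

    Roughly what `grep -rliE` does; no index or embedding model is involved.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    hits = set()
    for path in root.rglob("*.txt"):
        with path.open(encoding="utf-8", errors="replace") as fh:
            if any(rx.search(line) for line in fh):
                hits.add(path)
    return hits


def conjunctive_search(root: Path, clues: list[str]) -> list[Path]:
    """Keep only files that match every clue (an exact conjunction, not a top-k ranking)."""
    if not clues:
        return []
    candidates = grep_files(root, clues[0])
    for clue in clues[1:]:
        candidates &= grep_files(root, clue)
    return sorted(candidates)


def read_context(path: Path, pattern: str, window: int = 1) -> list[str]:
    """Return the lines within `window` lines of each match, similar to `grep -C`."""
    rx = re.compile(pattern, re.IGNORECASE)
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if rx.search(line):
            keep.update(range(max(0, i - window), min(len(lines), i + window + 1)))
    return [lines[i] for i in sorted(keep)]


def build_toy_corpus(root: Path) -> None:
    """Write three tiny made-up documents (hypothetical content, for illustration only)."""
    (root / "a.txt").write_text(
        "Ada wrote notes on engines.\nShe lived in London.\n", encoding="utf-8"
    )
    (root / "b.txt").write_text(
        "Engines were built in Turin.\nNo author is named.\n", encoding="utf-8"
    )
    (root / "c.txt").write_text("London has many museums.\n", encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp)
        build_toy_corpus(corpus)
        # Two weak clues that only identify a document when combined.
        for doc in conjunctive_search(corpus, ["engines", "london"]):
            print(doc.name, read_context(doc, "london"))
```

Each call scans the files as they currently exist on disk, so adding or editing a document needs no re-indexing step. The trade-off in this sketch is a full scan per clue, which is acceptable for a small local corpus but is not a claim about how the paper's agent manages cost.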

11 Citations

4 Influential

13.5 Altmetric

86.5 Score
