2605.00505v1 May 01, 2026 cs.IR

LLM-Oriented Information Retrieval: A Denoising-First Perspective

Fanpu Cao (Citations: 17, h-index: 3)

Lu Dai (Citations: 60, h-index: 5)

Hui Xiong (Citations: 31, h-index: 4)

Cehao Yang (Citations: 136, h-index: 5)

Liangtai Sun (Citations: 396, h-index: 7)

Ziyang Rao (Citations: 8, h-index: 1)

Hao Liu (Citations: 81, h-index: 5)

Abstract

Modern information retrieval (IR) is no longer consumed primarily by humans but increasingly by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise: misleading or irrelevant information is no longer just a nuisance but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising, that is, maximizing usable evidence density and verifiability within a context window, is becoming the primary bottleneck across the full information access pipeline. We conceptualize this paradigm shift through a four-stage framework of IR challenges: from inaccessible to undiscoverable, to misaligned, and finally to unverifiable. Furthermore, we provide a pipeline-organized taxonomy of signal-to-noise optimization techniques, spanning indexing, retrieval, context engineering, verification, and agentic workflows. We also present research on information denoising in domains that rely heavily on retrieval, such as lifelong assistants, coding agents, deep research, and multimodal understanding.
