2601.11863v1 Jan 17, 2026 cs.IR

Utilizing Metadata for Better Retrieval-Augmented Generation

Raquib Bin Yousuf (Citations: 34, h-index: 4)

Shengzhe Xu (Citations: 214, h-index: 6)

Mandar Sharma (Citations: 3,331, h-index: 6)

Andrew Neeser (Citations: 23, h-index: 2)

Chris Latimer (Citations: 21, h-index: 2)

Naren Ramakrishnan (Citations: 37, h-index: 4)

Original Abstract

Retrieval-Augmented Generation systems depend on retrieving semantically relevant document chunks to support accurate, grounded outputs from large language models. In structured and repetitive corpora such as regulatory filings, chunk similarity alone often fails to distinguish between documents with overlapping language. Practitioners often flatten metadata into input text as a heuristic, but the impact and trade-offs of this practice remain poorly understood. We present a systematic study of metadata-aware retrieval strategies, comparing plain-text baselines with approaches that embed metadata directly. Our evaluation spans metadata-as-text (prefix and suffix), a dual-encoder unified embedding that fuses metadata and content in a single index, dual-encoder late-fusion retrieval, and metadata-aware query reformulation. Across multiple retrieval metrics and question types, we find that prefixing and unified embeddings consistently outperform plain-text baselines, with unified embeddings at times exceeding prefixing while being easier to maintain. Beyond empirical comparisons, we analyze the embedding space, showing that metadata integration improves effectiveness by increasing intra-document cohesion, reducing inter-document confusion, and widening the separation between relevant and irrelevant chunks. Field-level ablations show that structural cues provide strong disambiguating signals. Our code, evaluation framework, and the RAGMATE-10K dataset are publicly hosted.
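
To make the metadata-as-text and late-fusion ideas in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the bag-of-words `cosine` is a toy stand-in for a learned encoder, the field names (`company`, `year`), the `|` separator, and the `alpha` weighting in `late_fusion_score` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def flatten_metadata(meta: dict) -> str:
    """Flatten metadata fields into one text line (format is an assumption)."""
    return " | ".join(f"{key}: {value}" for key, value in meta.items())


def with_prefix(chunk: str, meta: dict) -> str:
    """Metadata-as-text, prefix variant: metadata line goes before the chunk."""
    return f"{flatten_metadata(meta)}\n{chunk}"


def with_suffix(chunk: str, meta: dict) -> str:
    """Metadata-as-text, suffix variant: metadata line goes after the chunk."""
    return f"{chunk}\n{flatten_metadata(meta)}"


def bow(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def late_fusion_score(query: str, chunk: str, meta: dict, alpha: float = 0.5) -> float:
    """Score content and metadata separately, then mix them (hypothetical weighting)."""
    q = bow(query)
    content_sim = cosine(q, bow(chunk))
    meta_sim = cosine(q, bow(flatten_metadata(meta)))
    return alpha * content_sim + (1 - alpha) * meta_sim


# Two filings with identical boilerplate text but different metadata.
text = "Revenue increased due to higher product sales."
acme = {"company": "Acme", "year": 2023}
globex = {"company": "Globex", "year": 2022}
query = "Acme 2023 revenue"

# Plain text cannot tell the two chunks apart.
plain_acme = cosine(bow(query), bow(text))
plain_globex = cosine(bow(query), bow(text))

# Prefixing the metadata separates them.
prefix_acme = cosine(bow(query), bow(with_prefix(text, acme)))
prefix_globex = cosine(bow(query), bow(with_prefix(text, globex)))
```

In this toy setup the plain-text scores are identical for both chunks, while the prefixed and late-fusion scores favor the chunk whose metadata matches the query, which is the kind of disambiguation the abstract attributes to metadata integration.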
