2603.25333v1 Mar 26, 2026 cs.CL

Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

Jean Lelong

A. Blangero

Paulo Roberto Montenegro de Albuquerque Júnior

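The effectiveness of Retrieval-Augmented Generation (RAG) depends heavily on the chunking method, that is, on how documents are split for indexing and retrieval. Yet the commonly used "one-size-fits-all" approaches often fail to capture the subtle structure and meaning of diverse texts. Although chunking plays a crucial role, there is no dedicated evaluation framework that can assess it independently of downstream performance. We challenge this convention by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document. The framework uses five new document-based intrinsic metrics to assess chunking quality directly along several dimensions: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC). To support the framework, we develop two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, and also provide targeted post-processing techniques. On a diverse dataset spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improved downstream RAG performance. Without changing models or prompts, our framework improved RAG results, raising answer correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by more than 30% (65 vs. 49). These results show that document-aware, adaptive chunking, guided by a complementary set of intrinsic metrics, offers a practical and effective way to build more robust RAG systems. Code: https://github.com/ekimetrics/adaptive-chunking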

Original Abstract

The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework improves RAG outcomes, raising answer correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at https://github.com/ekimetrics/adaptive-chunking.
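
The per-document selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not define the metrics, so `size_compliance` below is an assumed toy stand-in (the share of chunks whose length falls inside a target range). The two chunkers, the function names, and the thresholds are all hypothetical. The actual framework combines five metrics (RC, ICC, DCC, BI, SC) and is available in the linked repository.

```python
from typing import Callable, Dict, List, Tuple

Chunker = Callable[[str], List[str]]


def fixed_size_chunker(size: int) -> Chunker:
    """Hypothetical baseline: cut the text every `size` characters."""
    def chunk(text: str) -> List[str]:
        return [text[i:i + size] for i in range(0, len(text), size)]
    return chunk


def paragraph_chunker(text: str) -> List[str]:
    """Hypothetical baseline: split on blank lines, dropping empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def size_compliance(chunks: List[str], min_chars: int, max_chars: int) -> float:
    """Assumed toy metric: fraction of chunks whose length is within range."""
    if not chunks:
        return 0.0
    ok = sum(1 for c in chunks if min_chars <= len(c) <= max_chars)
    return ok / len(chunks)


def select_chunker(
    text: str,
    chunkers: Dict[str, Chunker],
    score: Callable[[List[str]], float],
) -> Tuple[str, List[str]]:
    """Run every candidate chunker on one document and keep the best-scoring one."""
    best_name, best_chunks, best_score = "", [], float("-inf")
    for name, chunker in chunkers.items():
        chunks = chunker(text)
        s = score(chunks)
        if s > best_score:  # ties keep the earlier candidate
            best_name, best_chunks, best_score = name, chunks, s
    return best_name, best_chunks


CANDIDATES: Dict[str, Chunker] = {
    "fixed_200": fixed_size_chunker(200),
    "paragraph": paragraph_chunker,
}


def score_fn(chunks: List[str]) -> float:
    # Illustrative thresholds only; not taken from the paper.
    return size_compliance(chunks, min_chars=50, max_chars=200)


# A document with well-sized paragraphs and one long unbroken document.
structured_doc = "A" * 120 + "\n\n" + "B" * 150 + "\n\n" + "C" * 130
unbroken_doc = "D" * 600

choice_structured, _ = select_chunker(structured_doc, CANDIDATES, score_fn)
choice_unbroken, _ = select_chunker(unbroken_doc, CANDIDATES, score_fn)
print(choice_structured, choice_unbroken)
```

Because the score is computed from the document and its chunks alone, the choice can differ from one document to the next without running the downstream RAG pipeline. This is the property the abstract attributes to intrinsic metrics.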
