2601.14952v1 Jan 21, 2026 cs.CL

CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning

Weizhou Shen

Chenliang Li

Ming Yan

Zhiyuan Lu

Yingcheng Shi

Fei Huang

Original Abstract

While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a "sparse retrieval" assumption: that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM's general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.
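
The idea of decoupling reasoning from textual representation can be illustrated with a toy sketch. This is not the authors' framework, only a minimal example under assumed details: the `Record` schema, the revenue field, and the helper names `render_document` and `build_query` are all hypothetical. The point it shows is that the ground-truth answer is computed from structured records, while the system under test would only ever see the rendered prose.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Record:
    """One structured fact that is later rendered as a text document (hypothetical schema)."""
    company: str
    year: int
    revenue_musd: float


def render_document(rec: Record) -> str:
    """Turn a structured record into prose: the 'textual representation' side."""
    return (
        f"In its {rec.year} annual report, {rec.company} stated that "
        f"total revenue reached {rec.revenue_musd:.1f} million USD."
    )


def build_query(records: list[Record], year: int) -> tuple[str, float]:
    """Compute the answer from the structured records, never from the text."""
    values = [r.revenue_musd for r in records if r.year == year]
    question = (
        f"What was the average revenue (million USD) across all companies in {year}?"
    )
    # statistics.mean raises StatisticsError if no record matches the year.
    return question, round(mean(values), 2)


# Toy data, invented purely for illustration.
records = [
    Record("Acme", 2023, 120.0),
    Record("Birch", 2023, 80.0),
    Record("Cedar", 2023, 100.0),
    Record("Acme", 2022, 90.0),
]

corpus = [render_document(r) for r in records]   # what a model would read
question, answer = build_query(records, 2023)    # programmatically derived ground truth
```

Because the answer depends on every matching record, a system that retrieves only a few chunks of `corpus` cannot reliably reproduce it, which is the failure mode of the "sparse retrieval" assumption described in the abstract.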

