2602.08382v1 Feb 09, 2026 cs.CL

Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning

Meishan Zhang (Citations: 790, h-index: 12)

Zhuoen Chen (Citations: 14, h-index: 2)

Baotian Hu (Citations: 388, h-index: 10)

Min Zhang (Citations: 473, h-index: 11)

Dongfang Li (Citations: 367, h-index: 10)

Abstract

Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.
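The pipeline described in the abstract (chunking, per-chunk compression, gated recall, and iterative reasoning over an evolving working memory) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the learned compressor is replaced by a keyword filter, the separately trained gating classifier by a term-overlap test, and the reasoning module by a loop that folds recalled blocks into a set of terms. The names `MemoryBlock`, `compress`, `gate`, and `recall_and_reason` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MemoryBlock:
    index: int           # position of the source chunk in the input
    summary: list[str]   # toy "compressed" representation of that chunk


def normalize(token: str) -> str:
    return token.lower().strip(".,?!")


def chunk_tokens(tokens: list[str], chunk_size: int) -> list[list[str]]:
    # Segment the long input into fixed-size chunks.
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def compress(index: int, chunk: list[str], budget: int) -> MemoryBlock:
    # Stand-in for the learned compressor (assumption): keep at most
    # `budget` distinct words longer than three characters.
    kept: list[str] = []
    for token in chunk:
        word = normalize(token)
        if len(word) > 3 and word not in kept:
            kept.append(word)
    return MemoryBlock(index, kept[:budget])


def gate(blocks: list[MemoryBlock], terms: set[str]) -> list[MemoryBlock]:
    # Stand-in for the gating classifier (assumption): a block is
    # relevant if it shares at least one term with the current state.
    return [b for b in blocks if terms & set(b.summary)]


def recall_and_reason(tokens: list[str], query: str, chunk_size: int,
                      budget: int, hops: int) -> tuple[list[int], set[str]]:
    blocks = [compress(i, c, budget)
              for i, c in enumerate(chunk_tokens(tokens, chunk_size))]
    # Working memory starts from the query and evolves after each hop.
    working = {normalize(t) for t in query.split() if len(normalize(t)) > 3}
    visited: list[int] = []
    for _ in range(hops):
        for block in gate(blocks, working):
            if block.index not in visited:
                visited.append(block.index)
                working |= set(block.summary)
    return visited, working


doc = ("Alice lives in Paris. The weather was mild all week. "
       "Paris hosts the Louvre museum.")
visited, working = recall_and_reason(
    doc.split(), "What does the home city of Alice host?",
    chunk_size=5, budget=4, hops=2)
print(visited)
```

In this toy run the first hop recalls only the chunk that mentions Alice; the second hop reaches the Louvre chunk through the term "paris" that the first hop added to working memory, while the unrelated weather chunk is never read. That is the intuition behind selective recall for multi-hop questions, without any of the learned components or the reinforcement-learning objective the paper describes.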

2 Citations

0 Influential

6 Altmetric

32.0 Score
