2601.21468v1 Jan 29, 2026 cs.AI

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Yaorui Shi (Citations: 300, h-index: 9)
Yu Yang (Citations: 14, h-index: 2)
Wenyu Mao (Citations: 112, h-index: 6)
Yuxin Chen (Citations: 400, h-index: 8)
Xunliang Cai (Citations: 74, h-index: 5)
Xiang Wang (Citations: 126, h-index: 6)
Hui Su (Citations: 126, h-index: 6)
Shugui Liu (Citations: 272, h-index: 10)
An Zhang (Citations: 196, h-index: 8)
Qi Gu (Citations: 82, h-index: 5)

Abstract

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To address this, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.
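To make the idea of "adaptive information density through visual layout" concrete, here is a minimal sketch. It is not the authors' implementation, and the abstract does not specify how layout is chosen. The `MemoryItem` class, the importance levels, the font sizes, and the area estimate are all hypothetical choices made for illustration. The sketch keeps crucial items at a large font and shrinks auxiliary items first until the estimated rendered area fits a pixel budget.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    importance: int  # hypothetical scale: 2 = crucial, 1 = supporting, 0 = auxiliary


# Hypothetical starting font sizes (in pixels) per importance level.
BASE_FONT_PX = {2: 16, 1: 12, 0: 8}
MIN_FONT_PX = 4  # assumed floor below which text is treated as unreadable


def area_px(text: str, font_px: int) -> int:
    """Rough area estimate: each character takes font_px * (font_px // 2) pixels."""
    return len(text) * font_px * max(font_px // 2, 1)


def total_area(items, fonts) -> int:
    return sum(area_px(item.text, font) for item, font in zip(items, fonts))


def allocate_fonts(items, budget_px):
    """Shrink the least important items first until the estimate fits the budget."""
    fonts = [BASE_FONT_PX[item.importance] for item in items]
    for level in sorted(BASE_FONT_PX):  # least important level first
        while total_area(items, fonts) > budget_px:
            shrinkable = [
                i for i, item in enumerate(items)
                if item.importance == level and fonts[i] > MIN_FONT_PX
            ]
            if not shrinkable:
                break  # this level cannot shrink further; move to the next one
            for i in shrinkable:
                fonts[i] -= 1
    return fonts


items = [
    MemoryItem("Alice moved to Paris in 2019", importance=2),
    MemoryItem("weather was mild that week, nothing else happened", importance=0),
]
fonts = allocate_fonts(items, budget_px=4500)
```

In this toy setting the crucial item keeps its full font size while the auxiliary item is rendered smaller, so the same image budget carries more of the high-value evidence. A plain-text memory cannot make that trade, because its cost grows with length regardless of importance.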

7 Citations

1 Influential

5 Altmetric

34.0 Score
