2601.21468v4 Jan 29, 2026 cs.AI

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Yaorui Shi (Citations: 300, h-index: 9)
Yu Yang (Citations: 14, h-index: 2)
Wenyu Mao (Citations: 112, h-index: 6)
Yuxin Chen (Citations: 400, h-index: 8)
Xunliang Cai (Citations: 74, h-index: 5)
Xiang Wang (Citations: 126, h-index: 6)
Hui Su (Citations: 126, h-index: 6)
Shugui Liu (Citations: 272, h-index: 10)
An Zhang (Citations: 196, h-index: 8)
Qi Gu (Citations: 82, h-index: 5)

Abstract

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.
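The contrast the abstract draws between uniform per-token cost and layout-driven information density can be illustrated with a small sketch. This is not the authors' implementation: the role names, the font sizes in `FONT_PX`, the 0.5 em glyph-width approximation, and the uniform `fit_scale` shrinking rule are all assumptions chosen for illustration. The sketch compares how much of the memory space each item occupies when serialized as plain text versus when rendered at role-dependent font sizes.

```python
from dataclasses import dataclass
from math import sqrt

# Hypothetical font sizes in pixels per rich-text role (not from the paper).
FONT_PX = {"heading": 24, "highlight": 16, "body": 8}


@dataclass
class MemoryItem:
    text: str
    role: str  # one of the keys in FONT_PX


def rendered_area(item: MemoryItem, scale: float = 1.0) -> float:
    """Approximate pixel area of one item, assuming glyphs are 0.5 em wide."""
    size = FONT_PX[item.role] * scale
    return len(item.text) * (0.5 * size) * size


def text_shares(items: list[MemoryItem]) -> list[float]:
    """Share of a text budget per item: cost grows linearly with length."""
    total = sum(len(i.text) for i in items)
    return [len(i.text) / total for i in items]


def area_shares(items: list[MemoryItem]) -> list[float]:
    """Share of image area per item when rendered at role-dependent sizes."""
    total = sum(rendered_area(i) for i in items)
    return [rendered_area(i) / total for i in items]


def fit_scale(items: list[MemoryItem], budget_px: float) -> float:
    """Largest uniform scale (at most 1) whose total area fits the budget.

    Area grows with the square of the scale, hence the square root.
    """
    full = sum(rendered_area(i) for i in items)
    return min(1.0, sqrt(budget_px / full))


items = [
    MemoryItem("Key fact: Alice met Bob in Paris", "heading"),
    MemoryItem("weather small talk " * 10, "body"),
]
print(text_shares(items))   # the long low-value item dominates the text budget
print(area_shares(items))   # the heading dominates the rendered image
print(fit_scale(items, budget_px=4000.0))
```

In this toy setup the short key fact takes a small fraction of a text budget but most of the rendered area, while the long auxiliary passage is squeezed into small type. A real system would render an actual image and would likely compress low-priority content more aggressively than the uniform scaling used here.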

7 Citations, 1 Influential, 5 Altmetric, 34.0 Score
