2601.04786v2 Jan 08, 2026 cs.LG

AgentOCR: Reimagining Agent History via Optical Self-Compression

Zhenglin Wan (Citations: 25, h-index: 3)
Lang Feng (Citations: 729, h-index: 11)
Fuchao Yang (Citations: 23, h-index: 3)
Bo An (Citations: 18, h-index: 1)
Feng Chen (Citations: 30, h-index: 3)
Xin Cheng (Citations: 19, h-index: 2)
Haiyang Xu (Citations: 20, h-index: 2)
Ming Yan (Citations: 145, h-index: 2)

Original Abstract

Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching. By decomposing history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Remarkably, results demonstrate that AgentOCR preserves over 95% of text-based agent performance while substantially reducing token consumption (>50%), yielding consistent token and memory efficiency. Our further analysis validates a 20x rendering speedup from segment optical caching and the effective strategic balancing of self-compression.
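
The segment caching idea in the abstract (split the history into hashable segments, keep rendered results in a cache, skip re-rendering what has not changed) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name SegmentRenderCache, the fake_render stand-in, and the choice of SHA-256 as the segment key are all assumptions made for illustration, and a real system would rasterize text into image tiles instead of returning bytes.

```python
import hashlib
from typing import Callable, Dict, List


class SegmentRenderCache:
    """Caches rendered output per history segment, keyed by a content hash.

    Hypothetical helper for illustration; not taken from the paper's code.
    """

    def __init__(self, render: Callable[[str], bytes]) -> None:
        self._render = render              # assumed text-to-image renderer
        self._cache: Dict[str, bytes] = {}
        self.render_calls = 0              # segments actually rendered so far

    def _key(self, segment: str) -> str:
        # Identical segment text always maps to the same cache key.
        return hashlib.sha256(segment.encode("utf-8")).hexdigest()

    def get(self, segment: str) -> bytes:
        key = self._key(segment)
        if key not in self._cache:
            # Cache miss: render once and remember the result.
            self._cache[key] = self._render(segment)
            self.render_calls += 1
        return self._cache[key]

    def render_history(self, segments: List[str]) -> List[bytes]:
        # Only segments not seen before trigger a render call.
        return [self.get(s) for s in segments]


def fake_render(segment: str) -> bytes:
    """Stand-in for rasterizing text; a real renderer would return image data."""
    return b"IMG:" + segment.encode("utf-8")


cache = SegmentRenderCache(fake_render)
history: List[str] = []
for turn in ["obs: kitchen", "act: open fridge", "obs: fridge is open"]:
    history.append(turn)
    tiles = cache.render_history(history)  # full history requested every turn

print(cache.render_calls)  # 3 renders, versus 1 + 2 + 3 = 6 without caching
```

In this toy run the full history is requested at every turn, yet each segment is rendered only once, so the render count grows linearly with the number of turns instead of quadratically.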

17 Citations

1 Influential

5.5 Altmetric

46.5 Score
