2605.15198v1 May 14, 2026 cs.CV

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Ziyu Guo (citations: 2,663; h-index: 20)

Xinyan Chen (citations: 202; h-index: 4)

Rain Liu (citations: 0; h-index: 0)

Pheng-Ann Heng (citations: 1,170; h-index: 17)

Original Abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
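The abstract does not spell out the LA-GRPO objective, so the following is only a minimal sketch of one possible reading: a GRPO-style group-relative advantage, plus an auxiliary term restricted to functional-token positions and scaled by a fixed (static) weight. The token names `<crop>` and `<zoom>`, the function names, the `aux_weight` value, and the exact form of the auxiliary term are all assumptions for illustration, not the authors' implementation. Clipping and KL regularization, which GRPO normally includes, are omitted for brevity.

```python
from statistics import mean, pstdev

# Hypothetical functional tokens; the paper's actual vocabulary is not given here.
FUNCTIONAL_TOKENS = {"<crop>", "<zoom>"}


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchored_loss(tokens, logprobs, advantage, aux_weight=0.5):
    """Policy-gradient loss for one response, plus an assumed auxiliary term.

    The base term averages over all tokens. The auxiliary term covers only
    functional-token positions and is scaled by a static weight, so those
    sparse tokens receive a larger gradient signal.
    """
    n = len(tokens)
    base = -advantage * sum(logprobs) / n
    functional_logprobs = [
        lp for tok, lp in zip(tokens, logprobs) if tok in FUNCTIONAL_TOKENS
    ]
    aux = -advantage * sum(functional_logprobs) / n
    return base + aux_weight * aux


# Two sampled responses to the same prompt: one rewarded, one not.
advantages = group_relative_advantages([1.0, 0.0])
loss = anchored_loss(
    tokens=["look", "<crop>", "answer"],
    logprobs=[-0.5, -2.0, -0.5],
    advantage=advantages[0],
)
print(advantages, loss)
```

Under this reading, setting `aux_weight=0` recovers the plain advantage-weighted loss, which makes the extra gradient contribution on functional tokens easy to isolate.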

