2604.10517v1 Apr 12, 2026 cs.AI

From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

Shuicheng Yan (Citations: 73, h-index: 5)
Tao Jin (Citations: 2, h-index: 1)
Zhou Zhao (Citations: 163, h-index: 8)
Yao Mu (Citations: 300, h-index: 2)
Xiaoda Yang (Citations: 447, h-index: 9)
Yuxiang Liu (Citations: 93, h-index: 2)
Lixin Yang (Citations: 1,395, h-index: 16)
Zhimeng Zhang (Citations: 58, h-index: 4)
Shenzhou Gao (Citations: 0, h-index: 0)
Can Wang (Citations: 0, h-index: 0)
Jingyang Xue (Citations: 4, h-index: 1)

Original Abstract

Modern vision-language models achieve strong performance in static perception, but remain limited in the complex spatiotemporal reasoning required for embodied, egocentric tasks. A major source of failure is their reliance on temporal priors learned from passive video data, which often leads to spatiotemporal hallucinations and poor generalization in dynamic environments. To address this, we present EgoTSR, a curriculum-based framework for learning task-oriented spatiotemporal reasoning. EgoTSR is built on the premise that embodied reasoning should evolve from explicit spatial understanding to internalized task-state assessment and finally to long-horizon planning. To support this paradigm, we construct EgoTSR-Data, a large-scale dataset comprising 46 million samples organized into three stages: Chain-of-Thought (CoT) supervision, weakly supervised tagging, and long-horizon sequences. Extensive experiments demonstrate that EgoTSR effectively eliminates chronological biases, achieving 92.4% accuracy on long-horizon logical reasoning tasks while maintaining high fine-grained perceptual precision, significantly outperforming existing open-source and closed-source state-of-the-art models.
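The three-stage organization described in the abstract can be illustrated with a minimal scheduling sketch. The abstract does not say how training moves between stages, so the strictly sequential ordering, the `Stage` class, the `stage_at` and `schedule` helpers, and the step budgets below are all hypothetical placeholders and not the authors' implementation. Only the three stage names come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int  # hypothetical training-step budget, not taken from the paper


# Stage names follow the abstract; the step budgets are illustrative only.
CURRICULUM = [
    Stage("cot_supervision", 3),
    Stage("weakly_supervised_tagging", 2),
    Stage("long_horizon_sequences", 4),
]


def stage_at(step: int, curriculum=CURRICULUM) -> str:
    """Return the name of the stage active at a 0-based global step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for stage in curriculum:
        boundary += stage.steps  # cumulative end of this stage
        if step < boundary:
            return stage.name
    raise ValueError("step exceeds the total curriculum length")


def schedule(curriculum=CURRICULUM) -> list[str]:
    """Expand the curriculum into one stage name per global step."""
    total = sum(stage.steps for stage in curriculum)
    return [stage_at(i, curriculum) for i in range(total)]
```

Under these assumed budgets, `schedule()` yields nine entries that move from CoT supervision through weakly supervised tagging to long-horizon sequences, mirroring the perception-to-planning progression the abstract describes.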
