2601.11354v1 Jan 16, 2026 cs.AI

AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems

Xinchi Chen (Citations: 1,454; h-index: 17)

Jingjing Gong (Citations: 267; h-index: 7)

Xuanjing Huang (Citations: 3,773; h-index: 33)

Xipeng Qiu (Citations: 1,073; h-index: 15)

Weiyi Wang (Citations: 3,360; h-index: 4)

Abstract

Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating on a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.

Citations: 1; Influential: 0; Altmetric: 16.5; Score: 83.5
