2601.08173v1 Jan 13, 2026 cs.AI

The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Daocheng Fu

Citations: 1,691

h-index: 20

Jianbiao Mei

Citations: 915

h-index: 16

Rong Wu

Zhejiang University

Citations: 128

h-index: 5

Xuemeng Yang

Citations: 507

h-index: 12

Ding Wang

Citations: 74

h-index: 5

Pinlong Cai

Citations: 2,009

h-index: 19

Yong Liu

Citations: 23

h-index: 2

Licheng Wen

Citations: 1,214

h-index: 15

Botian Shi

Citations: 561

h-index: 13

Jia Xu

Citations: 15

h-index: 2

Abstract

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness in stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce \method{}, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, \method{} evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv.

1 Citation

0 Influential

42.824746787308 Altmetric

215.1 Score
