2603.12056v1 Mar 12, 2026 cs.AI

XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Xiaoye Qu (Citations: 1,394; h-index: 19)

Zhaochen Su (Citations: 317; h-index: 5)

Guanyu Jiang (Citations: 68; h-index: 4)

Yi R. Fung (Citations: 260; h-index: 5)

Abstract

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
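
The accumulate-then-retrieve loop described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the `Entry` schema, the `KnowledgeBank` class, and the token-overlap similarity are all hypothetical stand-ins chosen for illustration. In particular, a plain text description takes the place of the visual observation that XSkill grounds its retrieval in, and exact-match deduplication takes the place of the cross-rollout critique.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entry:
    """One stored piece of knowledge (hypothetical schema, not from the paper)."""
    kind: str        # "experience" (action-level) or "skill" (task-level)
    context: str     # text description standing in for a visual observation
    guidance: str    # the reusable advice itself
    uses: int = 0    # usage count, a stand-in for the usage history fed back


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a placeholder for a real multimodal retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class KnowledgeBank:
    entries: list[Entry] = field(default_factory=list)

    def accumulate(self, kind: str, context: str, guidance: str) -> Entry:
        """Store a new entry, reusing an existing one if the guidance is identical."""
        for e in self.entries:
            if e.kind == kind and e.guidance == guidance:
                return e
        entry = Entry(kind, context, guidance)
        self.entries.append(entry)
        return entry

    def retrieve(self, kind: str, observation: str, k: int = 1) -> list[Entry]:
        """Return the top-k entries of one stream and record that they were used."""
        pool = [e for e in self.entries if e.kind == kind]
        ranked = sorted(pool, key=lambda e: jaccard(e.context, observation), reverse=True)
        hits = ranked[:k]
        for e in hits:
            e.uses += 1
        return hits


# Accumulation: knowledge distilled from earlier rollouts (made-up examples).
bank = KnowledgeBank()
bank.accumulate("experience", "blurry chart with small axis labels",
                "zoom in before reading values")
bank.accumulate("experience", "rotated scanned document",
                "rotate the image before OCR")
bank.accumulate("skill", "chart question answering",
                "locate chart, crop region, read values, compute answer")

# Inference: fetch the action-level hint that best matches the current observation.
hits = bank.retrieve("experience", "a blurry chart with tiny labels")
print(hits[0].guidance)
```

Keeping the two streams in one store but filtering by `kind` at retrieval time is only one possible design; separate stores per stream would work equally well for this illustration.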

34 Citations

4 Influential

9.5 Altmetric

89.5 Score
