2604.18131v1 Apr 20, 2026 cs.AI

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

Nuo Chen

Citations: 513

h-index: 12

Haitao Mi

Citations: 6

h-index: 1

Dongyang Ma

Citations: 38

h-index: 3

Yan Wang

Citations: 4

h-index: 1

Qifan Zhang

Citations: 85

h-index: 4

Jing Tang

Citations: 29

h-index: 3

Jia Li

Citations: 172

h-index: 8

Tianqing Fang

Tencent AI Lab

Citations: 1,167

h-index: 20

Original Abstract

Most agents today "self-evolve" by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self-evolution to adapt to unknown environments using its internal parameters. When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.
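
The abstract does not give the exact form of the outcome-based reward, so the following is only a minimal sketch of one plausible reading: the reward is the gain in downstream success rate when the agent is given its self-generated world knowledge, compared with running the same tasks without it. The names `TaskRunner`, `success_rate`, `knowledge_reward`, and `toy_runner` are hypothetical and do not come from the paper.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface (an assumption, not the paper's API): runs one task,
# optionally conditioned on world knowledge, and reports whether it succeeded.
TaskRunner = Callable[[str, Optional[str]], bool]


def success_rate(run_task: TaskRunner, tasks: Sequence[str],
                 knowledge: Optional[str]) -> float:
    """Fraction of tasks solved under the given knowledge (None = unassisted)."""
    if not tasks:
        raise ValueError("tasks must be non-empty")
    solved = sum(1 for task in tasks if run_task(task, knowledge))
    return solved / len(tasks)


def knowledge_reward(run_task: TaskRunner, tasks: Sequence[str],
                     knowledge: str) -> float:
    """Assumed reward: success rate with knowledge minus success rate without."""
    baseline = success_rate(run_task, tasks, None)
    assisted = success_rate(run_task, tasks, knowledge)
    return assisted - baseline


def toy_runner(task: str, knowledge: Optional[str]) -> bool:
    """Toy stand-in for an agent: succeeds only if the knowledge mentions the task."""
    return knowledge is not None and task in knowledge


tasks = ["checkout", "search", "login", "profile"]
knowledge = "The site exposes search and login from the top bar."
reward = knowledge_reward(toy_runner, tasks, knowledge)
print(reward)  # 0.5: two of four tasks become solvable with the knowledge
```

In this reading the reward is computed only during training; at inference the agent would explore and summarize without any such signal, as the abstract states.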

0 Citations

0 Influential

10 Altmetric

50.0 Score
