2602.17038v1 Feb 19, 2026 cs.AI

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

Yu Li

Citations: 8

h-index: 2

Shengtian Yang

Citations: 16

h-index: 3

Shuo He

Citations: 9

h-index: 2

Yewen Li

Citations: 49

h-index: 4

Peng Jiang

Citations: 86

h-index: 4

Q. Cai

Citations: 300

h-index: 9

Lei Feng

Citations: 25

h-index: 3

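Reinforcement learning (RL) has given large language model (LLM) agents a strong ability to solve complex tasks. However, existing RL methods typically use a single policy network, which causes simplicity bias: simple tasks occupy most of the parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy is to adopt a Mixture-of-Experts (MoE) architecture in the policy network, since MoE lets different parameters (experts) specialize in different tasks and so prevents simple tasks from dominating all parameters. A key limitation of traditional MoE, however, is its token-level routing, in which the router assigns each token to a specialized expert. This fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. This paper proposes Phase-Aware Mixture of Experts (PA-MoE). It features a lightweight phase router that learns latent phase boundaries directly from the RL objective, without pre-defined phase categories. The phase router then makes temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of the proposed PA-MoE.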

Original Abstract

Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.
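
The contrast the abstract draws between token-level routing and temporally consistent routing can be illustrated with a toy sketch. This is not the PA-MoE phase router: the abstract says phase boundaries are learned from the RL objective, whereas the sketch below swaps in a fixed, hand-picked rule (switch experts only when another expert's score leads by a margin). The function names, the `switch_margin` parameter, and the scores are all hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def token_level_route(scores: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring expert independently at every step."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def phase_level_route(
    scores: Sequence[Sequence[float]], switch_margin: float = 0.5
) -> List[int]:
    """Keep the current expert until another expert leads it by more than
    `switch_margin` (a hand-set stand-in for a learned phase boundary)."""
    assignments: List[int] = []
    current = None
    for row in scores:
        best = max(range(len(row)), key=row.__getitem__)
        if current is None or row[best] - row[current] > switch_margin:
            current = best  # treat this step as the start of a new phase
        assignments.append(current)
    return assignments


def count_switches(assignments: Sequence[int]) -> int:
    """Number of times the assigned expert changes between adjacent steps."""
    return sum(a != b for a, b in zip(assignments, assignments[1:]))


# Hypothetical router scores for 2 experts over 8 steps: noisy within a
# phase, with one clear change of regime at step 4.
demo_scores = [
    [0.9, 0.1], [0.5, 0.6], [0.8, 0.2], [0.45, 0.55],
    [0.1, 0.9], [0.6, 0.5], [0.2, 0.8], [0.3, 0.7],
]

if __name__ == "__main__":
    per_token = token_level_route(demo_scores)
    per_phase = phase_level_route(demo_scores)
    print(per_token, count_switches(per_token))
    print(per_phase, count_switches(per_phase))
```

On these made-up scores, per-step argmax routing flips between experts whenever the scores wobble, while the margin rule holds one expert per contiguous segment. That is the fragmentation-versus-consistency distinction the abstract describes, shown here without any learning.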

1 Citations

0 Influential

4.5 Altmetric

23.5 Score
