2602.08533v1 Feb 09, 2026 cs.AI

Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO

Yu Liu, Kun Peng, Zhongqian Sun, Lei Jiang, Yanbing Liu, Conghui Tan, Wei Yang, Zining Zhu, Guohua Tang, Hao Peng

Abstract

Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion that incurs exponential overhead, it limits each node to aggregate rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.
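
The abstract gives no formulas, so the following is only a minimal sketch of how a stage-aware observation range and a group-relative advantage could fit together. Everything specific here is an assumption made for illustration, not the authors' method: the linear range schedule, the mapping from termination probability to reward (1 - p), the plain windowed sum, and all function names and numbers are hypothetical. Only the group-relative normalization (subtract the group mean, divide by the group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def observation_range(turn, max_turns, max_range=4, min_range=1):
    """Hypothetical schedule: shrink linearly from max_range (first turn)
    to min_range (last turn). The paper's actual schedule is not given."""
    if max_turns <= 1:
        return min_range
    frac = turn / (max_turns - 1)  # 0.0 at the first turn, 1.0 at the last
    return max(min_range, round(max_range - frac * (max_range - min_range)))


def turn_rewards(termination_probs):
    """Assumed mapping: a lower predicted termination probability
    gives a higher immediate reward."""
    return [1.0 - p for p in termination_probs]


def node_value(rewards, turn, max_turns):
    """Sum rewards over the window [turn, turn + range) rather than
    over the whole remaining dialogue (assumed aggregation rule)."""
    h = observation_range(turn, max_turns)
    return sum(rewards[turn:turn + h])


def group_relative_advantages(values, eps=1e-8):
    """Standard GRPO-style normalization within one group of siblings."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


if __name__ == "__main__":
    # Three sibling continuations sampled at turn 0 of a 6-turn dialogue.
    # The termination probabilities are made-up numbers.
    siblings = [
        [0.1, 0.2, 0.2, 0.3, 0.5, 0.6],
        [0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        [0.2, 0.1, 0.1, 0.2, 0.3, 0.4],
    ]
    values = [node_value(turn_rewards(p), turn=0, max_turns=6) for p in siblings]
    print(group_relative_advantages(values))
```

In this toy setup the early turn looks four turns ahead while the final turn looks only at itself, which mirrors the stated intent of wide ranges for early topic exploration and narrow ranges for late dialogue maintenance. The sketch does not model the rollout budget, so it says nothing about the exponential-to-polynomial reduction claimed in the abstract.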
