2602.15854v2 Jan 24, 2026 cs.CL

Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jing Xu (Citations: 1,456; h-index: 11)

Xingyu Ren (Citations: 16; h-index: 2)

Zhou-Peng Shou (Citations: 0; h-index: 0)

Yumeng Zhang (Citations: 83; h-index: 3)

Zhi-Qiang You (Citations: 0; h-index: 0)

Original Abstract

Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, objectives that align poorly with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, respectively, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent's critical role in long-horizon optimization. GOPO also shows consistent improvements on the other datasets. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.
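The decoupling described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the strategy labels, the score table standing in for a learned Expert Agent policy, the response templates standing in for an LLM-based Customer Service Agent, and the per-turn reward function are all hypothetical placeholders. The sketch only shows the control flow (strategy first, then a response conditioned on that strategy) and what scoring a whole dialogue trajectory, rather than individual tokens, might look like.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical strategy labels; the abstract does not list the real ones.
STRATEGIES = ["clarify_need", "recommend_product", "resolve_complaint"]

# Hypothetical templates standing in for an LLM conditioned on a strategy.
TEMPLATES = {
    "clarify_need": "Could you tell me more about: {msg}?",
    "recommend_product": "Based on '{msg}', here is a product you may like.",
    "resolve_complaint": "Sorry about '{msg}'. Let me fix that for you.",
}


@dataclass
class Turn:
    user: str
    strategy: str
    response: str


def expert_agent(user_msg: str, scores: Dict[str, float]) -> str:
    """Choose a strategy. A score table stands in for a learned policy."""
    return max(STRATEGIES, key=lambda s: scores.get(s, 0.0))


def customer_service_agent(strategy: str, user_msg: str) -> str:
    """Generate a response constrained to the selected strategy."""
    return TEMPLATES[strategy].format(msg=user_msg)


def run_dialogue(user_msgs: List[str], scores: Dict[str, float]) -> List[Turn]:
    """Plan first, then execute, once per user message."""
    turns = []
    for msg in user_msgs:
        strategy = expert_agent(msg, scores)
        turns.append(Turn(msg, strategy, customer_service_agent(strategy, msg)))
    return turns


def trajectory_return(
    turns: List[Turn], turn_reward: Callable[[Turn], float], gamma: float = 1.0
) -> float:
    """Discounted sum of per-turn rewards over the whole dialogue."""
    return sum((gamma ** t) * turn_reward(turn) for t, turn in enumerate(turns))


def preferred(
    traj_a: List[Turn], traj_b: List[Turn], turn_reward: Callable[[Turn], float]
) -> List[Turn]:
    """Return the trajectory with the higher trajectory-level return."""
    if trajectory_return(traj_a, turn_reward) >= trajectory_return(traj_b, turn_reward):
        return traj_a
    return traj_b
```

In this sketch, a trajectory-level preference pair is simply two complete dialogues ranked by their returns; how GOPO actually constructs and optimizes such preferences is not specified in the abstract.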

Citations: 0 | Influential: 0 | Altmetric: 5.5 | Score: 27.5
