2603.01481v1 Mar 02, 2026 cs.AI

Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents

Xunliang Cai (Citations: 15, h-index: 2)

Ke Zeng (Citations: 56, h-index: 4)

Yiwei Wang (Citations: 424, h-index: 12)

Ai Jian (Citations: 8, h-index: 2)

Xinyu Huang (Citations: 27, h-index: 3)

Jingqing Ruan (Citations: 57, h-index: 4)

Haojin Yang (Citations: 5,156, h-index: 29)

Weipeng Zhang (Citations: 47, h-index: 4)

Original Abstract

Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, so high-magnitude session-level rewards overwhelm subtler turn-level signals, which leads to unstable training or reward hacking. To address this issue, we propose Dual-Horizon Credit Assignment (DuCA), a framework that disentangles optimization across time scales. Its core component, Horizon-Independent Advantage Normalization (HIAN), normalizes the advantages from turn-level and session-level rewards separately before fusing them, so that immediate and long-term objectives contribute balanced gradients to the policy update. Extensive experiments with a high-fidelity user simulator show that DuCA outperforms the state-of-the-art GRPO baseline: it achieves a 6.82% relative improvement in conversion rate, reduces inter-sentence repetition by 82.28%, and lowers the identity detection rate by 27.35%. These results indicate a substantial improvement for an industrial sales scenario, balancing the dual demands of strategic performance and naturalistic language generation.
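
The abstract names the idea behind HIAN (normalize each horizon's advantages separately, then fuse) but does not give its formulas. The following Python block is therefore only a minimal sketch under stated assumptions: advantages are standardized GRPO-style over a group of sampled sessions, and fusion is a weighted sum. The names `zscore`, `fused_advantages`, `w_turn`, `w_session`, and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev

EPS = 1e-8  # assumed small constant to avoid division by zero


def zscore(values):
    """Standardize a flat list to zero mean and (near) unit variance."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def fused_advantages(turn_rewards, session_rewards, w_turn=1.0, w_session=1.0):
    """Normalize each horizon separately, then fuse per turn.

    turn_rewards: one list of per-turn rewards for each sampled session.
    session_rewards: one scalar outcome reward for each sampled session.
    The weighted sum used for fusion is an assumption of this sketch.
    """
    # Turn horizon: standardize over every turn in the sampled group.
    flat = [r for session in turn_rewards for r in session]
    flat_adv = iter(zscore(flat))
    turn_adv = [[next(flat_adv) for _ in session] for session in turn_rewards]

    # Session horizon: standardize over the group's outcome rewards.
    session_adv = zscore(session_rewards)

    # Fusion: each turn gets its own turn-level advantage plus the
    # advantage of the session it belongs to.
    return [
        [w_turn * a + w_session * s for a in turns]
        for turns, s in zip(turn_adv, session_adv)
    ]


# Hypothetical group of three sessions: small turn-level quality scores
# and a large session-level reward when the sale converts.
turn_rewards = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.3]]
session_rewards = [100.0, 0.0, 0.0]
advantages = fused_advantages(turn_rewards, session_rewards)
```

Because each horizon is standardized before fusion, rescaling the session reward (for example from 100.0 to 1.0) leaves the fused advantages essentially unchanged, so the turn-level signal is not drowned out by the magnitude of the session-level reward. Summing the raw rewards first and normalizing afterwards would not have this property.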
