2603.15434v1 Mar 16, 2026 cs.AI

Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning

Xinpei Zhao (Citations: 25, h-index: 2)

Lu Xiang (Citations: 394, h-index: 10)

Chengqing Zong (Citations: 264, h-index: 9)

Yaping Zhang (Citations: 200, h-index: 7)

Jing Ye (Citations: 44, h-index: 3)

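Current emotional support dialogue systems are typically aligned using expert-defined scalar rewards, but these signals suffer from severe information sparsity. They cannot explain why a response failed or how to adapt to a user's changing state, and they often diverge from the actual goal of eliciting positive emotional shifts. In practice, the most direct and reliable learning signal comes from the user's reactions during ongoing interaction. We therefore propose Reaction Aware Policy Optimization (RAPO), a framework that optimizes the policy based on interaction consequences. RAPO treats dialogue as a reaction-driven process and uses simulated user responses to generate rich natural-language feedback through three core components. First, Hindsight Dialogue Selection identifies the pivotal dialogue turns that meaningfully affect the user's emotional trajectory. Second, Generative Hindsight Feedback converts user reactions into contrastive ranking signals and natural-language critiques. Third, Scalar-Verbal Hybrid Policy Optimization combines scalar reward optimization for global alignment with verbal feedback distillation for fine-grained semantic refinement. Extensive experiments on the ESC and Sotopia datasets show that RAPO significantly outperforms strong reinforcement learning baselines at driving positive interaction outcomes.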

Original Abstract

While current emotional support dialogue systems typically rely on expert-defined scalar rewards for alignment, these signals suffer from severe information sparsity. They cannot explain why a response failed or how to adapt to dynamic user states, often diverging from the actual goal of facilitating positive emotional shifts. In practice, the most direct and reliable learning signal emerges from the user's continuous reactions during ongoing interaction. We therefore propose Reaction Aware Policy Optimization (RAPO), a framework that optimizes over interaction consequences rather than rubric scores. RAPO treats dialogue as a reaction-driven process and utilizes simulated user responses to generate dense natural-language feedback through three core components: Hindsight Dialogue Selection, which isolates pivotal turns that meaningfully alter user emotional trajectories; Generative Hindsight Feedback, which transforms user reactions into contrastive ranking signals and natural-language critiques; and Scalar-Verbal Hybrid Policy Optimization, which couples scalar reward optimization for global alignment with verbal feedback distillation for fine-grained semantic refinement. Extensive experiments on ESC and Sotopia demonstrate that RAPO significantly outperforms strong reinforcement learning baselines in driving positive interaction outcomes.
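
As a rough illustration of two of the ideas named in the abstract, the sketch below shows one way pivotal turns might be selected and how a scalar reward term could be combined with a verbal-feedback distillation term. The abstract gives no formulas, so everything here is an assumption made for illustration: the `Turn` fields, the `min_shift` threshold, the REINFORCE-style surrogate, and the `distill_weight` coefficient are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    index: int             # position of the system response in the dialogue
    emotion_before: float  # hypothetical simulated-user emotion score before the response
    emotion_after: float   # hypothetical score after the simulated user reacts


def select_pivotal_turns(turns: List[Turn], min_shift: float = 0.5) -> List[Turn]:
    """Keep turns whose response moved the emotion score by at least min_shift.

    Assumption: 'pivotal' is approximated by the absolute change in a scalar
    emotion score; the paper's actual selection criterion may differ.
    """
    return [t for t in turns if abs(t.emotion_after - t.emotion_before) >= min_shift]


def hybrid_loss(log_prob_sampled: float, advantage: float,
                log_prob_revised: float, distill_weight: float = 0.1) -> float:
    """Scalar policy-gradient term plus a verbal-feedback distillation term.

    log_prob_sampled: policy log-probability of the sampled response.
    advantage: scalar reward signal, e.g. reward minus a baseline.
    log_prob_revised: policy log-probability of a critique-guided revision
        (assumed here to be the distillation target).
    """
    policy_term = -advantage * log_prob_sampled  # REINFORCE-style surrogate
    distill_term = -log_prob_revised             # negative log-likelihood of the revision
    return policy_term + distill_weight * distill_term


if __name__ == "__main__":
    dialogue = [Turn(0, 0.0, 0.1), Turn(1, 0.1, 0.9), Turn(2, 0.9, 0.2)]
    print([t.index for t in select_pivotal_turns(dialogue)])
    print(hybrid_loss(log_prob_sampled=-2.0, advantage=1.5, log_prob_revised=-1.0))
```

In this toy setup the scalar term pushes probability toward responses with positive advantage, while the distillation term pulls the policy toward the revised response suggested by the critique; how RAPO actually weights or schedules the two signals is not stated in the abstract.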
