2604.14920v1 Apr 16, 2026 cs.AI

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

Yifu Chen (Citations: 353; h-index: 5)
Shengpeng Ji (Citations: 1,164; h-index: 16)
Qian Chen (Citations: 698; h-index: 12)
Tianle Liang (Citations: 31; h-index: 4)
Yangzhuo Li (Citations: 37; h-index: 4)
Ziqing Wang (Citations: 35; h-index: 3)
Wen Wang (Citations: 895; h-index: 11)
Zhou Zhao (Citations: 150; h-index: 4)
Zheng Liu (Citations: 13; h-index: 3)

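Achieving natural, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially improved text and vision-language models, and well-designed reward signals are crucial to RL performance. We consider RL a promising strategy for addressing this challenge in SDMs. However, a fundamental problem remains: existing automated metrics for interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, and therefore cannot provide reliable reward signals for RL. Human evaluation, by contrast, is informative but costly, inconsistent, and hard to scale. To address this problem, we propose a Dual-Axis Generative Reward Model. The model is trained to understand complex interaction dynamics using a detailed taxonomy and an annotated dataset; in addition to producing a single score, it provides separate evaluations of semantic quality and interaction timing. These dual outputs give precise diagnostic feedback for SDMs and a reliable, informative reward signal suited to online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a broad range of datasets, including synthetic dialogues and complex real-world interactions.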

Original Abstract

Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, while well-designed reward signals are crucial for the performance of RL. We consider RL a promising strategy to address the key challenge for SDMs. However, a fundamental barrier persists: prevailing automated metrics for assessing interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, failing to provide reliable reward signals for RL. On the other hand, human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this critical barrier by proposing a Dual-Axis Generative Reward Model, which is trained to understand complex interaction dynamics using a detailed taxonomy and an annotated dataset, produces a single score and, crucially, provides separate evaluations for semantic quality and interaction timing. Such dual outputs furnish precise diagnostic feedback for SDMs and deliver a dependable, instructive reward signal suitable for online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a wide spectrum of datasets, spanning synthetic dialogues and complex real-world interactions.
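The abstract describes a reward model that produces a single score together with separate evaluations for semantic quality and interaction timing. As a rough illustration of how such dual-axis output might be consumed in an RL loop, the Python sketch below uses entirely hypothetical names (`DualAxisEvaluation`, `scalar_reward`, `diagnose`) and assumes every score lies in [0, 1]. The weighted-mean blend is an assumption made here for illustration; the abstract does not state how the paper derives its single score or its reward.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DualAxisEvaluation:
    """Hypothetical container for one reward-model judgment (not from the paper)."""
    semantic: float  # semantic-quality score, assumed to lie in [0, 1]
    timing: float    # interaction-timing score, assumed to lie in [0, 1]
    overall: float   # single overall score, assumed to lie in [0, 1]


def scalar_reward(ev: DualAxisEvaluation,
                  w_semantic: float = 0.5,
                  w_timing: float = 0.5) -> float:
    """Blend the two axes into one scalar via a weighted mean (an assumption)."""
    if w_semantic < 0 or w_timing < 0 or w_semantic + w_timing == 0:
        raise ValueError("weights must be non-negative and not both zero")
    total = w_semantic + w_timing
    return (w_semantic * ev.semantic + w_timing * ev.timing) / total


def diagnose(ev: DualAxisEvaluation, threshold: float = 0.5) -> List[str]:
    """Return the names of the axes whose score falls below the threshold."""
    weak = []
    if ev.semantic < threshold:
        weak.append("semantic")
    if ev.timing < threshold:
        weak.append("timing")
    return weak


if __name__ == "__main__":
    # A response that is semantically fine but poorly timed.
    ev = DualAxisEvaluation(semantic=1.0, timing=0.25, overall=0.5)
    print(scalar_reward(ev))  # equal-weight blend of the two axes
    print(diagnose(ev))       # axes that need attention
```

Keeping the two axes separate until the last step is what allows the same judgment to serve both as a scalar reward and as diagnostic feedback: a policy whose reward drops can be traced to a semantic failure, a turn-taking failure, or both.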

4 Citations

0 Influential

8 Altmetric

44.0 Score
