2604.14932v1 Apr 16, 2026 cs.AI

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

Yifu Chen (citations: 353, h-index: 5)
Shengpeng Ji (citations: 1,164, h-index: 16)
Qian Chen (citations: 698, h-index: 12)
Tianle Liang (citations: 31, h-index: 4)
Yangzhuo Li (citations: 37, h-index: 4)
Ziqing Wang (citations: 35, h-index: 3)
Wen Wang (citations: 895, h-index: 11)
Jingyu Lu (citations: 221, h-index: 5)
Haoxiao Wang (citations: 109, h-index: 5)
Xue Pu (citations: 25, h-index: 3)
Fan Zhuo (citations: 9, h-index: 2)
Zhou Zhao (citations: 150, h-index: 4)

Abstract

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
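
The abstract does not give the loss functions, the rollout statistic, or the mixing rule, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a hypothetical rule in which the spread of rewards across a group of rollouts sets a weight between a preference term (restricted to semantic tokens) and an anchoring term (restricted to acoustic tokens). The names `mixture_weight`, `hybrid_loss`, `scale`, and the token mask `is_semantic` are all invented for this example.

```python
from statistics import pstdev


def mixture_weight(rewards, scale=1.0):
    """Hypothetical rule: map the reward spread of a rollout group to [0, 1].

    If every rollout gets the same reward, the preference signal carries no
    information, so the weight on the preference term drops to 0.
    """
    if len(rewards) < 2:
        return 0.0
    return min(1.0, pstdev(rewards) / scale)


def hybrid_loss(pref_token_losses, anchor_token_losses, is_semantic,
                rewards, scale=1.0):
    """Mix a semantic-only preference term with an acoustic anchoring term.

    All inputs are plain Python lists of equal length (one entry per token),
    except `rewards`, which has one entry per rollout. This is an assumed
    formulation for illustration only.
    """
    lam = mixture_weight(rewards, scale)

    # Preference term: averaged over semantic tokens only.
    sem = [l for l, s in zip(pref_token_losses, is_semantic) if s]
    pref = sum(sem) / len(sem) if sem else 0.0

    # Anchoring term: averaged over acoustic (non-semantic) tokens only.
    aco = [l for l, s in zip(anchor_token_losses, is_semantic) if not s]
    anchor = sum(aco) / len(aco) if aco else 0.0

    return lam * pref + (1.0 - lam) * anchor


if __name__ == "__main__":
    mask = [True, False, True]
    # Identical rewards: the loss falls back entirely to the anchoring term.
    print(hybrid_loss([2.0, 4.0, 6.0], [1.0, 3.0, 5.0], mask, [1.0, 1.0]))
    # Spread-out rewards: the preference term receives more weight.
    print(hybrid_loss([2.0, 4.0, 6.0], [1.0, 3.0, 5.0], mask, [0.0, 1.0]))
```

In this toy version, a rollout group with no reward variation contributes only through the anchoring term, which is one simple way to read "avoid unreliable preference gradients"; the paper itself may use a different statistic or schedule.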

Citations: 4 (0 influential); Altmetric: 8; Score: 44.0
