2604.09121v2 Apr 10, 2026 cs.CL

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

Authors

Qinyu Chen (Citations: 9,265; h-index: 6)
Zixu Jiang (Citations: 179; h-index: 4)
Xing-Xing Zhao (Citations: 5; h-index: 1)
Wupeng Wang (Citations: 149; h-index: 7)
Xiangang Li (Citations: 44; h-index: 4)
Xie Chen (Citations: 227; h-index: 7)
Peng Wang (Citations: 34; h-index: 2)
Yanqiao Zhu (Citations: 149; h-index: 2)
Xipeng Qiu (Citations: 130; h-index: 3)
Zhifu Gao (Citations: 2,340; h-index: 18)
Kai Yu (Citations: 614; h-index: 8)

Original Abstract

Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction, an essential component of human communication, has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), and the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.
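
The abstract's point that WER treats all words equally can be illustrated with a minimal sketch. It uses the standard definition of WER (word-level edit distance divided by the number of reference words) and is not the authors' evaluation code. The function name `wer` and the three example sentences are assumptions made up for illustration; they do not come from the paper or its benchmarks.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, chosen only to illustrate the limitation.
reference = "please do not send the payment today"
minor_error = "please do not send a payment today"    # "the" -> "a": meaning preserved
major_error = "please do now send the payment today"  # "not" -> "now": meaning reversed

wer_minor = wer(reference, minor_error)
wer_major = wer(reference, major_error)
print(f"minor error WER: {wer_minor:.3f}")
print(f"major error WER: {wer_major:.3f}")
```

Both hypotheses contain exactly one substituted word out of seven, so they receive the same WER even though only the second one changes what the speaker meant. This is the kind of gap a sentence-level semantic metric is meant to expose.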
