2602.12249v2 Feb 12, 2026 cs.AI

"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

Martijn Bartelds

Federico Bianchi

James Zou

Kaitlyn Zhou

Abstract

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with fewer than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
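
To make the two headline quantities concrete, the following is a minimal sketch of how a per-group transcription error rate and a relative accuracy improvement could be computed. The abstract does not specify the scoring rule, so exact match after light normalization is an assumption here, and the function names, group labels, and sample records are hypothetical illustrations rather than the authors' code or data.

```python
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rule)."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def error_rate_by_group(records):
    """Return {group: share of utterances transcribed incorrectly}.

    records: iterable of (group, reference, hypothesis) tuples.
    An utterance counts as an error when the normalized hypothesis
    does not exactly match the normalized reference.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, reference, hypothesis in records:
        totals[group] += 1
        if normalize(reference) != normalize(hypothesis):
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def relative_improvement(base_accuracy: float, tuned_accuracy: float) -> float:
    """Accuracy gain expressed relative to the base model's accuracy."""
    return (tuned_accuracy - base_accuracy) / base_accuracy


# Made-up records for illustration only; not taken from the paper.
records = [
    ("english_primary", "Divisadero Street", "divisadero street."),
    ("english_primary", "Geary Boulevard", "Gary Boulevard"),
    ("non_english_primary", "Noriega Street", "Noriega Street"),
    ("non_english_primary", "Divisadero Street", "Diva Zero Street"),
    ("non_english_primary", "Geary Boulevard", "Geary Blvd"),
    ("non_english_primary", "Taraval Street", "Terrible Street"),
]
rates = error_rate_by_group(records)
```

Note that a relative improvement is measured against the base accuracy, not in percentage points: with hypothetical accuracies of 0.40 before and 0.64 after fine-tuning, `relative_improvement(0.40, 0.64)` is about 0.6, that is, a 60% relative gain from a 24-point absolute gain.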
