2601.18037v1 Jan 25, 2026 eess.AS

SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-channel Multi-speaker ASR on Arbitrary Microphone Arrays

Yiwen Shao, Yong Xu, S. Khudanpur, Dong Yu

Abstract

Spatial information is a critical cue for multi-channel multi-speaker target speech recognition. Most state-of-the-art multi-channel Automatic Speech Recognition (ASR) systems extract spatial features only during the speech separation stage and then run standard single-channel ASR on the separated speech. This results in an inefficient, lengthy pipeline and sub-optimal ASR performance, because errors from the preprocessing modules accumulate. Furthermore, most spatial feature extraction methods depend on knowledge of speaker positions and microphone topology, which ties the systems to specific settings and makes them hard to adapt to new equipment. In this work, we address these issues with a lightweight embedding module named SpatialEmb, which extracts and encodes spatial information directly for the ASR model and supports both fixed and arbitrary microphone topologies. We conduct comprehensive experiments on AliMeeting, a real meeting corpus, to determine the optimal model design for SpatialEmb in terms of both performance and efficiency. Our best model, trained on the 105-hour Train-Ali-far set, achieves character error rates (CER) of 17.04% and 20.32% on the Eval and Test sets, respectively, establishing a new state-of-the-art result with the same training data.
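The results above are reported as character error rate (CER). As a reminder of what that metric measures, here is a minimal sketch that computes CER as the character-level Levenshtein distance divided by the reference length. The function names `edit_distance` and `cer` are illustrative, and the sketch makes no claim about the paper's actual scoring pipeline (text normalization, handling of overlapping speakers, and so on), which the abstract does not describe.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop a reference character
            insertion = curr[j - 1] + 1  # extra hypothesis character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER = (substitutions + deletions + insertions) / len(ref)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(f"{cer('abcd', 'abxd'):.2%}")
```

Because insertions count as errors while the denominator is the reference length only, CER can exceed 100% when the hypothesis is much longer than the reference.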
