2602.21772v1 Feb 25, 2026 cs.SD

UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

Yuxuan Chen (citations: 20, h-index: 3)
Peize He (citations: 6, h-index: 1)
Junzi Zhang (citations: 10, h-index: 2)
Haoyuan Xu (citations: 22, h-index: 3)

Abstract

A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
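The kNN evaluation mentioned in the abstract can be illustrated with a minimal sketch of a k-nearest-neighbor probe over frozen embeddings. Everything specific here is an assumption rather than the paper's protocol: the function name `knn_probe_accuracy`, the choice of cosine similarity, `k=5`, majority voting, and the premise that each clip has already been reduced to one fixed-size embedding vector. The abstract does not state how embeddings are pooled or how per-task scores are normalized and weighted, so those steps are left out.

```python
import numpy as np


def knn_probe_accuracy(train_x, train_y, test_x, test_y, k=5):
    """Hypothetical kNN probe: label each test embedding by majority vote
    among its k most cosine-similar training embeddings, return accuracy.

    Labels are assumed to be non-negative integers.
    """
    train_x = np.asarray(train_x, dtype=np.float64)
    test_x = np.asarray(test_x, dtype=np.float64)
    train_y = np.asarray(train_y)
    test_y = np.asarray(test_y)

    # L2-normalize rows so that a dot product equals cosine similarity.
    train_n = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    test_n = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)

    # Similarity matrix of shape (n_test, n_train).
    sims = test_n @ train_n.T

    # Indices of the k most similar training points for each test point.
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    nn_labels = train_y[nn_idx]

    # Majority vote per test point (ties go to the smallest label).
    preds = np.array([np.bincount(row).argmax() for row in nn_labels])
    return float((preds == test_y).mean())
```

Because the encoder stays frozen and the probe has no trainable parameters, a score from this kind of check depends only on how well the embedding space already separates the classes, which is why kNN results are usually lower than MLP-probe results on the same features.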
