2602.13954v1 Feb 15, 2026 cs.SD

Eureka-Audio: Triggering Audio Intelligence in Compact Language Models

Shikun Feng (Citations: 3,750; h-index: 8)
Haifeng Wang (Citations: 209; h-index: 6)
Dan Zhang (Citations: 21; h-index: 3)
Yishu Lei (Citations: 9; h-index: 2)
Jing Hu (Citations: 3,255; h-index: 3)
Shuwei He (Citations: 41; h-index: 4)
Songhe Deng (Citations: 9; h-index: 1)
Xianlong Luo (Citations: 22; h-index: 3)
Danxiang Zhu (Citations: 1,043; h-index: 4)
Rui Liu (Citations: 33; h-index: 4)
Jingzhou He (Citations: 9; h-index: 2)
Yu Sun (Citations: 8; h-index: 2)
Hua Wu (Citations: 107; h-index: 5)

Abstract

We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models that are 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrates strong performance on automatic speech recognition (ASR), audio understanding, and dense audio captioning, matching or surpassing multiple 7B to 30B audio and omni-modal baselines. The model adopts a unified end-to-end architecture composed of a lightweight language backbone, a Whisper-based audio encoder, and a sparsely activated Mixture-of-Experts (MoE) adapter that explicitly accounts for audio heterogeneity and alleviates cross-modal optimization conflicts under limited capacity. To further enhance paralinguistic reasoning, we introduce DataFlux, a closed-loop audio instruction data synthesis and verification pipeline that constructs high-quality, logically consistent supervision from raw audio. Extensive evaluations across ASR, knowledge reasoning, safety, instruction following, and paralinguistic benchmarks demonstrate that Eureka-Audio achieves an efficient balance between computational cost and performance. These results establish Eureka-Audio as a strong and practical baseline for lightweight audio understanding models.
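The abstract names a sparsely activated Mixture-of-Experts adapter between the audio encoder and the language backbone but gives no details of its design. The following is a minimal sketch of generic top-k expert routing, not the authors' implementation: the class name `SparseMoEAdapter`, the dimensions, the number of experts, the value of `top_k`, and the use of a single linear projection per expert are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class SparseMoEAdapter:
    """Hypothetical top-k routed adapter: audio feature -> LM embedding."""

    def __init__(self, audio_dim, lm_dim, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: produces one score per expert for each audio frame.
        self.router = random_matrix(num_experts, audio_dim, rng)
        # Assumption: each expert is a single linear projection.
        self.experts = [random_matrix(lm_dim, audio_dim, rng) for _ in range(num_experts)]

    def forward(self, frame):
        scores = matvec(self.router, frame)
        # Sparse activation: keep only the top_k highest-scoring experts.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Renormalise gate weights over the chosen experts only.
        gates = softmax([scores[i] for i in chosen])
        output = [0.0] * len(self.experts[0])
        for gate, idx in zip(gates, chosen):
            projected = matvec(self.experts[idx], frame)
            output = [o + gate * p for o, p in zip(output, projected)]
        return output, chosen


adapter = SparseMoEAdapter(audio_dim=8, lm_dim=4)
frame = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 0.75, 1.0]
embedding, active_experts = adapter.forward(frame)
```

Only `top_k` of the experts run for each frame, so per-frame compute stays roughly constant as more experts are added. This is the general property that makes sparse routing attractive under a limited parameter budget; whether Eureka-Audio routes per frame, per utterance, or by some other rule is not stated in the abstract.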

