2604.10708v1 Apr 12, 2026 cs.SD

Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Yike Guo (Citations: 2; h-index: 1)
Ruibin Yuan (Citations: 552; h-index: 11)
Wei Xue (Citations: 16; h-index: 1)
Zhaoyang Liu (Citations: 49; h-index: 4)
Zeyue Tian (Citations: 325; h-index: 7)
Binxin Yang (Citations: 49; h-index: 5)
Jiexuan Zhang (Citations: 3; h-index: 1)
Hubery Yin (Citations: 63; h-index: 4)
Qifeng Chen (Citations: 117; h-index: 5)
Chen Li (Citations: 54; h-index: 5)
Jing Lyu (Citations: 6; h-index: 1)

Original Abstract

Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio-Omni.

Citations: 0; Influential: 0; Altmetric: 5.5; Score: 27.5
