2601.18184v1 Jan 26, 2026 cs.SD

VIBEVOICE-ASR Technical Report

Liang Wang (Citations: 81; h-index: 4)

Furu Wei (Citations: 249; h-index: 8)

Yi Zhu (Citations: 65; h-index: 5)

Shaohan Huang (Citations: 482; h-index: 7)

Zhiliang Peng (Citations: 4,378; h-index: 14)

Jianwei Yu (Citations: 38; h-index: 2)

Yaoyao Chang (Citations: 44; h-index: 3)

Zilong Wang (Citations: 10; h-index: 1)

Li Dong (Citations: 18; h-index: 2)

Ying Hao (Citations: 22; h-index: 2)

Yujie Tu (Citations: 11; h-index: 1)

Chenyu Yang (Citations: 69; h-index: 4)

Wenhui Wang (Citations: 10,274; h-index: 24)

Songcheng Xu (Citations: 79; h-index: 3)

Yutao Sun, Tsinghua University (Citations: 1,974; h-index: 12)

Hangbo Bao, Harbin Institute of Technology (Citations: 10,089; h-index: 17)

Weijiang Xu (Citations: 118; h-index: 4)

Zehua Wang (Citations: 61; h-index: 3)

Ting Song (Citations: 800; h-index: 9)

Yan Xia (Citations: 10; h-index: 1)

Zewen Chi, Beijing Institute of Technology (Citations: 2,848; h-index: 19)

Chuang Ding (Citations: 541; h-index: 8)

Shuai Wang (Citations: 11; h-index: 1)

Xie Chen (Citations: 83; h-index: 2)

Original Abstract

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.

10 Citations, 1 Influential, 12 Altmetric, 72.0 Score
