2604.10367v1 Apr 11, 2026 cs.AI

Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

Xinyi Yu (Citations: 12, h-index: 2)

Haotian Wang (Citations: 75, h-index: 5)

Yuzhe Weng (Citations: 38, h-index: 3)

Xiaoyan Wu (Citations: 41, h-index: 3)

Jun Du (Citations: 10, h-index: 2)

Haoran Xu (Citations: 6, h-index: 2)

Shan He (Citations: 46, h-index: 3)

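Audio-driven human video generation has achieved remarkable results in monologue scenarios, driven in particular by advances in powerful video generation foundation models. Moving beyond monologue, this work starts from the observation that genuine human communication is inherently a two-way interactive process: a virtual agent must not only produce its own speech but also react naturally to incoming conversational audio. Most existing methods extend the conventional audio-driven paradigm to listening scenarios. However, relying on strict frame-by-frame alignment can make the model's response to long-range conversational flow rigid, whereas directly introducing global attention severely degrades lip synchronization. Recognizing the inherent difference in temporal scale between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building on this, we construct a bidirectional interactive virtual agent that can handle talking and listening simultaneously. We also introduce VoxHear, a carefully curated conversational dataset with fully separated speech and background audio tracks. Extensive experiments show that the proposed method combines strong temporal alignment with deep contextual semantics, achieving a new state of the art in generating highly natural and responsive bidirectional interactive digital humans. Project page: https://warmcongee.github.io/beyond-monologue/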

Original Abstract

Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment renders the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal scale discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building upon this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce a rigorously cleaned talking-listening dataset, VoxHear, featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state-of-the-art for generating highly natural and responsive full-duplex interactive digital humans. The project page is available at https://warmcongee.github.io/beyond-monologue/
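The abstract does not give the exact form of the multi-head Gaussian kernel, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the kernel acts as an additive log-space bias on cross-attention logits between video-frame queries and audio keys, with one hypothetical width `sigma` per head: narrow heads stay close to frame-to-frame alignment, while wide heads approach global attention. The names `gaussian_log_bias` and `kernel_attention`, the normalized time axis, and the chosen widths are all assumptions made for illustration.

```python
import numpy as np


def gaussian_log_bias(num_video, num_audio, sigmas):
    """Return an additive attention bias of shape (heads, num_video, num_audio).

    Assumption: both sequences are mapped onto a shared normalized time axis
    in [0, 1), and head h penalizes temporal distance d by d^2 / (2 * sigma_h^2).
    """
    t_video = (np.arange(num_video) + 0.5) / num_video
    t_audio = (np.arange(num_audio) + 0.5) / num_audio
    dist2 = (t_video[:, None] - t_audio[None, :]) ** 2  # (V, A)
    sig = np.asarray(sigmas, dtype=np.float64)[:, None, None]  # (H, 1, 1)
    return -dist2[None, :, :] / (2.0 * sig ** 2)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def kernel_attention(q, k, v, sigmas):
    """Cross-attention with a per-head Gaussian temporal bias.

    q: (H, V, D) video-frame queries; k, v: (H, A, D) audio keys and values.
    Returns the attended output (H, V, D) and the weights (H, V, A).
    """
    _, num_video, dim = q.shape
    num_audio = k.shape[1]
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(dim)  # content term, (H, V, A)
    logits = logits + gaussian_log_bias(num_video, num_audio, sigmas)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, V, A, D = 3, 16, 16, 8
    q = rng.standard_normal((H, V, D))
    k = rng.standard_normal((H, A, D))
    v = rng.standard_normal((H, A, D))
    # Hypothetical widths: local (lip-sync-like), medium, and near-global.
    out, w = kernel_attention(q, k, v, sigmas=[0.02, 0.2, 2.0])
    print(out.shape, w.shape)
```

In this sketch, spreading `sigmas` from small to large is one way to read the "progressive" bias described in the abstract: with content logits held equal, the narrowest head concentrates almost all of its weight on the temporally aligned audio position, while the widest head attends nearly uniformly across the conversation.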

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
