2602.04913v1 Feb 04, 2026 cs.LG

A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model

Cong Huang (citations: 12, h-index: 3)

Kai Chen (citations: 443, h-index: 4)

Hangjie Yuan (citations: 214, h-index: 5)

Xiaolin Hu (citations: 58, h-index: 5)

Xinzhu Sang (citations: 107, h-index: 5)

Binbin Yan (citations: 304, h-index: 10)

Zhou Yu (citations: 17, h-index: 1)

Original Abstract

Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that connect independent modules. These pipelines are often plagued by accumulated errors, high latency, and poor real-time performance. Lacking access to the underlying conversational context, these pipelines inherently prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A$^2$-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset designed to align semantic intent with expressive facial dynamics within a QA format. By leveraging deep semantic understanding, A$^2$-LLM generates emotionally rich facial movements beyond simple lip-synchronization. Experimental results demonstrate that our system achieves superior emotional expressiveness while maintaining real-time efficiency (500 ms latency, 0.7 RTF).
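The abstract reports efficiency as a real-time factor (RTF) without defining it. Under the conventional definition, RTF is the compute time divided by the duration of the generated media, so a value below 1 means generation runs faster than playback. The sketch below illustrates that arithmetic only; the function names and the 7 s / 10 s figures are hypothetical, chosen to reproduce a 0.7 RTF, and the paper's exact measurement protocol is not stated here. The 500 ms latency figure is a separate measurement that this sketch does not model.

```python
def real_time_factor(processing_seconds: float, media_seconds: float) -> float:
    """Ratio of compute time to the duration of the generated audio/motion.

    Conventional definition; the paper's exact protocol is an assumption.
    """
    if media_seconds <= 0:
        raise ValueError("media_seconds must be positive")
    return processing_seconds / media_seconds


def is_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means generation outpaces playback."""
    return rtf < 1.0


# Hypothetical numbers: 7 s of compute to produce 10 s of output.
rtf = real_time_factor(processing_seconds=7.0, media_seconds=10.0)
print(f"RTF = {rtf:.2f}, real time: {is_real_time(rtf)}")
```

Read this way, an RTF of 0.7 leaves roughly 30% headroom per second of output before generation would fall behind playback.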
