2603.16936v1 Mar 14, 2026 cs.CV

TDMM-LM: Bridging Facial Understanding and Animation via Language Models

Zhuoran Li

Luchuan Song

Jason J. Corso

Haiyang Liu

Zhenchao Jin

Yolo Yunlong Tang

Zichong Xu

Susan Liang

Jing Bi

Chenliang Xu

Abstract

Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt, parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that, in this setting, language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
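
The abstract mentions quantized motion tokens but does not say how they are produced. As a rough illustration only, the sketch below shows one generic way to turn per-frame parameter vectors into discrete tokens: a nearest-neighbor lookup against a fixed codebook. The codebook values, the two-dimensional frames, the `<motion_k>` token format, and all function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one frame of facial parameters (hypothetical 2-D toy vectors here)


def squared_distance(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Frame], codebook: Sequence[Frame]) -> List[int]:
    """Map each frame to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Frame]) -> List[List[float]]:
    """Recover an approximate parameter sequence from token indices."""
    return [list(codebook[t]) for t in tokens]


def to_token_string(tokens: Sequence[int]) -> str:
    """Render token indices as text a language model could read or emit (assumed format)."""
    return " ".join(f"<motion_{t}>" for t in tokens)


# Toy data: a 3-entry codebook and 3 frames (illustrative values, not real facial parameters).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = quantize(frames, codebook)
token_string = to_token_string(tokens)
reconstructed = dequantize(tokens, codebook)
```

In this toy setup, a Language2Motion-style model would emit a token string like `token_string`, which `dequantize` maps back to parameter vectors, and a Motion2Language-style model would read such a string as input. A learned codebook (for example, one trained as part of a vector-quantized autoencoder) would replace the hand-written one in practice.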
