2409.14993v3 Sep 23, 2024 cs.AI

Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification

Xin Wang (Citations: 1,101; h-index: 17)

Yuwei Zhou (Citations: 291; h-index: 10)

Bin Huang (Citations: 382; h-index: 5)

Hong Chen (Citations: 1,817; h-index: 22)

Wenwu Zhu (Citations: 3,672; h-index: 30)

Original Abstract

Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. Particularly, two dominant families of techniques have emerged: i) Multi-modal large language models (LLMs) demonstrate impressive ability for multi-modal understanding; and ii) Diffusion models exhibit remarkable multi-modal powers in terms of multi-modal generation. Therefore, this paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of both multi-modal LLMs and diffusion models respectively, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we explore the emerging efforts toward unified models for understanding and generation. To achieve the unification of understanding and generation, we investigate key designs including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We then introduce several strategies for unified models, analyzing their potential advantages and disadvantages. In addition, we summarize the common datasets widely used for multi-modal generative AI pretraining. Last but not least, we present several challenging future research directions which may contribute to the ongoing advancement of multi-modal generative AI.

6 Citations

0 Influential

15 Altmetric

81.0 Score
